atlassian-api / atlassian-python-api

Atlassian Python REST API wrapper
https://atlassian-python-api.readthedocs.io
Apache License 2.0

[Confluence] Get all paginated CQL search results #1127

Open p-rinnert opened 1 year ago

p-rinnert commented 1 year ago

Hey,

I am using Confluence Cloud and have a CQL query that intentionally returns many results (>1000) in our Confluence space. The response of the Confluence API to my Confluence.cql() call is paginated: the first 250 items (which seems to be the maximum for our Cloud instance) are contained in ['results'], and the next 250 can be retrieved by following the link given in ['_links']['next']. I was able to use Confluence._get_paged() in my code to retrieve all results, but it seems that Confluence.cql() should (or at least could) return all results directly.

This is my solution as a minimal example:

from atlassian import Confluence
import itertools

confluence = Confluence(url="https://MYCOMPANY.atlassian.net/",
                        username="name@company.com",
                        password="password")

cql_query = 'type=page'  # CQL query matching all pages, i.e. many results

response = confluence.cql(cql_query, limit=250)  # first 250 results plus a link to the next 250

url_next = response.get('_links', {}).get('next')  # relative URL of the next page, if any

if url_next is not None:
    # Chain the first page with the generator that follows the remaining pages
    results_generator = itertools.chain(response.get('results', []),
                                        confluence._get_paged(url=url_next))
    results = list(results_generator)
else:
    results = response.get('results', [])

Is there another way to get all results directly that I have not seen yet? Or a more efficient or elegant one? Confluence._get_paged() seems to be built for exactly this use case but is not integrated into Confluence.cql(). Could it be? I am new to the topic and therefore not sure about the implications or the expected behavior.

Thanks in advance, Paul

Spacetown commented 1 year ago

The limit needs to be removed from the cql arguments and the method should call _get_paged instead of a normal get. Getting the whole list at once is not the intention of paging. While you are iterating over the first results, the server can respond to other requests, and the server load is not as high as when you retrieve all results at once.
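To illustrate the point about lazy paging, here is a minimal sketch that does not touch the real library. `iter_results`, `fake_fetch`, and the `/search?page=N` URLs are all made up for illustration; the only thing taken from this thread is the response shape with `results` and `_links.next`. The generator requests a page only when the consumer actually needs it.

```python
import itertools


def iter_results(fetch_page, first_url):
    """Yield results one by one, requesting the next page only when needed."""
    url = first_url
    while url is not None:
        page = fetch_page(url)
        yield from page.get("results", [])
        # A missing "next" link means this was the last page
        url = page.get("_links", {}).get("next")


# In-memory stand-in for the server (hypothetical URLs and data)
FAKE_PAGES = {
    "/search?page=1": {"results": [1, 2, 3], "_links": {"next": "/search?page=2"}},
    "/search?page=2": {"results": [4, 5, 6], "_links": {"next": "/search?page=3"}},
    "/search?page=3": {"results": [7], "_links": {}},
}
requested = []  # records which pages were actually fetched


def fake_fetch(url):
    requested.append(url)
    return FAKE_PAGES[url]


# Taking only four items triggers two requests, not three
first_four = list(itertools.islice(iter_results(fake_fetch, "/search?page=1"), 4))
```

Wrapping the generator in `list(...)` would still fetch everything, so a caller who needs a full report can opt in, while other callers keep the batch-by-batch behavior.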

p-rinnert commented 1 year ago

Thanks for your reply, Michael. I agree that in an automated setting, paging as you describe it (handling the first batch of results before continuing with the second) might make sense, also to distribute load on the Atlassian servers. In my case I expected a complete list of results for my report and only got the first 250 (which is the maximum limit in our Confluence Cloud instance).

For my use case I changed the Confluence.cql() function at lines 2407ff (https://github.com/atlassian-api/atlassian-python-api/blob/master/atlassian/confluence.py#L2407) as follows, so that it returns a response containing all results of a query (if neither start nor limit is set):

        try:
            response = self.get("rest/api/search", params=params)
            if start is None and limit is None:
                # If no limit or start is defined, get all results of query
                all_results = list(self._get_paged(url="rest/api/search", params=params))
                response["results"] = all_results
                response["limit"] = len(all_results)
                response["size"] = len(all_results)
                response["totalSize"] = len(all_results)
                # Drop the "next" link if present; a plain del fails when there is none
                response.get("_links", {}).pop("next", None)
        except HTTPError as e:
            if e.response.status_code == 400:
                raise ApiValueError("The query cannot be parsed", reason=e)
            raise

Spacetown commented 1 year ago

It should be enough to replace the first get with _get_paged.
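A rough sketch of what that suggestion could look like, using a stand-in class rather than the real one. `StubClient` and its simplified `_get_paged` are hypothetical; they only mimic the behavior described in this thread, where `_get_paged` yields the items of every page. The actual method in the library takes more arguments and handles errors.

```python
class StubClient:
    """Hypothetical stand-in for the Confluence client, not the real library class."""

    def __init__(self, pages):
        self._pages = pages  # list of lists, one inner list per server page

    def _get_paged(self, url, params=None):
        # Simplified: yield every item of every page, one page at a time
        for page in self._pages:
            yield from page

    def cql(self, cql, expand=None):
        params = {"cql": cql}
        if expand is not None:
            params["expand"] = expand
        # Hand back the generator instead of a single page's response dict
        return self._get_paged("rest/api/search", params=params)


client = StubClient([["page-a", "page-b"], ["page-c"]])
all_titles = list(client.cql("type=page"))
```

Note that in this sketch cql() returns an iterator of results rather than a response dict, so existing callers that read response["results"] would need to change.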

cforce commented 1 year ago

By the way, I had hoped that start and limit would actually work as well, but they don't. The start seems to be ignored, because I always get back only the first 100/200 items.

cql_query = 'lastModified >= "{updatedsince}" and type="page" and space = {space}'.format(
    updatedsince=updated_since, space=space_str, status=status)
results = confluence.cql(cql=cql_query, start=start, limit=limit,
                         expand='body.storage,history,space,version')
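One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that the Cloud search endpoint pages with an opaque cursor carried in ['_links']['next'] rather than with a start offset. A quick way to check is to look at the query string of the next link. The helper `next_page_params` and the sample response below are made up for illustration.

```python
from urllib.parse import parse_qs, urlsplit


def next_page_params(response):
    """Return the query parameters of the response's next link, or {} if there is none."""
    next_link = response.get("_links", {}).get("next")
    if not next_link:
        return {}
    query = urlsplit(next_link).query
    # parse_qs maps each key to a list of values; keep the first value of each
    return {key: values[0] for key, values in parse_qs(query).items()}


# Made-up response; a real one would come from confluence.cql(...)
sample_response = {
    "results": [],
    "_links": {"next": "/rest/api/search?cql=type%3Dpage&limit=100&cursor=abc123"},
}
params = next_page_params(sample_response)
uses_cursor = "cursor" in params
```

If the next link of a real response carries a cursor instead of a start value, following that link (as in the first post) is probably the more reliable way to reach later pages.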