Confluence get_all_pages_by_label does not respect `start` parameter, so pagination is broken

mikewyer commented 3 years ago

To reproduce: Attach a label to several pages. In our case, 54 pages.

batch1 = confluence.get_all_pages_by_label( "test_label", start=0, limit=50)
batch2 = confluence.get_all_pages_by_label( "test_label", start=50, limit=50)
assert(len(batch1) == 50) # true
assert(len(batch2) == 4) # false len(batch2) == 50

Whatever the setting for the start parameter, the function always returns the first limit number of results.

Our workaround is to keep increasing the limit if len(batch) == limit until all the results fit in a single batch. This won't work for labels on hundreds of pages.

jordangrodecki commented 2 years ago

Start is deprecated as a parameter - see here - https://developer.atlassian.com/cloud/confluence/change-notice-moderize-search-rest-apis/

It's been replaced by a cursor parameter which will define the 'next' page of results after the ones you have received.

The cursor param is included in a URL which is returned within the JSON of any paginated request, and can be supplied back to the API to pick up where you left off.

See my monkeypatched method below, which will add a 'cursor' value to every dict returned. The cursor can then be supplied back to get_all_pages_by_label to pick up the following page.

I am probably going to further monkeypatch to loop iteratively to get all values within the function, but others may not like that.

Note - the code here is awkward and very much slapped together. It can be done more elegantly.

https://github.com/atlassian-api/atlassian-python-api/issues/796 - This is a similar issue solved somewhat more elegantly.


def get_all_pages_by_label(self, label, start=0, limit=50, cursor=0):
        """
        Get all page by label
        :param label:
        :param start: OPTIONAL: The start point of the collection to return. Default: None (0).
        :param limit: OPTIONAL: The limit of the number of pages to return, this may be restricted by
                      fixed system limits. Default: 50
        :return:
        """
        url = "rest/api/content/search"
        params = {}
        if label:
            params["cql"] = 'type={type} AND label="{label}"'.format(type="page", label=label)
        if start:
            params["start"] = start
        if limit:
            params["limit"] = limit
        if cursor:
            params["cursor"] = cursor

        try:
            print()
            response = self.get(url, params=params)
        except HTTPError as e:
            if e.response.status_code == 400:
                raise ApiValueError("The CQL is invalid or missing", reason=e)

            raise

        text = response['_links']['next']
        cursor_loc= text.find('cursor') + 7
        cursor_string = text[cursor_loc:text[cursor_loc:].find('&')+cursor_loc]
        results_response = response.get("results")
        for item in results_response: item['next_cursor']= cursor_string

        return response.get("results")

Spacetown commented 2 years ago

In the BB modules I've introduced a function: https://github.com/atlassian-api/atlassian-python-api/blob/6cd6dd047b09fad1f43f4f2d42a485c96b7d1e84/atlassian/bitbucket/cloud/base.py#L35-L37 https://github.com/atlassian-api/atlassian-python-api/blob/6cd6dd047b09fad1f43f4f2d42a485c96b7d1e84/atlassian/bitbucket/server/base.py#L20 This function is used in the paged APIs and returns a generator to iterate throug the result. Maybe it's possible to merge this functions and add it to rest_clipent.py. The problem is to implement a generic way to get the next url for every atlassion product.

jordangrodecki commented 2 years ago

Going to spend a bit more time on it today and see if there's a generic solution for all stuff confluence side and compare to what you've done for BB.

Edit: Looks like the response is not the same for BB as it is for confluence - there's no pages param here. Going to experiment and see where cursor applies and see if we can make it more generic, or whether it is going to require edits to every method

For now - a cleaner monkey patch.

def get_all_pages_by_label(self, label, start=0, limit=50):
        """
        Get all page by label
        :param label:
        :param start: OPTIONAL: The start point of the collection to return. Default: None (0).
        :param limit: OPTIONAL: The limit of the number of pages to return, this may be restricted by
                      fixed system limits. Default: 50
        :return:
        """
        url = "rest/api/content/search"
        params = {}
        if label:
            params["cql"] = 'type={type} AND label="{label}"'.format(type="page", label=label)
        if start:
            params["start"] = start
        if limit:
            params["limit"] = limit

        try:
            response = self.get(url, params=params)
        except HTTPError as e:
            if e.response.status_code == 400:
                raise ApiValueError("The CQL is invalid or missing", reason=e)
            raise

        while response.get('_links').get('next'):
            response_results = response.get('results')
            params['cursor'] = response.get('_links').get('next')
            response = self.get(response.get('_links').get('next'), params=params)
            response_results = response_results + response.get('results')

        if response_results:
            results = response_results
        else: 
            results = response.get('results')

        return results

ZaxR commented 1 year ago

While a bit hacky, my temporary solution to this is to just set limit really, really high. In my case, I'm using labels to summarize pages with a label in a "page properties report." Confluence only supports 3,000 pages in that report, so I just set that as the limit:

pages_with_label = confluence.get_all_pages_by_label(label, limit=3000)

Note: I only tested this up to 75 pages, so maybe there's a limit for a single call, but I didn't see such a limit documented anywhere.

jordangrodecki commented 1 year ago

There definitely is a limit - I forget what it is but it is higher than 75 but lower than 3000 :)

AndreasWintherMoen commented 2 months ago

How is this not fixed? The start parameter has been deprecated for 3 years!

atlassian-api / atlassian-python-api

Confluence get_all_pages_by_label does not respect `start` parameter, so pagination is broken #820