dmarx / psaw

Python Pushshift.io API Wrapper (for comment/submission search)
BSD 2-Clause "Simplified" License
361 stars, 53 forks

What is size and limit? What decides how many results I get? #79

Closed: FerusAndBeyond closed this 3 years ago

FerusAndBeyond commented 4 years ago

If I do a query such as gen = api.search_submissions(score=">100", limit=1000) then I get 100 results. How do I get as many as I specify?

reagle commented 4 years ago

Coincidence! I asked the same thing about Pushshift here: https://www.reddit.com/r/pushshift/comments/ih66b8/difference_between_size_and_limit_and_are_they/

Jabb0 commented 4 years ago

Hi, I tried a simple call to get 1000 entries (limit=1000) from a subreddit, list(api.search_submissions(subreddit="worldnews", limit=1000)), and like @FerusAndBeyond I only get 100 results before it stops.

I have investigated the source code and found a possible issue. This might be related to #63 and #47 as well.

PushshiftAPI.py, lines 197 to 218

def _handle_paging(self, url):
    limit = self.payload.get('limit', None)
    #n = 0
    while True:
        if limit is not None:
            if limit > self.max_results_per_request:
                self.payload['limit'] = self.max_results_per_request
                limit -= self.max_results_per_request
            else:
                self.payload['limit'] = limit
                limit = 0
        elif 'ids' in self.payload:
            limit = 0
            if len(self.payload['ids']) > self.max_results_per_request:
                err_msg = "When searching by ID, number of IDs must be fewer than the max number of objects in a single request ({})."
                raise NotImplementedError(err_msg.format(self.max_results_per_request))
        self._add_nec_args(self.payload)

        yield self._get(url, self.payload)

        if (limit is not None) & (limit == 0):
            return

This tries to perform as many requests as needed to retrieve all of the desired data. The meaning of "limit" in PSAW differs from its meaning in the Pushshift API: PSAW issues multiple fetches to get close to the desired total, whereas the Pushshift API treats "limit" only as a suggestion for the current request. PSAW therefore splits the requested total into batches of at most "max_results_per_request" entries and passes each batch size to the Pushshift API as "limit".

The issue is that PSAW never checks whether the API actually returned "max_results_per_request" entries. The current default is 1000, which used to be the maximum the API would return per request; now that maximum is 100. This means that the API returns only 100 entries while PSAW assumes 1000 were returned.
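To make the mismatch concrete, here is a minimal offline sketch. It assumes a hypothetical fake_api stand-in that caps every response at 100 items and pages by offset (the real API pages differently), and a naive_paging function that only mirrors the limit bookkeeping of _handle_paging; neither is PSAW code.

```python
def fake_api(offset, limit, server_cap=100, total=5000):
    """Hypothetical stand-in for the Pushshift endpoint.

    Returns at most server_cap items no matter how large limit is.
    """
    n = max(0, min(limit, server_cap, total - offset))
    return list(range(offset, offset + n))


def naive_paging(limit, max_results_per_request):
    """Mirror the limit bookkeeping of _handle_paging.

    Every request is assumed to come back completely filled.
    """
    results = []
    while True:
        if limit > max_results_per_request:
            request_limit = max_results_per_request
            limit -= max_results_per_request
        else:
            request_limit = limit
            limit = 0
        # The server may return fewer items than request_limit; nobody checks.
        results.extend(fake_api(len(results), request_limit))
        if limit == 0:
            return results


# Default of 1000 per request: a single request, capped at 100 by the server.
print(len(naive_paging(1000, max_results_per_request=1000)))
# Matching the server cap: ten requests of 100 each.
print(len(naive_paging(1000, max_results_per_request=100)))
```

With the default of 1000 the loop believes one request satisfied the whole limit and stops after the first capped response.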

My suggestion: check whether the API returned the expected number of entries and, if not, increase the "limit" variable by the missing amount. I already have some code for that and will post it tomorrow. For now, setting api = PushshiftAPI(max_results_per_request=100) will solve the issue.
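A rough sketch of what such a check could look like, not the actual patch: the remaining count is reduced by the number of entries that really came back instead of by the number requested. The names checked_paging and fake_api are hypothetical, and the offset-based paging is an assumption made only so the example runs offline.

```python
def fake_api(offset, limit, server_cap=100, total=5000):
    """Hypothetical stand-in for the API: never returns more than server_cap items."""
    n = max(0, min(limit, server_cap, total - offset))
    return list(range(offset, offset + n))


def checked_paging(fetch, limit, max_results_per_request=1000):
    """Yield batches until limit items were actually received.

    fetch(offset, size) is a hypothetical callable returning a list of items.
    """
    remaining = limit
    offset = 0
    while remaining > 0:
        batch = fetch(offset, min(remaining, max_results_per_request))
        if not batch:
            # Nothing left on the server; stop instead of looping forever.
            break
        batch = batch[:remaining]
        yield batch
        # Count what was really returned, not what was asked for.
        remaining -= len(batch)
        offset += len(batch)


total = sum(len(batch) for batch in checked_paging(fake_api, limit=1000))
print(total)
```

Because the loop counts received entries, it keeps requesting until the limit is met even when max_results_per_request is larger than what the server is willing to return.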

Also: Why is there (limit is not None) & (limit == 0) and not (limit is not None) and (limit == 0)?
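As far as I can tell the two forms give the same result in this particular expression, because both operands are plain booleans. The difference is that & always evaluates both sides while and short-circuits. A small sketch with hypothetical helper names shows where that starts to matter:

```python
def done_and(limit):
    # Short-circuits: the right side is skipped when limit is None.
    return (limit is not None) and (limit == 0)


def done_bitand(limit):
    # Bitwise AND on two bools: both sides are always evaluated.
    return (limit is not None) & (limit == 0)


def positive_and(limit):
    return (limit is not None) and (limit > 0)


def positive_bitand(limit):
    # With limit=None this still evaluates None > 0, which raises in Python 3.
    return (limit is not None) & (limit > 0)


# Same answers for the expression used in _handle_paging.
for value in (None, 0, 5):
    assert done_and(value) == done_bitand(value)

# An ordering comparison exposes the missing short-circuit.
print(positive_and(None))
try:
    positive_bitand(None)
except TypeError as exc:
    print("TypeError:", exc)
```

So the & version works here only because None == 0 is a valid comparison; and is the more conventional and more robust choice.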

Without "limit" PSAW will ignore "max_results_per_request" and just return whatever the API defaults to. EDIT: Not the case. This is handled already.

I hope my analysis helps :)