Closed FerusAndBeyond closed 3 years ago
Coincidence! I asked the same thing about Pushshift here: https://www.reddit.com/r/pushshift/comments/ih66b8/difference_between_size_and_limit_and_are_they/
Hi,
I tried a simple call to get 1000 entries (limit=1000) from a subreddit:
list(api.search_submissions(subreddit="worldnews", limit=1000))
Like @FerusAndBeyond, I only get 100 results and then it stops.
I have investigated the source code and found a possible issue. This might be related to #63 and #47 as well.
PushshiftAPI.py lines 197 to 218
def _handle_paging(self, url):
    limit = self.payload.get('limit', None)
    #n = 0
    while True:
        if limit is not None:
            if limit > self.max_results_per_request:
                self.payload['limit'] = self.max_results_per_request
                limit -= self.max_results_per_request
            else:
                self.payload['limit'] = limit
                limit = 0
        elif 'ids' in self.payload:
            limit = 0
            if len(self.payload['ids']) > self.max_results_per_request:
                err_msg = "When searching by ID, number of IDs must be fewer than the max number of objects in a single request ({})."
                raise NotImplementedError(err_msg.format(self.max_results_per_request))
        self._add_nec_args(self.payload)
        yield self._get(url, self.payload)
        if (limit is not None) & (limit == 0):
            return
This tries to perform as many requests as needed to retrieve all of the desired data. "limit" means something different in PSAW than in the Pushshift API: PSAW issues multiple requests to get close to the desired limit, whereas the Pushshift API treats the value only as a suggestion for the current request. PSAW therefore splits the desired limit into batches of at most "max_results_per_request" entries and passes each batch size to the Pushshift API as "limit".
The issue is that PSAW never checks whether "max_results_per_request" entries are actually returned by the API. The current default is 1000, which used to be the maximum the API would return per request. Now the maximum is 100. This means the API returns only 100 entries while PSAW assumes 1000 were returned, so with limit=1000 the loop counts the limit as reached after a single request and stops.
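The effect can be reproduced without touching the network. This is a minimal sketch, not PSAW's code: `fake_api` is a hypothetical stand-in for the Pushshift endpoint that silently caps every response at 100 entries, and `naive_paging` only mirrors the limit arithmetic of `_handle_paging` quoted above.

```python
SERVER_CAP = 100  # assumed server-side cap per response, as described above


def fake_api(limit):
    """Hypothetical stand-in for one request: returns at most SERVER_CAP items.

    It returns the same dummy items on every call; only the counts matter here.
    """
    return list(range(min(limit, SERVER_CAP)))


def naive_paging(limit, max_results_per_request):
    """Mirror the limit bookkeeping of _handle_paging: trust the requested size."""
    results = []
    while True:
        batch = min(limit, max_results_per_request)
        limit -= batch  # assumes `batch` entries come back; this is never verified
        results.extend(fake_api(batch))
        if limit == 0:
            return results


print(len(naive_paging(1000, max_results_per_request=1000)))  # 100
print(len(naive_paging(1000, max_results_per_request=100)))   # 1000
```

With the batch size left at 1000 the loop exits after one request holding 100 items. With a batch size of 100 it makes ten requests and reaches the desired total.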
My suggestion: check whether the API returned the expected number of entries and, if not, increase the "limit" variable by the missing amount. I already have some code for that and will post it tomorrow.
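One way such a check could look, as a rough sketch rather than the actual patch: count what each request really returned and subtract that from the remaining limit. The names `fetch_all`, `fetch_page`, and `make_capped_source` are hypothetical and not part of PSAW. The stateful iterator stands in for whatever mechanism selects the next page in the real library.

```python
from itertools import islice


def fetch_all(fetch_page, limit, max_results_per_request=1000):
    """Collect up to `limit` items, counting what each request really returned."""
    results = []
    remaining = limit
    while remaining > 0:
        page = fetch_page(min(remaining, max_results_per_request))
        if not page:
            break  # nothing came back: the source is exhausted, so stop
        page = page[:remaining]  # guard against a response larger than requested
        results.extend(page)
        remaining -= len(page)  # subtract the real count, not the requested one
    return results


def make_capped_source(total, cap=100):
    """Hypothetical source holding `total` items that serves at most `cap` per call."""
    items = iter(range(total))

    def fetch_page(requested):
        return list(islice(items, min(requested, cap)))

    return fetch_page


print(len(fetch_all(make_capped_source(5000), limit=1000)))  # 1000
print(len(fetch_all(make_capped_source(250), limit=1000)))   # 250
```

Because the loop tracks the number of entries received, it keeps working if the server-side cap changes again, and it stops cleanly when fewer entries exist than were asked for.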
For now, setting
api = PushshiftAPI(max_results_per_request=100)
works around the issue.
Also: Why is there (limit is not None) & (limit == 0)
and not (limit is not None) and (limit == 0)?
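For what it's worth, with two `bool` operands `&` produces the same truth value as `and`, so the check behaves the same either way. The difference is that `&` always evaluates both sides, while `and` short-circuits. A small self-contained illustration in plain Python (not PSAW code; `stop_bitwise`, `stop_logical`, and `probe` are made-up helper names):

```python
def stop_bitwise(limit):
    return (limit is not None) & (limit == 0)


def stop_logical(limit):
    return (limit is not None) and (limit == 0)


# Same result for the values the paging loop can see.
for limit in (None, 0, 5):
    assert stop_bitwise(limit) == stop_logical(limit)

# The difference is evaluation order: record which operands get evaluated.
calls = []


def probe(name, value):
    calls.append(name)
    return value


probe("left", False) and probe("right", True)
print(calls)  # ['left']

calls.clear()
probe("left", False) & probe("right", True)
print(calls)  # ['left', 'right']
```

Here `limit == 0` is harmless to evaluate when `limit` is None, so the `&` version works, but `and` is the more conventional spelling for a logical test.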
Without "limit" PSAW will ignore "max_results_per_request" and just return whatever the API defaults to. EDIT: Not the case. This is handled already.
I hope my analysis helps :)
If I do a query such as
gen = api.search_submissions(score=">100", limit=1000)
then I get 100 results. How do I get as many as I specify?