Github source doesn't iterate through results

InQuest / ThreatIngestor

Extract and aggregate threat intelligence.

https://inquest.readthedocs.io/projects/threatingestor/

GNU General Public License v2.0


Github source doesn't iterate through results #47

needmorecowbell closed this issue 5 years ago

needmorecowbell commented 5 years ago

The top of the response shows the total number of matching entries, but that is not how many are returned: results are paginated. Adding the parameter per_page=100 requests the maximum page size, and page= steps through the remaining pages. At a minimum we should fetch the largest page available; whether we really want all the results when the query is vague is an open question.
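To illustrate the pagination described above, here is a minimal sketch and not ThreatIngestor's actual code. It assumes a hypothetical `fetch_page(page, per_page)` callable that performs the HTTP request and returns the parsed JSON body, and it assumes that body carries an `items` list alongside the total count, as GitHub's search responses do. The `max_pages` cap is an illustrative guard against vague queries.

```python
from typing import Callable, Dict, Iterator

# Hypothetical: fetch_page(page, per_page) returns one parsed search response,
# e.g. {"total_count": 250, "items": [...]}.
FetchPage = Callable[[int, int], Dict]


def iter_search_results(
    fetch_page: FetchPage, per_page: int = 100, max_pages: int = 10
) -> Iterator[Dict]:
    """Yield items page by page until a short page or max_pages is reached."""
    for page in range(1, max_pages + 1):
        items = fetch_page(page, per_page).get("items", [])
        yield from items
        # A page shorter than per_page means there is nothing left to fetch.
        if len(items) < per_page:
            break
```

Passing `max_pages=1` gives the "just the largest first page" behavior, while a higher cap walks further into the results.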

rshipp commented 5 years ago

Good catch. There are two cases to consider here:

First run: We don't have a saved_state. For Twitter and RSS, we can't feasibly fetch all results, so we just use whatever the "first page" looks like. We should do the same for GitHub search results. Increasing the per_page number is a good idea.
Subsequent runs: We have a saved_state, and we should make a best effort to process every search result back to that state, even if it's on a different page.
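The two cases could be sketched roughly as follows. This is only an illustration under stated assumptions, not the fix that was merged: `fetch_page` is a hypothetical callable returning the parsed JSON for one page, results are assumed to arrive newest first, and `saved_state` is assumed to hold the `id` of the newest item seen on the previous run (the real source may store something else, such as a timestamp).

```python
from typing import Callable, Dict, List, Optional


def collect_new_results(
    fetch_page: Callable[[int, int], Dict],  # hypothetical page fetcher
    saved_state: Optional[int] = None,       # assumed: id of newest item already seen
    per_page: int = 100,
    max_pages: int = 10,                     # illustrative best-effort cap
) -> List[Dict]:
    """Return unseen items: the first page on a first run, else back to saved_state."""
    new_items: List[Dict] = []
    for page in range(1, max_pages + 1):
        items = fetch_page(page, per_page).get("items", [])
        for item in items:
            # Subsequent run: stop as soon as we reach the item we saw last time.
            if saved_state is not None and item["id"] == saved_state:
                return new_items
            new_items.append(item)
        # First run: only the first page. Otherwise stop when results run out.
        if saved_state is None or len(items) < per_page:
            break
    return new_items
```

If `saved_state` is never found within `max_pages`, the function simply returns what it has collected, which matches the "best effort" framing.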

rshipp commented 5 years ago

Closed by #53