Netflix-Skunkworks / Scumblr

Web framework that allows performing periodic syncs of data sources and performing analysis on the identified results
Apache License 2.0
2.64k stars, 317 forks

Results from Google CSE appear artificially limited #173

Closed: jwilczek closed this issue 7 years ago

jwilczek commented 7 years ago

I am using a Google CSE. When entering a query on the Google CSE site directly, I get 14,000 results. When using the Google Search query from inside of Scumblr I get around 47 results.

Is there a reason for this discrepancy?

jwilczek commented 7 years ago

Updated the title of this issue to reflect the fact that I'm seeing the same issue with Twitter searches.

jwilczek commented 7 years ago

Updating...

In some instances, I get 0 results from new Google Search Tasks, despite my CSE on Google showing 100,000s.

ahoernecke commented 7 years ago

Hi @jwilczek, they are both limited to 100 because that's the max results for a single page. The providers could be updated to support pagination. The limit was originally added to help avoid exceeding rate limits, since each page counts as an API call.

Re: your most recent comment, can you provide an example query that's returning 0 via Scumblr but not against the CSE directly? Feel free to ping me directly on gitter if that would be preferred.
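For illustration, the pagination idea mentioned above might look roughly like the sketch below. This is not Scumblr's actual provider code: `collect_results` is a hypothetical helper, and the block passed to it stands in for whatever API request a provider would really make. It keeps requesting pages until the cap is reached or a page comes back short.

```ruby
# Hypothetical helper (not part of Scumblr): gathers up to max_results items
# by asking the caller-supplied block for one page at a time.
def collect_results(max_results, page_size)
  results = []
  start = 1 # assumed 1-based start index for the first result of a page
  while results.size < max_results
    wanted = [page_size, max_results - results.size].min
    page = yield(start, wanted) # each yield stands for one API call
    results.concat(page.first(wanted))
    break if page.size < wanted # a short page means nothing is left
    start += wanted
  end
  results
end

# Usage with a fake, in-memory "search index" of 47 items.
fake_index = (1..47).to_a
calls = 0
found = collect_results(100, 10) do |start, count|
  calls += 1
  fake_index[start - 1, count] || []
end
puts "fetched #{found.size} results in #{calls} calls"
```

Every page costs one request, so a higher cap trades directly against the rate limit concern described above.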

jwilczek commented 7 years ago

So, does this mean that if I want to see new results as they happen, I should configure my CSE to show results sorted by time (not relevance)? It seems like for a result to be picked up by Scumblr it has to be in the top 100. Is this correct?

Also, I'll message you with specifics in gitter when I get out of my 9:00 meeting.

ahoernecke commented 7 years ago

For reference, the Google search provider limits max results to 100. This can be changed in the provider by adjusting this line to allow higher values:

@max_results = @max_results > 100 ? 100 : @max_results

Note: 100 results counts as 10 API calls against Google's quota. Each additional 10 results counts as one more query, because the Google API returns at most 10 results per request.
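To make the quota arithmetic concrete, here is a small sketch. It assumes 10 results per request, as in the note above. `PAGE_SIZE`, `clamp_max_results`, and `api_calls_needed` are illustrative names, not identifiers from the Scumblr codebase.

```ruby
# Assumption: each API request returns at most 10 results.
PAGE_SIZE = 10
MAX_RESULTS_CAP = 100

# Same behaviour as the provider line quoted above, written as a method.
def clamp_max_results(requested, cap = MAX_RESULTS_CAP)
  requested > cap ? cap : requested
end

# Number of requests needed to fetch max_results items, rounding up
# because a partially filled page still costs one request.
def api_calls_needed(max_results, page_size = PAGE_SIZE)
  (max_results.to_f / page_size).ceil
end

[47, 100, 250].each do |requested|
  effective = clamp_max_results(requested)
  puts "requested=#{requested} effective=#{effective} calls=#{api_calls_needed(effective)}"
end
```

Raising the cap therefore raises quota use in steps of one call per 10 results.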