Nearcyan / papers.day

GNU General Public License v3.0

Citations cannot be scraped from Google Scholar without quickly being blocked #1

Open Nearcyan opened 6 months ago

Nearcyan commented 6 months ago

The --google_scholar argument to scrape_abs.py enables the script to grab author citation counts from Google Scholar to display on the frontend.

Google blocks these requests soon after we make too many of them, and the current proxy implementation is not good enough to get around that.

If the proxy implementation is replaced with a better one, we can add citation counts across the entire system. It would also be nice to go back and scrape citations for all past papers, if possible.

michaelskyba commented 6 months ago

Is the blocking pattern known? If it is, e.g., a block after more than 10 requests in one minute, we could make 9 requests every 60 seconds to get a backlog going.
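
A minimal sketch of that kind of throttle, assuming the limit really is per-minute (the actual blocking pattern is not known in this thread). `WindowThrottle` and its parameters are hypothetical names for illustration, not part of scrape_abs.py, and the 9-per-60-seconds numbers are just the example figures above:

```python
import time
from collections import deque


class WindowThrottle:
    """Allow at most `max_requests` calls in any `window_seconds` span.

    Hypothetical helper: the 9 / 60 defaults are illustrative guesses,
    not a measured Google Scholar limit.
    """

    def __init__(self, max_requests=9, window_seconds=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window = window_seconds
        self.clock = clock      # injectable so the logic can be tested
        self.sleep = sleep
        self.sent = deque()     # timestamps of requests inside the window

    def _drop_expired(self, now):
        # Forget requests that have aged out of the window.
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()

    def wait(self):
        """Block until another request is allowed, then record it."""
        now = self.clock()
        self._drop_expired(now)
        if len(self.sent) >= self.max_requests:
            # Sleep until the oldest recorded request leaves the window.
            self.sleep(self.window - (now - self.sent[0]))
            now = self.clock()
            self._drop_expired(now)
        self.sent.append(now)
```

Calling `throttle.wait()` before each Scholar request would keep the scraper under the assumed limit. If the real limit turns out to be per-day or per-IP rather than per-minute, the numbers (or the whole approach) would need to change.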

Nearcyan commented 6 months ago

Nope, I didn't spend enough time on it to get that far.

Celestialchips commented 5 months ago

Are you running a proxy for scraping tasks to get around the bans?

Nearcyan commented 5 months ago

Google Scholar scraping is not currently active, and no proxy has been necessary for the arXiv scraping thus far.