Open Nearcyan opened 6 months ago
Is the blocking pattern known? If it's e.g. blocking after >10 requests in one minute, we could do 9 requests every 60 seconds to work through a backlog?
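To make the "stay just under the limit" idea concrete, here's a minimal sliding-window throttle sketch. The 9-per-60-seconds numbers are just the guess above, not a confirmed Google threshold, and `WindowThrottle` is a hypothetical helper, not something already in the repo:

```python
import time
from collections import deque

class WindowThrottle:
    """Allow at most `limit` requests in any sliding `window`-second span.

    The limit/window defaults are a guess at Google's threshold,
    not a known value.
    """
    def __init__(self, limit=9, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock   # injectable so the logic can be tested without sleeping
        self.sent = deque()  # timestamps of recent requests

    def wait_time(self):
        """Seconds to wait before the next request is allowed (0.0 if now)."""
        now = self.clock()
        # drop timestamps that have aged out of the window
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.limit:
            return 0.0
        return self.window - (now - self.sent[0])

    def record(self):
        """Call once per request actually sent."""
        self.sent.append(self.clock())
```

The scrape loop would then do `time.sleep(throttle.wait_time())` before each Scholar request and `throttle.record()` after it. Of course this only helps if the block really is a fixed-rate rule; if it's something fuzzier (fingerprinting, captchas), pacing alone won't be enough.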
Nope, I didn't spend enough time on it to get that far.
Are you running a proxy for scraping tasks to get around the bans?
Google Scholar scraping is not currently active, and no proxy has been necessary for the arxiv scraping thus far.
The `--google_scholar` argument to `scrape_abs.py` enables the script to grab author citation counts from Google Scholar to display on the frontend. Google quickly blocks these requests after we make too many, and the current proxy implementation is bad.
If it is replaced with a better one, we can add citation counts across the entire system. It would also be nice to go back and scrape citations for all past papers, if possible.
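For reference, a replacement could be as simple as round-robin rotation over a proxy pool, handing each Scholar request a fresh `requests`-style proxies mapping. The pool below is entirely made up (the real proxy source isn't specified here), and `next_proxies` is a hypothetical helper:

```python
from itertools import cycle

# Hypothetical proxy pool; where these actually come from
# (paid provider, self-hosted, etc.) is an open question.
PROXIES = [
    "http://proxy-a.example:8080",
    "http://proxy-b.example:8080",
    "http://proxy-c.example:8080",
]

_pool = cycle(PROXIES)

def next_proxies():
    """Return a requests-style proxies mapping, rotating through the
    pool so consecutive Scholar requests leave from different addresses."""
    proxy = next(_pool)
    return {"http": proxy, "https": proxy}

# Usage sketch:
#   requests.get(scholar_url, proxies=next_proxies(), timeout=10)
```

A fancier version would also drop proxies that start returning block pages and combine this with the request pacing discussed above, but round-robin plus a retry on failure would already be an improvement over the current implementation.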