Google Policy on Scraping Google Scholar

laucl commented 8 years ago

I know that Google Scholar imposes a query limit, but does it have any explicit policy prohibiting automated scraping of Google Scholar results? Applications like Harzing's Publish or Perish openly scrape Google Scholar and have been operating for years.

egavves commented 8 years ago

Yes, I also downloaded it yesterday and I have a problem. In my case I was blocked after only 45 queries, even with delays of 30 + T seconds between them (where T is a random number between 0 and 30). This is weird and very annoying! I will try some of the solutions proposed in the other threads.
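For anyone who wants to reproduce the 30 + T second delay scheme described above, here is a minimal sketch. The names `jittered_delay`, `run_with_delays`, and the `fetch` callback are hypothetical and not part of scholar.py; `fetch` stands in for whatever function actually issues a query.

```python
import random
import time


def jittered_delay(base=30.0, jitter=30.0):
    """Return a delay in seconds: base plus a uniform random amount in [0, jitter]."""
    return base + random.uniform(0.0, jitter)


def run_with_delays(queries, fetch, base=30.0, jitter=30.0, sleep=time.sleep):
    """Call fetch(query) for each query, sleeping a jittered delay between calls.

    `fetch` is a hypothetical callback supplied by the caller; `sleep` can be
    swapped out (e.g. for a list's append method) to inspect delays without waiting.
    """
    results = []
    for i, query in enumerate(queries):
        if i > 0:
            # No delay before the first query, only between consecutive ones.
            sleep(jittered_delay(base, jitter))
        results.append(fetch(query))
    return results
```

As reported above, delays like this did not prevent blocking after 45 queries, so treat it as a way to be polite rather than a way around the limit.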

guicoelho commented 8 years ago

Also got blocked.

Google’s mission is to organize the world’s information and make it universally accessible and useful. -> right

ckreibich commented 7 years ago

Apologies for the glacially slow response here, folks — one thing you can do to help with the query limit is to use the --cookie-file option. Just add something like --cookie-file ~/.scholar-cookies.txt to your command line. That way, Scholar knows that your sequence of requests forms a particular session, which actually looks more like a real browser session. In my experience, this still triggers rate limiting at some point, but the limit becomes more generous.
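To illustrate the idea behind a cookie file, here is a rough sketch using only the Python standard library. It is not a copy of what scholar.py does internally, and the helper names `build_opener_with_cookies` and `save_cookies` are made up for this example. The point is simply that cookies saved after one run are loaded again on the next, so successive runs look like one session.

```python
import http.cookiejar
import os
import urllib.request


def build_opener_with_cookies(cookie_path):
    """Return (opener, jar); the jar is pre-loaded from cookie_path if it exists."""
    jar = http.cookiejar.MozillaCookieJar(cookie_path)
    if os.path.exists(cookie_path):
        # Keep session cookies and expired ones too, so the session carries over.
        jar.load(ignore_discard=True, ignore_expires=True)
    # Requests made through this opener send and receive cookies via the jar.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def save_cookies(jar):
    """Write the jar back to the file it was created with."""
    jar.save(ignore_discard=True, ignore_expires=True)
```

You would call `build_opener_with_cookies(os.path.expanduser("~/.scholar-cookies.txt"))` at startup, make requests with `opener.open(...)`, and call `save_cookies(jar)` before exiting.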

pesho-ivanov commented 5 years ago

Could the US court ruling on legalizing scraping and forbidding interference with it (hiQ Labs v. LinkedIn, Sept 9, 2019) have implications for Google Scholar scraping? For example, by forcing Google Scholar to remove its scraping constraints (i.e. the query limits), as with LinkedIn.

P.S. Excuse me if the question is too general. Please let me know of a more suitable place for it if you know one.

wasified commented 5 years ago

Legally speaking, it's a matter of what your intent is. For example, if you want to 'clone' Google Scholar just by crawling it, that would obviously be a 'legal' problem. However, Scholar is a search engine, and replicating it by crawling is a futile, zero-sum venture that I think would cost more than the value it delivers. Plus, Google Scholar rate-limits very well, and the crawling might look more like a DDoS attack lol.

Also, I don't see why any legal proceeding would want 'query limitations' removed. Queries cost money to serve; lifting those limits would be like asking oil companies to give out free petrol.

In terms of general scraping, as long as you rate-limit appropriately you should be fine, I guess. I don't know much about the history of the case you mentioned, but citation managers routinely crawl Google Scholar. If you rate-limit to close to human-esque levels, I don't see why Google would cause a fuss. For example, Zotero https://libraryguides.missouri.edu/c.php?g=27928&p=172240 searches Scholar to check a file's metadata, and Slate Desktop http://slate.ink uses it to add a citation search engine to Word.

Lastly, Google Scholar looks like a dead project within Google. I think they care less about it than you do.


ckreibich / scholar.py

Google Policy on Scraping Google Scholar #71