WDscholia / scholia

Wikidata-based scholarly profiles
https://scholia.toolforge.org

Get user data fails for Google Scholar #2218

Open fnielsen opened 1 year ago

fnielsen commented 1 year ago

Describe the bug: Get user data fails for Google Scholar.

To Reproduce: Steps to reproduce the behavior:

python -m py.test --doctest-modules scholia/googlescholar.py

or

python -m scholia.googlescholar get-user-data 9cagBQYAAAAJ

Expected behavior: No error. Data should be returned.

Additional context: This also fails with tox.

fnielsen commented 1 year ago

The response from Google Scholar:

>>> response = requests.get(USER_URL, params={'user': user}, headers=HEADERS)
>>> response.content
b'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n<html DIR="LTR">\n<head><meta http-equiv="content-type" content="text/html; charset=utf-8"><meta name="viewport" content="initial-scale=1"><title>https://scholar.google.dk/citations?user=9cagBQYAAAAJ</title></head>\n<body style="font-family: arial, sans-serif; background-color: #fff; color: #000; padding:20px; font-size:18px;" onload="e=document.getElementById(\'captcha\');if(e){e.focus();} if(solveSimpleChallenge) {solveSimpleChallenge(,);}">\n<div style="max-width:400px;">\n<hr noshade size="1" style="color:#ccc; background-color:#ccc;"><br>\n<div style="font-size:13px;">\nVores systemer har registreret us\xc3\xa6dvanlig trafik fra dit computernetv\xc3\xa6rk. Pr\xc3\xb8v din anmodning senere <a href="#" onclick="document.getElementById(\'infoDiv\').style.display=\'block\';">Hvorfor er dette sket?</a><br><br>\n<div id="infoDiv" style="display:none; background-color:#eee; padding:10px; margin:0 0 15px 0; line-height:1.4em;">\nDenne side vises, n\xc3\xa5r Google automatisk registrerer anmodninger, der kommer fra dit computernetv\xc3\xa6rk, og som ser ud til at overtr\xc3\xa6de <a href="//www.google.com/policies/terms/">servicevilk\xc3\xa5rene</a>. Blokeringen udl\xc3\xb8ber kort tid efter, disse anmodninger holder op. <br><br>Denne trafik er blevet sendt af ondsindet software, et browser-plugin eller et script, der sender automatiske anmodninger. Hvis du deler din netv\xc3\xa6rksforbindelse, kan du bede din administrator om hj\xc3\xa6lp. Det kan m\xc3\xa5ske skyldes en anden computer, der bruger samme IP-adresse. 
<a href="//support.google.com/websearch/answer/86640">F\xc3\xa5 flere oplysninger</a><br><br>Du kan i nogle tilf\xc3\xa6lde f\xc3\xa5 vist denne side, hvis du bruger avancerede begreber, som robotter bruger, eller sender meget hurtige anmodninger.\n\n</div><br>\nIP-adresse: 192.38.90.52<br>Tid: 2023-01-05T13:27:43Z<br>Webadresse: https://scholar.google.dk/citations?user=9cagBQYAAAAJ<br>\n</div></div>\n</body>\n</html>\n'

"Vores systemer har registreret usædvanlig trafik fra dit computernetværk" (Danish for "Our systems have detected unusual traffic from your computer network")

fnielsen commented 1 year ago

This problem does not seem to occur in the GitHub Actions test runs.

Daniel-Mietchen commented 1 year ago

https://support.google.com/websearch/answer/86640?hl=da

It seems Google has been getting too much Scholia traffic for its taste and is now blocking it.
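
One way to tell such a block page apart from a real profile page would be to look for markers that appear in the response quoted earlier in this thread. The following is only a minimal sketch: `looks_like_block_page` and `BLOCK_MARKERS` are hypothetical names that do not exist in scholia/googlescholar.py, and the two markers are simply substrings taken from the response above, so Google may well change them.

```python
# Hypothetical helper, not part of scholia.googlescholar.
# The markers are substrings copied from the block page quoted in this issue;
# they are an assumption and may stop matching if Google changes the page.
BLOCK_MARKERS = (
    b"support.google.com/websearch/answer/86640",
    b"solveSimpleChallenge",
)


def looks_like_block_page(content: bytes) -> bool:
    """Return True if the response body resembles the quoted block page."""
    return any(marker in content for marker in BLOCK_MARKERS)
```

With a check like this, the caller could pass `response.content` to it and report the block explicitly before attempting to parse any profile data.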

Daniel-Mietchen commented 1 year ago

Here is what I get:

061     --------
062     >>> data = get_user_data('9cagBQYAAAAJ')
UNEXPECTED EXCEPTION: IndexError('list index out of range')
Traceback (most recent call last):
  File "/home/gitpod/.pyenv/versions/3.8.16/lib/python3.8/doctest.py", line 1336, in __run
    exec(compile(example.source, filename, "single",
  File "<doctest scholia.googlescholar.get_user_data[0]>", line 1, in <module>
  File "/workspace/scholia/scholia/googlescholar.py", line 93, in get_user_data
    'citations': int(citation_data[0]),
IndexError: list index out of range
/workspace/scholia/scholia/googlescholar.py:62: UnexpectedException

This indicates that citation_data is an empty list, which would fit with what you are getting.
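
A possible way to make this failure easier to diagnose is to check for the empty list before indexing it. This is a sketch under assumptions, not the actual Scholia code: `first_citation_count` and `NoCitationDataError` are hypothetical names, and the only thing taken from the traceback is the expression `int(citation_data[0])`.

```python
class NoCitationDataError(ValueError):
    """Hypothetical error for a page that yielded no citation data."""


def first_citation_count(citation_data):
    """Return the first citation count as an int.

    Raises NoCitationDataError instead of a bare IndexError when the
    list is empty, for example when the page was not a profile page.
    """
    if not citation_data:
        raise NoCitationDataError(
            "No citation data found; the response may not be a profile page."
        )
    return int(citation_data[0])
```

The doctest would still fail while the block is in place, but the message would point at the missing data rather than at a list index.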

fnielsen commented 1 year ago

But so far it only happens from my computer.

egonw commented 1 year ago

I have the same problem, actually.