@abubelinha thank you for opening this issue 🙂
My assumption is that this is somehow related to the 3.9 version. Could you try the 3.10 or 3.11 Python versions? I had issues with webdriver-manager when using 3.9.
c:\python39\scripts
For example, I've installed it to C:\Workspace\Programming\SerpApi\python\env (env is a folder created using python -m venv env, then source env/Scripts/activate, then pip install scrape-google-scholar-py).
Apparently it installed with no problems.
How do you understand it? Just curious and trying to understand 🙂 Could you show the full installation output?
Examples on my end: Windows 10, Python 3.11, using a virtual environment (not installed globally).
Your example (no ImportError):
❯ python
Python 3.11.0 (main, Oct 24 2022, 18:26:48) [MSC v.1933 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from google_scholar_py import CustomGoogleScholarProfiles
>>> from google_scholar_py import SerpApiGoogleScholarOrganic
>>>
>>> exit()
Another command with the output:
❯ python
Python 3.11.0 (main, Oct 24 2022, 18:26:48) [MSC v.1933 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from google_scholar_py import CustomGoogleScholarProfiles
>>> import json
>>>
>>> parser = CustomGoogleScholarProfiles()
>>> data = parser.scrape_google_scholar_profiles(
... query='blizzard',
... pagination=False,
... save_to_csv=False,
... save_to_json=False
... )
[WDM] - Downloading: 100%|████████████████████████████████████████████████████████| 6.80M/6.80M [00:01<00:00, 6.64MB/s]
>>> print(json.dumps(data, indent=2))
[
{
"name": "Adam Lobel",
"link": "https://scholar.google.com/citations?hl=en&user=_xwYD2sAAAAJ",
"affiliations": "Blizzard Entertainment",
"interests": [
"Gaming",
"Emotion regulation"
],
"email": "Verified email at AdamLobel.com",
"cited_by_count": 3791
},
{
"name": "Catherine A Blizzard",
"link": "https://scholar.google.com/citations?hl=en&user=vfPEiVUAAAAJ",
"affiliations": "",
"interests": null,
"email": null,
"cited_by_count": 1408
},
{
"name": "Daniel Blizzard",
"link": "https://scholar.google.com/citations?hl=en&user=dk4LWEgAAAAJ",
"affiliations": "",
"interests": null,
"email": null,
"cited_by_count": 1102
},
{
"name": "Shuo Chen",
"link": "https://scholar.google.com/citations?hl=en&user=OBf4YnkAAAAJ",
"affiliations": "Senior Data Scientist, Blizzard Entertainment",
"interests": [
"Machine Learning",
"Data Mining",
"Artificial Intelligence"
],
"email": "Verified email at cs.cornell.edu",
"cited_by_count": 744
},
{
"name": "Ian Livingston",
"link": "https://scholar.google.com/citations?hl=en&user=xBHVqNIAAAAJ",
"affiliations": "Blizzard Entertainment",
"interests": [
"Human-computer interaction",
"User Experience",
"Player Experience",
"User Research",
"Games"
],
"email": "Verified email at usask.ca",
"cited_by_count": 659
},
{
"name": "Minli Xu",
"link": "https://scholar.google.com/citations?hl=en&user=QST5iogAAAAJ",
"affiliations": "Blizzard Entertainment",
"interests": [
"Game",
"Machine Learning",
"Data Science",
"Bioinformatics"
],
"email": "Verified email at blizzard.com",
"cited_by_count": 557
},
{
"name": "Je Seok Lee",
"link": "https://scholar.google.com/citations?hl=en&user=vuvtlzQAAAAJ",
"affiliations": "Blizzard Entertainment",
"interests": [
"HCI",
"Player Experience",
"Games",
"Esports"
],
"email": "Verified email at uci.edu",
"cited_by_count": 434
},
{
"name": "Alisha Ness",
"link": "https://scholar.google.com/citations?hl=en&user=xQuwVfkAAAAJ",
"affiliations": "Activision Blizzard",
"interests": null,
"email": null,
"cited_by_count": 351
},
{
"name": "Xingyu (Alfred) Liu",
"link": "https://scholar.google.com/citations?hl=en&user=VW9ukOwAAAAJ",
"affiliations": "Blizzard Entertainment",
"interests": [
"Machine Learning in Game Development"
],
"email": "Verified email at andrew.cmu.edu",
"cited_by_count": 278
},
{
"name": "Amanda LL Cullen",
"link": "https://scholar.google.com/citations?hl=en&user=oqna6OgAAAAJ",
"affiliations": "Blizzard Entertainment",
"interests": [
"Games Studies",
"Fan Studies",
"Live Streaming"
],
"email": null,
"cited_by_count": 270
}
]
>>>
Thanks for your prompt reply.
As for the installation output:
c:>c:\python39\scripts\pip install scrape-google-scholar-py
Collecting scrape-google-scholar-py
Downloading scrape-google-scholar-py-0.2.27.tar.gz (35 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Collecting selectolax==0.3.12
Downloading selectolax-0.3.12-cp39-cp39-win_amd64.whl (1.9 MB)
---------------------------------------- 1.9/1.9 MB 5.2 MB/s eta 0:00:00
Collecting selenium-stealth==1.0.6
Downloading selenium_stealth-1.0.6-py3-none-any.whl (32 kB)
Collecting google-search-results>=2.4
Downloading google_search_results-2.4.2.tar.gz (18 kB)
Preparing metadata (setup.py) ... done
Collecting pandas>=1.5.3
Downloading pandas-2.0.1-cp39-cp39-win_amd64.whl (10.7 MB)
---------------------------------------- 10.7/10.7 MB 7.4 MB/s eta 0:00:00
Collecting parsel==1.7.0
Downloading parsel-1.7.0-py2.py3-none-any.whl (14 kB)
Requirement already satisfied: packaging in c:\python39\lib\site-packages (from parsel==1.7.0->scrape-google-scholar-py) (21.3)
Requirement already satisfied: lxml in c:\python39\lib\site-packages (from parsel==1.7.0->scrape-google-scholar-py) (4.7.1)
Collecting w3lib>=1.19.0
Downloading w3lib-2.1.1-py3-none-any.whl (21 kB)
Requirement already satisfied: cssselect>=0.9 in c:\python39\lib\site-packages (from parsel==1.7.0->scrape-google-scholar-py) (1.1.0)
Requirement already satisfied: Cython>=0.29.23 in c:\python39\lib\site-packages (from selectolax==0.3.12->scrape-google-scholar-py) (0.29.27)
Requirement already satisfied: selenium in c:\python39\lib\site-packages (from selenium-stealth==1.0.6->scrape-google-scholar-py) (3.141.0)
Requirement already satisfied: requests in c:\python39\lib\site-packages (from google-search-results>=2.4->scrape-google-scholar-py) (2.18.4)
Collecting tzdata>=2022.1
Downloading tzdata-2023.3-py2.py3-none-any.whl (341 kB)
---------------------------------------- 341.8/341.8 kB 10.7 MB/s eta 0:00:00
Requirement already satisfied: pytz>=2020.1 in c:\python39\lib\site-packages (from pandas>=1.5.3->scrape-google-scholar-py) (2021.3)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\python39\lib\site-packages (from pandas>=1.5.3->scrape-google-scholar-py) (2.8.2)
Requirement already satisfied: numpy>=1.20.3 in c:\python39\lib\site-packages (from pandas>=1.5.3->scrape-google-scholar-py) (1.21.5+vanilla)
Requirement already satisfied: six>=1.5 in c:\python39\lib\site-packages (from python-dateutil>=2.8.2->pandas>=1.5.3->scrape-google-scholar-py) (1.16.0)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in c:\python39\lib\site-packages (from packaging->parsel==1.7.0->scrape-google-scholar-py) (3.0.6)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\python39\lib\site-packages (from requests->google-search-results>=2.4->scrape-google-scholar-py) (3.0.4)
Requirement already satisfied: idna<2.7,>=2.5 in c:\python39\lib\site-packages (from requests->google-search-results>=2.4->scrape-google-scholar-py) (2.6)
Requirement already satisfied: certifi>=2017.4.17 in c:\python39\lib\site-packages (from requests->google-search-results>=2.4->scrape-google-scholar-py) (2021.10.8)
Requirement already satisfied: urllib3<1.23,>=1.21.1 in c:\python39\lib\site-packages (from requests->google-search-results>=2.4->scrape-google-scholar-py) (1.22)
Building wheels for collected packages: scrape-google-scholar-py, google-search-results
Building wheel for scrape-google-scholar-py (pyproject.toml) ... done
Created wheel for scrape-google-scholar-py: filename=scrape_google_scholar_py-0.2.27-py3-none-any.whl size=29185 sha256=40eaf39f199cc1d19e7882c9d88b52116f53bc3c64c6455f0ddf84fc44728df3
Stored in directory: c:\users\abu\appdata\local\pip\cache\wheels\64\2a\60\e0fb0cf78bc2dad32cf92494b208d15bc4e7e584d6f8088c69
Building wheel for google-search-results (setup.py) ... done
Created wheel for google-search-results: filename=google_search_results-2.4.2-py3-none-any.whl size=32017 sha256=cdafe96383fa594ca3f90f60e2220e1bd4562407e5801d8d9e9df0a590525979
Stored in directory: c:\users\abu\appdata\local\pip\cache\wheels\68\8e\73\744b7d9d7ac618849d93081a20e1c0deccd2aef90901c9f5a9
Successfully built scrape-google-scholar-py google-search-results
Installing collected packages: w3lib, tzdata, selectolax, selenium-stealth, parsel, pandas, google-search-results, scrape-google-scholar-py
Attempting uninstall: tzdata
Found existing installation: tzdata 2021.5
Uninstalling tzdata-2021.5:
Successfully uninstalled tzdata-2021.5
Attempting uninstall: pandas
Found existing installation: pandas 1.4.0
Uninstalling pandas-1.4.0:
Successfully uninstalled pandas-1.4.0
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
pycirclize 0.1.1 requires pandas<2.0.0,>=1.3.5, but you have pandas 2.0.1 which is incompatible.
Successfully installed google-search-results-2.4.2 pandas-2.0.1 parsel-1.7.0 scrape-google-scholar-py-0.2.27 selectolax-0.3.12 selenium-stealth-1.0.6 tzdata-2023.3 w3lib-2.1.1
[notice] A new release of pip is available: 23.0.1 -> 23.1.2
[notice] To update, run: c:\python39\python.exe -m pip install --upgrade pip
I know there is some kind of problem with my pycirclize package, but that's an old installation I just did not remove. I doubt it conflicts with your package installation: the final message is "Successfully installed google-search-results-2.4.2 pandas-2.0.1 parsel-1.7.0 scrape-google-scholar-py-0.2.27 selectolax-0.3.12 selenium-stealth-1.0.6 tzdata-2023.3 w3lib-2.1.1".
I cannot try a higher Python version right now on that machine, but I will come back to this, maybe in a couple of months. I am definitely interested in grabbing Google Scholar info from Python, though not that intensively (just running a query twice a year or so).
I'll take the opportunity to ask about the possibilities of the SerpApi free tier. I am only interested in finding out, 2-3 times a year, new papers citing some specific words (related to my institution) within the article text. On average, when paginating the Google Scholar web interface, I usually find about 30 papers a year. Would that be possible to do automatically with a free SerpApi plan? It says "100 searches/month", but I wonder what "one search" means. Would something like this (scholarly example call) count as just one search?
Thanks!
@abubelinha thank you for the additional details 🙂
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts. pycirclize 0.1.1 requires pandas<2.0.0,>=1.3.5, but you have pandas 2.0.1 which is incompatible.
Just to add value to this: you can downgrade the pandas package to a version below 2.0.0 that is compatible with pycirclize:
pip install pandas==1.3.5 # or any other version below 2.0.0
Also, it's best to test in an isolated virtual environment so you don't have conflicts with other packages that are already installed on your machine.
python -m venv myenv
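For completeness, a minimal sketch of the remaining steps, assuming Windows cmd (on Git Bash the activation command is source myenv/Scripts/activate instead, as shown earlier in this thread):
myenv\Scripts\activate
pip install scrape-google-scholar-py
python -c "from google_scholar_py import SerpApiGoogleScholarOrganic"
If the last command prints no error, the import works from that environment.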
I cannot try a higher Python version right now in that machine.
🙂 I think an isolated environment + Python 3.10 should solve this issue. I can set up a GitHub reminder (via Octo Reminder) for you to have a look one more time. Let me know.
On average, when paginating Google Scholar web interface I use to find about 30 papers a year. Would that be possible to do automatically with a Free Serpapi Plan?
There it says "100 searches/month", but I wonder what "one search" means.
Yes, it is possible.
One search = one request. For example, when you do a regular Google Scholar search from your browser by hand, you type a query, hit enter, and Google returns results. That's one request with returned results. Same with SerpApi. Hope this makes sense 🙂
Would something like https://github.com/scholarly-python-package/scholarly/issues/208#issuecomment-718190945 (scholarly example call) account for just one search?
Yes, it would be one search 🙂
For example, when I change the start param (paginating to the next page), it takes 1 search (which is 1 request), and if you go through 10 pages, that equals 10 searches (which is 10 requests).
And these are the search parameters that match scholarly.search_pubs(query='Ring Resonator', patents=True, citations=True, year_low=2010, year_high=2015):
# This is using SerpApi's Python wrapper instead of scrape-google-scholar-py
from serpapi import GoogleSearch

params = {
    "api_key": "...",            # your SerpApi key, https://serpapi.com/manage-api-key
    "engine": "google_scholar",  # SerpApi engine
    "q": "Ring Resonator",       # search query
    "hl": "en",                  # language
    "as_sdt": "7",               # include patents
    "as_ylo": "2010",            # from year
    "as_yhi": "2015",            # to year
    "start": "0"                 # page number (0 - first page, 10 - second...)
}

search = GoogleSearch(params)
results = search.get_dict()

publications = []
for result in results["organic_results"]:
    publications.append(result)  # appends all the data from the "organic_results" key
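And if you wanted to paginate, a minimal sketch (my own illustration, not an official example) would just bump the start value by 10 per page; each iteration spends one search of the monthly quota:

from serpapi import GoogleSearch

def scholar_pages(query, pages, api_key="..."):
    # Hypothetical helper, not part of any package: each loop iteration
    # is one SerpApi search (= one request against the monthly quota).
    publications = []
    for page in range(pages):
        params = {
            "api_key": api_key,          # your SerpApi key
            "engine": "google_scholar",
            "q": query,
            "start": str(page * 10),     # 0 - first page, 10 - second...
        }
        results = GoogleSearch(params).get_dict()
        publications.extend(results.get("organic_results", []))
    return publications

pubs = scholar_pages("Ring Resonator", pages=10)  # 10 pages -> 10 searches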
Let me know if it doesn't make sense 🙂
Thanks for your detailed explanations!! I am in no rush to set this up now; I'd prefer to wait until our IT staff upgrades our machines, probably next autumn. But I will definitely try all this.
A couple of questions I have though:
As I couldn't run your package, I made some tests with scholarly. But I noticed the GS output is truncated (see discussion here). So you actually need to launch several additional requests to get publication details for each of the 20 papers returned in "one request". I guess the same constraints apply using SerpApi. Correct?
All this in order to get a simple "references list" with the typical structure: full authors + year + title + journal/volume/issue/pages + url.
So in order to produce that list for a single "original GS request" which returns 20 references, we would need at least one additional request per reference? (Or maybe more than one, if we need to launch one request for the full title, another for the journal, another for the authors... I don't really know, just guessing.)
So a simple 20-reference list turns out to need: 1 request to get the original GS page with 20 references, +20 requests to get the full titles? +20 requests to get the full journal details? +20 requests to get the full authors?
So at least 21 requests, but probably 61 for getting the full details? (Or more, if I am missing something.) And that's just for the 1st page of GS results.
Is this correct, or would scrape-google-scholar-py / SerpApi somehow include more detailed info in the original GS request, so we might not need to launch additional requests? (As per your sentence "scholarly only extracts the first 3 points".)
Thanks a lot for your help
@abubelinha of course 🙂🙂
In my browser GS returns 20 papers/page, although I think that is configurable (you can choose 10 or 20, but not more). So I understand that is what we are getting with one SerpApi search too.
"but not more" - it's indeed the case. It's a Google Scholar restriction and neither I nor SerpApi can bypass it.
But what do we get in one of those requests?
"However, it only extracts the first 3 points below". But I don't understand what you mean by that last sentence about 3 points.
"below" is a typo, I meant "above" 🙂 A visual example of what I meant (I'll update the README, thank you 🙂):
So you actually need to launch several additional requests to get publication details for each of the 20 papers returned in "one request". I guess the same constraints apply using serpapi. Correct?
Yes. Every package (including third-party APIs) out there needs to make additional requests to Google, which in return gets the data to you. In other words, if no additional requests are being made, there's no data to extract from. Not sure how to explain it better for now 🙂
Just to clarify what you've already written, but in a visual form.
Calculated without the "link" JSON key, as it leads to an outside website (which SerpApi can't scrape), so 5 additional requests need to be done.
For 1 page with 20 results: (calculation screenshot)
For 10 pages with 20 results per page: (calculation screenshot)
For 20 pages with 20 results per page: (calculation screenshot)
I think these are accurate calculations, I'm not really good at math though 🙂
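Since the calculation screenshots didn't survive here, a back-of-the-envelope sketch of the arithmetic as I read it (assuming 1 search per results page plus 5 additional requests per result, one per JSON field besides "link"; that interpretation is mine):

# Hedged sketch: the per-result cost of 5 is an assumption based on the
# "5 additional requests" note above, not a SerpApi guarantee.
def total_requests(pages, results_per_page=20, extra_per_result=5):
    page_searches = pages  # 1 search per results page
    detail_requests = pages * results_per_page * extra_per_result
    return page_searches + detail_requests

print(total_requests(1))   # 1 page   -> 101 requests
print(total_requests(10))  # 10 pages -> 1010 requests
print(total_requests(20))  # 20 pages -> 2020 requests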
Is this correct or would scrape-google-scholar-py / serpapi somehow include more detailed info in the original GS request, so we might not need to launch additional requests?
Not 100% sure what you meant 🙂
Both SerpApi and scrape-google-scholar-py (and scholarly and other modules) extract 1:1 the information that Google Scholar shows.
Perfectly explained and understood. I'll probably be back to this in a few months, but no more questions for now. Thanks a lot!!
I'll set a reminder for myself and for you. It will tag us in this thread on August 1. This is also to close the issue if it stays inactive for too long 🙂
@set-reminder August 1 5am @abubelinha if you have time, could you please have another look at this issue and try running it on Python 3.10+ to see if you get the same error: ImportError: cannot import name 'SerpApiGoogleScholarOrganic' from 'google_scholar_py'? Thank you.
⏰ Reminder Tuesday, August 1, 2023 5:00 AM (GMT+01:00)
@abubelinha if you have time, could you please have another look at this issue and try running it on Python 3.10+ to see if you get the same error: ImportError: cannot import name 'SerpApiGoogleScholarOrganic' from 'google_scholar_py'? Thank you.
OK, don't worry. I expect not to be around by that time (holidays?). Whenever I have my OS upgraded, I will come back to you if I can start trying this package again (probably not before Christmas, because I tend to be too busy at work until the end of autumn).
🙂 @dimitryzub
@abubelinha if you have time, could you please have another look at this issue and try running it on Python 3.10+ to see if you get the same error: ImportError: cannot import name 'SerpApiGoogleScholarOrganic' from 'google_scholar_py'? Thank you.
I am using Windows 10, Python 3.9:
First I installed like this:
Apparently it installed with no problems. Then I tried to import:
Same error trying either CustomGoogleScholarProfiles or SerpApiGoogleScholarOrganic.