ckreibich / scholar.py

A parser for Google Scholar, written in Python

Parsing fails on PDF citations and empty results #116

Open bifxcore opened 5 years ago

bifxcore commented 5 years ago

It looks like the underlying HTML changed, and the script now throws: TypeError: slice indices must be integers or None or have an __index__ method

I think I managed to fix it by changing the code around line 570 from:

            if str(tag).lower().find('.pdf'):
                if tag.find('div', {'class': 'gs_ttss'}):
                    self._parse_links(tag.find('div', {'class': 'gs_ttss'}))

to:

            if str(tag).lower().find('.pdf'):
                if isinstance(tag, NavigableString):
                    continue
                if isinstance(tag, Tag):
                    if tag.find('div', {'class': 'gs_or_ggsm'}):
                        self._parse_links(tag.find('div', {'class': 'gs_or_ggsm'}))
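
A side note on the outer condition, which is the same in both versions: `str.find` returns an index rather than a boolean, so it is -1 (truthy) when `.pdf` is absent and 0 (falsy) only when the text starts with `.pdf`. The sketch below uses made-up strings (not real Google Scholar output) to show this next to the `in` operator. Swapping `in` into scholar.py would change which tags reach `_parse_links`, so this is an observation, not a tested patch.

```python
# Hypothetical sample strings; these are not taken from Google Scholar output.
samples = [
    '<a href="paper.pdf">[PDF]</a>',  # contains ".pdf"
    '<div>no link here</div>',        # does not contain ".pdf"
    '.pdf at the very start',         # starts with ".pdf"
]

def find_is_truthy(text):
    # Mirrors the condition in the snippet above: the index is used as a boolean.
    return bool(text.lower().find('.pdf'))

def contains_pdf(text):
    # Explicit membership test.
    return '.pdf' in text.lower()

for text in samples:
    print(repr(text), find_is_truthy(text), contains_pdf(text))
```

The two helper names are illustrative only; neither exists in scholar.py.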

GianniSalami commented 5 years ago

How is NavigableString defined? Thank you for the fix!

bifxcore commented 5 years ago

How is NavigableString defined?

It needs to be imported from bs4 (the BeautifulSoup library).
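
For context on why the guard helps: in bs4, `NavigableString` is a subclass of Python's built-in `str`, so calling `.find('div', {...})` on one most likely ends up in `str.find`, which treats the second argument as a start index. That would match the TypeError reported above. The sketch below reproduces the effect with a stand-in class (`FakeNavigableString` is hypothetical and only mimics the inheritance; it is not bs4 code), so it runs without BeautifulSoup installed.

```python
class FakeNavigableString(str):
    """Hypothetical stand-in: like bs4's NavigableString, it inherits from str."""


def try_find(node):
    # Same call shape as tag.find('div', {'class': 'gs_ttss'}) in the issue.
    try:
        return node.find('div', {'class': 'gs_ttss'})
    except TypeError as exc:
        return 'TypeError: %s' % exc


def safe_find(node):
    # The idea behind the fix: skip plain text nodes before calling .find().
    if isinstance(node, str):
        return None
    return node.find('div', {'class': 'gs_ttss'})


text_node = FakeNavigableString('\n')
print(try_find(text_node))   # str.find rejects the dict as a start index
print(safe_find(text_node))  # the guard skips the text node instead
```

With the real library, the equivalent guard is the `isinstance(tag, NavigableString)` check from the fix.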



bifxcore commented 4 years ago

@peterzjx it still works for me (beautifulsoup4==4.3.2)

SvennoNito commented 4 years ago

Thank you so much, @bifxcore! It works for me now. Apparently the bug still exists one year later. For everybody who is as new to Beautiful Soup as I am, the two classes need to be imported like this:

from bs4 import NavigableString, Tag