Tishacy / SciDownl

An unofficial api for downloading papers from SciHub via DOI, PMID, title
MIT License

Getting AttributeError when downloading pdf #14

Open javierelpianista opened 3 years ago

javierelpianista commented 3 years ago

When I try to download an article using

scidownl -D <doi>

I get the following error:

File "/home/jgarcia/.local/lib/python3.9/site-packages/scidownl/scihub.py", line 112, in find_pdf_in_html
pdf_url = soup.find('iframe', {'id': 'pdf'}).attrs['src'].split('#')[0]

AttributeError: 'NoneType' object has no attribute 'attrs'

This didn't happen before. I am using Arch Linux, but also tried in a virtual machine with Linux Mint. Accessing SciHub manually and downloading the article works.

grace-reed commented 3 years ago

Hi Javier, I'm hitting a very similar problem: scidownl does not work for me either, using the same command as you (scidownl -D <doi>). I checked line 112, and the AttributeError comes from the Beautiful Soup lookup. Sci-Hub embeds papers in an iframe, and the private function below searches for that element and assigns it to the iframe variable. The problem is that Beautiful Soup returns None for the iframe, so Sci-Hub may have renamed the iframe to something else. My remaining question is about the HTML that Sci-Hub now serves.

def _search_direct_url(self, identifier):
    """
    Sci-Hub embeds papers in an iframe. This function finds the actual
    source url which looks something like https://moscow.sci-hub.io/.../....pdf.
    """
    res = self.sess.get(self.base_url + identifier, verify=False)
    s = self._get_soup(res.content)
    iframe = s.find('iframe')
    if iframe:
        return iframe.get('src') if not iframe.get('src').startswith('//') \
            else 'http:' + iframe.get('src')

Grace
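The function above turns a scheme-relative src (one starting with //) into an absolute URL by prefixing 'http:'. A minimal sketch of the same step using the standard library's urljoin is shown here; the function name absolute_src and the example.org URLs are placeholders for illustration, not part of scidownl. Note one difference: urljoin inherits the scheme of the page URL rather than always using http.

```python
from urllib.parse import urljoin


def absolute_src(page_url: str, src: str) -> str:
    """Resolve an element's src against the URL of the page it came from.

    Scheme-relative ("//host/path"), root-relative ("/path") and already
    absolute values are all handled by urljoin.
    """
    return urljoin(page_url, src)


if __name__ == "__main__":
    # Hypothetical page URL, used only to illustrate the resolution rules.
    page = "https://example.org/10.1000/xyz"
    print(absolute_src(page, "//files.example.org/a/b.pdf"))
    print(absolute_src(page, "https://files.example.org/a/b.pdf"))
```

Passing the URL that was actually requested as page_url keeps the result on https when the page itself was served over https.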

fridrichmrtn commented 3 years ago

It appears that Sci-Hub no longer uses iframes. The PDF now sits in an embed element inside a div; see the example below.

<div id="article">
        <embed type="application/pdf" src="https://twin.sci-hub.se/6279/8a941ec16c0cd4c9ad1bf5ab29139335/ahmed2017.pdf#navpanes=0&amp;view=FitH" id="pdf">
</div>

PR with the quick fix: #16

ddh101 commented 3 years ago

Simply replacing pdf_url = soup.find('iframe', {'id': 'pdf'}).attrs['src'].split('#')[0] with pdf_url = soup.find('embed', {'id': 'pdf'}).attrs['src'].split('#')[0] works for me.
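Since the page layout has already changed once from iframe to embed, a lookup that accepts either tag may be less brittle than hard-coding one of them. The sketch below illustrates that idea with the standard library's html.parser instead of Beautiful Soup, purely to stay dependency-free. The names PdfSrcFinder and find_pdf_url are hypothetical and do not come from scidownl.

```python
from html.parser import HTMLParser
from typing import Optional


class PdfSrcFinder(HTMLParser):
    """Records the src of the first <embed> or <iframe> whose id is "pdf"."""

    def __init__(self) -> None:
        super().__init__()
        self.src: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        # Stop looking once a src was found; ignore unrelated tags.
        if self.src is not None or tag not in ("embed", "iframe"):
            return
        attributes = dict(attrs)
        if attributes.get("id") == "pdf":
            self.src = attributes.get("src")


def find_pdf_url(html: str) -> Optional[str]:
    """Return the PDF URL without its '#...' fragment, or None if absent."""
    finder = PdfSrcFinder()
    finder.feed(html)
    if finder.src is None:
        return None
    return finder.src.split("#")[0]


if __name__ == "__main__":
    sample = (
        '<div id="article">'
        '<embed type="application/pdf" '
        'src="https://twin.sci-hub.se/6279/8a941ec16c0cd4c9ad1bf5ab29139335/'
        'ahmed2017.pdf#navpanes=0&amp;view=FitH" id="pdf">'
        "</div>"
    )
    print(find_pdf_url(sample))
```

Returning None instead of chaining .attrs on a missing element lets the caller decide how to report a page that contains no PDF.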

grace-reed commented 3 years ago

Ok thank you :)


BaHole commented 2 years ago

@ddhecnu wrote: simply replace pdf_url = soup.find('iframe', {'id': 'pdf'}).attrs['src'].split('#')[0] with pdf_url = soup.find('embed', {'id': 'pdf'}).attrs['src'].split('#')[0] works for me

Hi, I'm using Python 3.9 and ran into the same problem described above. I followed your advice, but it still didn't work. @ddhecnu

grace-reed commented 2 years ago

That's weird, there must be some kind of problem with the iframe. I'm not actually sure what the differences between 3.8 and 3.9 are. I commented an alternative solution using a different website on the GitHub page, so thanks for responding.


PhelaPoscam commented 2 years ago

I still get the error for some articles, even after changing iframe to embed:

Traceback (most recent call last):
  File "\main.py", line 20, in <module>
    download(DOIs)
  File "\main.py", line 13, in download
    SciHub(doi, out).download(choose_scihub_url_index=1)
  File "\scihub.py", line 88, in download
    pdf = self.find_pdf_in_html(res.text)
  File "\scihub.py", line 112, in find_pdf_in_html
    pdf_url = soup.find('embed', {'id': 'pdf'}).attrs['src'].split('#')[0]
AttributeError: 'NoneType' object has no attribute 'attrs'

fridrichmrtn commented 2 years ago

Great, what about a reproducible example? DOI maybe? I just randomly checked the sci-hub, and it seems fine. PDFs flourishing and resting in their embed lane.

PhelaPoscam commented 2 years ago

My bad. The errors were occurring in articles not yet available on scihub. I hadn't realized that was the problem.
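An unavailable article surfaces as the same AttributeError as the layout change, which is what made this case hard to spot. One possible way to tell the two apart is to check the lookup result before reading its attributes and raise a descriptive error instead. The sketch below assumes the element is either None or something with a .get method: a plain dict of attributes here, though a Beautiful Soup Tag also provides .get. PdfNotFoundError and pdf_url_from_element are hypothetical names, not part of scidownl.

```python
from typing import Mapping, Optional


class PdfNotFoundError(Exception):
    """Raised when the page has no usable PDF element (hypothetical name)."""


def pdf_url_from_element(element: Optional[Mapping[str, str]], doi: str) -> str:
    """Return the element's src without its '#...' fragment.

    `element` stands in for the result of a lookup such as soup.find(...),
    which is None when nothing matched.
    """
    if element is None:
        raise PdfNotFoundError(
            f"No PDF element found for {doi!r}; "
            "the article may not be available on Sci-Hub."
        )
    src = element.get("src")
    if not src:
        raise PdfNotFoundError(f"PDF element for {doi!r} has no src attribute.")
    return src.split("#")[0]


if __name__ == "__main__":
    try:
        # The DOI is a made-up placeholder.
        pdf_url_from_element(None, "10.1000/example")
    except PdfNotFoundError as exc:
        print(exc)
```

A batch download loop could catch this exception, log the DOI, and continue with the next article.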
