Open javierelpianista opened 3 years ago
Hi Javier, I'm running into a very similar problem: scidownl doesn't work for me either. I'm using the same command as you (scidownl -D <doi>). I checked line 112, and the AttributeError comes from Beautiful Soup. Sci-Hub embeds papers in an iframe; the private function below searches for that element and assigns it to the iframe variable. The problem is that Beautiful Soup returns None for iframe, so Sci-Hub may have renamed the iframe to something else. That leaves me with a question about the HTML that Sci-Hub now serves.
def _search_direct_url(self, identifier):
    """
    Sci-Hub embeds papers in an iframe. This function finds the actual
    source url which looks something like https://moscow.sci-hub.io/.../....pdf.
    """
    res = self.sess.get(self.base_url + identifier, verify=False)
    s = self._get_soup(res.content)
    iframe = s.find('iframe')
    if iframe:
        return iframe.get('src') if not iframe.get('src').startswith('//') \
            else 'http:' + iframe.get('src')
Grace
It appears that Sci-Hub no longer uses iframes. The PDF now sits in an embed element inside a div; see the example below.
<div id="article">
<embed type="application/pdf" src="https://twin.sci-hub.se/6279/8a941ec16c0cd4c9ad1bf5ab29139335/ahmed2017.pdf#navpanes=0&view=FitH" id="pdf">
</div>
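To check which element a given page actually uses, here is a minimal, dependency-free sketch. It uses the standard library's html.parser rather than Beautiful Soup (a deliberate swap so it runs without extra packages), and the names PdfSrcFinder and find_pdf_src are hypothetical, not part of scidownl. It assumes the viewer element keeps id="pdf", as in the markup above.

```python
from html.parser import HTMLParser


class PdfSrcFinder(HTMLParser):
    """Record the src of the first <embed> or <iframe> whose id is "pdf"."""

    def __init__(self):
        super().__init__()
        self.src = None

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if self.src is None and tag in ("embed", "iframe"):
            attr_map = dict(attrs)
            if attr_map.get("id") == "pdf":
                self.src = attr_map.get("src")


def find_pdf_src(html):
    """Return the viewer's src attribute, or None if no viewer is present."""
    finder = PdfSrcFinder()
    finder.feed(html)
    return finder.src
```

Accepting both tag names means the same lookup works on the old iframe layout and the new embed layout, and a None result signals that the page has no viewer at all.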
PR with the quickfix below.
simply replace
pdf_url = soup.find('iframe', {'id': 'pdf'}).attrs['src'].split('#')[0]
with
pdf_url = soup.find('embed', {'id': 'pdf'}).attrs['src'].split('#')[0]
works for me
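Both the older helper and the one-line fix also clean up the src value: one prefixes a scheme when the value starts with //, the other drops everything after #. A small sketch of doing both with urllib.parse follows. The function name normalize_pdf_url and the default base_url are assumptions for illustration, not scidownl's API. Unlike the helper above, which hard-codes http:, this version borrows the scheme from base_url.

```python
from urllib.parse import urldefrag, urljoin


def normalize_pdf_url(src, base_url="https://sci-hub.se/"):
    """Turn a viewer src attribute into an absolute URL without a fragment.

    base_url is an assumed default; pass the mirror you actually queried.
    """
    # urljoin resolves protocol-relative ("//host/x.pdf") and site-relative
    # ("/downloads/x.pdf") values against base_url; absolute URLs pass through.
    absolute = urljoin(base_url, src)
    # urldefrag drops the "#navpanes=0&view=FitH" viewer hint.
    return urldefrag(absolute).url
```

For example, normalize_pdf_url(soup.find('embed', {'id': 'pdf'}).attrs['src']) would replace the .split('#')[0] step, provided the element was found.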
Ok thank you :)
Hi, I'm using Python 3.9 and ran into the same problem you described above. I followed your advice, but it still didn't work. @ddhecnu
That's weird; there must be some kind of problem with the iframe. I'm not actually sure what the differences between 3.8 and 3.9 are. I posted an alternative solution that uses a different website on the GitHub page. Thanks for responding.
I still get the error for some articles, even after changing iframe to embed:
Traceback (most recent call last):
  File "\main.py", line 20, in <module>
    download(DOIs)
  File "\main.py", line 13, in download
    SciHub(doi, out).download(choose_scihub_url_index=1)
  File "\scihub.py", line 88, in download
    pdf = self.find_pdf_in_html(res.text)
  File "\scihub.py", line 112, in find_pdf_in_html
    pdf_url = soup.find('embed', {'id': 'pdf'}).attrs['src'].split('#')[0]
AttributeError: 'NoneType' object has no attribute 'attrs'
Great, what about a reproducible example? DOI maybe? I just randomly checked the sci-hub, and it seems fine. PDFs flourishing and resting in their embed lane.
My bad. The errors were occurring in articles not yet available on scihub. I hadn't realized that was the problem.
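Since find() returns None when nothing matches, an unavailable article surfaces as the opaque AttributeError above. One possible way to make that case readable is sketched below. PdfNotFoundError and extract_pdf_url are hypothetical names, not part of scidownl; the function takes whatever soup.find('embed', {'id': 'pdf'}) returned.

```python
class PdfNotFoundError(Exception):
    """Raised when the page has no embedded PDF viewer (hypothetical name)."""


def extract_pdf_url(element, identifier):
    """Return the PDF URL from the result of soup.find('embed', {'id': 'pdf'}).

    element is a Beautiful Soup Tag, or None when nothing matched.
    identifier (for example a DOI) is only used in the error message.
    """
    if element is None:
        raise PdfNotFoundError(
            f"No embedded PDF found for {identifier!r}; "
            "the article may not be available on Sci-Hub."
        )
    src = element.attrs.get("src")
    if not src:
        raise PdfNotFoundError(
            f"PDF viewer for {identifier!r} has no src attribute."
        )
    # Drop the viewer hint after '#', as the one-line fix does.
    return src.split("#")[0]
```

A caller could then write pdf_url = extract_pdf_url(soup.find('embed', {'id': 'pdf'}), doi) and catch PdfNotFoundError to skip articles that are not yet available.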
When I try to download an article using
I get the following error:
This didn't happen before. I am using Arch Linux, but also tried in a virtual machine with Linux Mint. Accessing SciHub manually and downloading the article works.