cxsoto / article-regions

A dataset of region-annotated scientific articles.

Why "No such file or directory" when I have downloaded the file? #2

Open Pxtri2156 opened 3 years ago

Pxtri2156 commented 3 years ago

Hello everyone. When I download all the articles, I get the error below:

`100 article IDs in pmc_ids.txt
Checking PubMed Central servers...
Request static : 200
...using US server.
Checking installed tools...
/usr/bin/curl
/usr/bin/pdfinfo
...tools ok.
Checking PDF permissions for ImageMagick...
/usr/bin/convert
...permissions ok.

Syntax Warning: May not be a PDF file (continuing anyway)
.....................................................................
Syntax Error (353): Illegal character <2d> in hex string
Dest: ./pdfs/PMC1538887.pdf
Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1538887/pdf
Downloading https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1538887/pdf ...
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
Traceback (most recent call last):
  File "download_and_render.py", line 88, in
    print('{}KB'.format(os.stat(dest).st_size // 1024))
FileNotFoundError: [Errno 2] No such file or directory: './pdfs/PMC1538887.pdf'
Syntax Warning: May not be a PDF file (continuing anyway)

Syntax Error (353): Illegal character <2d> in hex string
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error (2): Illegal character <3f> in hex string
Syntax Error (3): Illegal character <78> in hex string
Syntax Error (4): Illegal character <6d> in hex string
..........................................................................
Syntax Error (353): Illegal character <2d> in hex string
Syntax Error: Couldn't find trailer dictionary
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
PMC_dataset/article-regions$
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
100 17200 0 17200 0 0 7116 0 --:--:-- 0:00:02 --:--:-- 19713`

The error is: FileNotFoundError: [Errno 2] No such file or directory: './pdfs/PMC1538887.pdf'
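One observation on the log, offered as a guess rather than a diagnosis: the bytes that pdfinfo rejects at positions 2 to 4 (`3f`, `78`, `6d`) decode to `?xm`, which would be consistent with a file that begins with `<?xml`, that is, a web page saved under a `.pdf` name. A minimal sketch of a check that could run before rendering is below. The helper name `looks_like_pdf` is hypothetical and not part of `download_and_render.py`, and the check is deliberately strict: it assumes the `%PDF-` signature sits at the very first byte.

```python
import os

# Signature that opens a PDF file. Assumption: it sits at byte 0.
PDF_MAGIC = b"%PDF-"


def looks_like_pdf(path):
    """Return True if the file exists and starts with the PDF signature.

    Hypothetical helper, not taken from the repository.
    """
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as fh:
        return fh.read(len(PDF_MAGIC)) == PDF_MAGIC


# Decode the characters that pdfinfo reported as illegal.
print(bytes.fromhex("3f786d").decode("ascii"))  # prints "?xm"
```

Calling `looks_like_pdf('./pdfs/PMC1538887.pdf')` before `os.stat` or `pdfinfo` would separate two cases: the file is missing, or the server returned something other than a PDF.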

Howardqlz commented 2 years ago

After running the .py file, I got this log:

100 article IDs in pmc_ids.txt
Checking PubMed Central servers...
...using US server.
Checking installed tools...
/usr/bin/curl
...tools ok.
Checking PDF permissions for ImageMagick...
...permissions ok.
Downloading https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1236913/pdf ...
Traceback (most recent call last):
  File "download_and_render.py", line 82, in
    print('{}KB'.format(os.stat(dest).st_size // 1024))
FileNotFoundError: [Errno 2] No such file or directory: 'pdfs/PMC1236913.pdf'
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
~/workspace/dla_dataset/article-regions>
0 0 0 0 0 0 0 0 --:--:-- 0:00:02 --:--:-- 0
100 17361 0 17361 0 0 6001 0 --:--:-- 0:00:02 --:--:-- 73563

Also, there is only one PDF in ./pdfs. What's the problem?

Howardqlz commented 2 years ago

OK, I changed the download function and can now download the correct PDF files:

import os
import string
import urllib.parse
import urllib.request

def download_pdf(dir_name, pmcid_file='pmc_ids.txt'):
    # Read one PMC ID per line and build the PDF URL for each article.
    # (The original url_list argument was overwritten here, so it is dropped.)
    with open(pmcid_file, 'r') as f:
        pmcids = f.read().splitlines()
    url_pref = 'https://www.ncbi.nlm.nih.gov/pmc/articles/'
    url_post = '/pdf'
    url_list = [url_pref + pmcid + url_post for pmcid in pmcids]
    # Install an opener that sends a browser-like User-Agent with every request.
    opener = urllib.request.build_opener()
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 3578.98 Safari/537.36')]
    urllib.request.install_opener(opener)
    num = 0
    for url in url_list:
        url = url.replace(' ', '+')
        print(url)
        file_name = url.split('/')[-2]  # the PMC ID
        try:
            urllib.request.urlretrieve(urllib.parse.quote(url, safe=string.printable),
                                       filename=os.path.join(dir_name, file_name + '.pdf'))
        except Exception as err:
            print('cannot download:', err)
            continue
        num += 1  # count only successful downloads
    print('total downloaded:', num)

It may stop partway through if the network connection is unstable. Hope it works for you.
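For interrupted runs, one option is to skip files that are already on disk and retry failed requests a few times, so the script can simply be started again. The sketch below is only an illustration under stated assumptions: `fetch_with_retries`, its `fetch` parameter, and the default retry count and delay are hypothetical names and values, not part of the repository or of the function above.

```python
import os
import time
import urllib.request


def fetch_with_retries(url, dest, fetch=urllib.request.urlretrieve,
                       retries=3, delay=2.0):
    """Download url to dest, skipping finished files and retrying on errors.

    Hypothetical helper. `fetch` is any callable taking (url, dest); it
    defaults to urllib.request.urlretrieve.
    """
    if os.path.exists(dest) and os.path.getsize(dest) > 0:
        return True  # already downloaded in an earlier run
    for attempt in range(1, retries + 1):
        try:
            fetch(url, dest)
            return True
        except OSError as err:  # urllib.error.URLError is a subclass of OSError
            print('attempt {}/{} failed for {}: {}'.format(
                attempt, retries, url, err))
            if os.path.exists(dest):
                os.remove(dest)  # drop a partial file before retrying
            if attempt < retries:
                time.sleep(delay)
    return False
```

The function returns `False` instead of raising once the retries are used up, so a loop over all PMC IDs can record the failure and move on to the next article.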