Open Pxtri2156 opened 3 years ago
After running the .py file, I got this log:
100 article IDs in pmc_ids.txt
Checking PubMed Central servers...
...using US server.
Checking installed tools...
/usr/bin/curl
...tools ok.
Checking PDF permissions for ImageMagick...
...permissions ok.
Downloading https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1236913/pdf ...
But there is only one PDF in ./pdfs. What's the problem?
OK, I changed the download function and can now download the correct PDF files:
    import os
    import string
    import urllib.parse
    import urllib.request

    def download_pdf(url_list, dir_name):
        # The url_list argument is ignored; URLs are rebuilt from pmc_ids.txt.
        pmcid_file = 'pmc_ids.txt'
        with open(pmcid_file, 'r') as f:
            pmcids = f.read().splitlines()
        url_pref = 'https://www.ncbi.nlm.nih.gov/pmc/articles/'
        url_post = '/pdf'
        url_list = [url_pref + pmcid + url_post for pmcid in pmcids]
        # Install a browser-like User-Agent once, before the loop.
        opener = urllib.request.build_opener()
        opener.addheaders = [('User-Agent', 'Mozilla/5.0 3578.98 Safari/537.36')]
        urllib.request.install_opener(opener)
        num = 0
        for url in url_list:
            url = url.replace(' ', '+')
            print(url)
            # The PMC ID is the second-to-last path component of the URL.
            file_name = url.split('/')[-2]
            dest = os.path.join(dir_name, file_name + '.pdf')
            try:
                urllib.request.urlretrieve(
                    urllib.parse.quote(url, safe=string.printable), filename=dest)
            except Exception:
                print('cannot download!!!!!!!!')
                continue
            # Count only the downloads that succeeded.
            num += 1
        print('totally download:', num)
It may stop partway through a run because of network problems. Hope it works for you.
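Since the run can die on a flaky connection, retrying each download a few times might help. This is only a rough sketch: `retry_call`, `attempts`, `delay`, and `sleep` are hypothetical names made up for illustration and are not part of the repository's script.

```python
import time


def retry_call(func, attempts=3, delay=1.0, sleep=time.sleep):
    """Call func() up to `attempts` times, doubling the wait after each failure.

    Hypothetical helper for illustration only.
    """
    if attempts < 1:
        raise ValueError('attempts must be at least 1')
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except Exception as err:  # e.g. network errors raised by urllib
            last_error = err
            # Wait before the next try, but not after the final failure.
            if attempt < attempts - 1:
                sleep(delay * (2 ** attempt))
    raise last_error
```

Inside the download loop, the call could then be wrapped as `retry_call(lambda: urllib.request.urlretrieve(url, filename=dest))`, assuming `dest` holds the output path.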
Hello everyone. When I download all the articles, I get the error below: ` 100 article IDs in pmc_ids.txt
Checking PubMed Central servers...
Request static : 200
...using US server.
Checking installed tools...
/usr/bin/curl
/usr/bin/pdfinfo
...tools ok.
Checking PDF permissions for ImageMagick...
/usr/bin/convert
...permissions ok.
Syntax Warning: May not be a PDF file (continuing anyway) .....................................................................
print('{}KB'.format(os.stat(dest).st_size // 1024))
FileNotFoundError: [Errno 2] No such file or directory: './pdfs/PMC1538887.pdf'
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error (353): Illegal character <2d> in hex string
Dest: ./pdfs/PMC1538887.pdf
Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1538887/pdf
Downloading https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1538887/pdf ...
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
Traceback (most recent call last):
File "download_and_render.py", line 88, in
Syntax Error (353): Illegal character <2d> in hex string
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
Syntax Warning: May not be a PDF file (continuing anyway)
Syntax Error (2): Illegal character <3f> in hex string
Syntax Error (3): Illegal character <78> in hex string
Syntax Error (4): Illegal character <6d> in hex string
..........................................................................
Syntax Error (353): Illegal character <2d> in hex string
Syntax Error: Couldn't find trailer dictionary
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
0 0 0 0 0 0 0 0 --:--:-- --:--:-- --:--:-- 0
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
PMC_dataset/article-regions$
0 0 0 0 0 0 0 0 --:--:-- 0:00:01 --:--:-- 0
100 17200 0 17200 0 0 7116 0 --:--:-- 0:00:02 --:--:-- 19713
`
The error is: FileNotFoundError: [Errno 2] No such file or directory: './pdfs/PMC1538887.pdf'
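The "May not be a PDF file" warnings, together with the illegal characters `<3f>`, `<78>`, `<6d>` (ASCII `?`, `x`, `m`), would be consistent with the server returning an XML or HTML page instead of a PDF, though that is a guess from the log. One way to check is to test each downloaded file before it is measured or rendered. A minimal sketch, assuming a hypothetical helper named `looks_like_pdf` that is not part of download_and_render.py:

```python
import os


def looks_like_pdf(path):
    """Return True if `path` exists and begins with the PDF header bytes.

    Hypothetical helper for illustration only.
    """
    if not os.path.isfile(path):
        return False
    with open(path, 'rb') as f:
        # A PDF file normally starts with the marker "%PDF-".
        return f.read(5) == b'%PDF-'
```

A file that is missing or fails this check could be skipped or downloaded again, which would avoid calling `os.stat` on a path that does not exist.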