billgreenwald / Pubmed-Batch-Download

Batch download articles based on PMID (Pubmed ID)
MIT License

Damaged PDF & fetching stops #6

Closed nicolaycunha closed 6 years ago

nicolaycunha commented 6 years ago

Hi, I am trying to use the code with a couple of PMIDs. The PDFs download successfully, but they come out damaged, and after 14 entries the script gives the following error message:

Traceback (most recent call last):
  File "fetch_pdfs.py", line 252, in <module>
    if type(e)==requests.ConnectionError and '104' in e[0][1][0]:
TypeError: argument of type 'int' is not iterable

After this message, the fetching is interrupted. Below are the PMIDs I am trying.

python fetch_pdfs.py -pmids 26633170,23682673,25040501,24628937,27174497,27547345,22610656,23858657,24998529,27859194,26991916,26742956,22268844,27547334,16299005,26658101,24458119,24850527,25859332,17522077,22739706,24628897,24232381,23127184,27329944,25480711,25253712,20574680,19333624,24131615,14761053,25704464,26507115,25754608,26655157,28308115,27551374,21777248,24372301,28568420,28309130,22711559,19874617,27777723,26199373,22680336,16004288,26949084,23624924,23339242,22074778,19763848,22666114,27680661,19324745,24138122,23603953,21833640,25002701,24933810,18724731,26070638,28312167,17750894,18707428,16670987,25664897,4066794,21546431,19663992,12803910,24800839,20636902,27038018,25948688,25165527,27648239,24266037,26482059,18593688,27146894,11222244,21636492,23002269,10860912,26987770,25002705,24743567,28311501,23294438,28310242,21237765,23134452,27870050,24372761,21653461,19704675,28565336,19367315,15271088,19910534,23963860,12858276,20576739,28564966,28565464,24287813,25272164,21484398,25347541,28313987,25130655,26817765,22151952,15255098,22652419,21134082,17652341,26573095,24766107,20408751,17711841,28313163,26578721,18289396,28547066,19131378,19121112,19324662,24317664,11080108,27767040,10205070,28310724,22805583,24193000,19412706,21642227,26878831,21632396,26421845,28309726,20592812,25903102,19218583,19001427,21789530,20345818,20047872,28310543,24464206,10568781,20676914,22438504,10431223,20954889,28547089,22519776,11607153,12659040,22156401,19429671,15596454,16371444,19398446,27851814,27714795,28307360,28308328,12437082,19654608,19050951,19516075,28593665,19153768,21636399,22476079,21170748,19126635,28312388,11539321,19218577,16615203,9299797,28565680,14652688,16133196,18637960,16866959,16593140,28564904,28568165,21669711,29673012,18761503,21669696,16866958,14551828,20961923,17879195,17416914,28312462,19443460,18707369,21755150,21636368,17427121,17300430,21665640,28698790,28309456,27864223,28312030,15696741,11222245,28311108,21642173,29880773,17203434,28877178,18426489,20952615,19739370,18031491,29134400,28568788,19158031,29280577,28313078,28428861,21653420,15696748,15280895,11353709,10860920,12207039,28626040,15212378,29532921,28204486,29765587,28960844,29658115,29346506,29468326,28904775,28428199,27915467,28798863,28135774,28647753,28861252,28822496,29947735,29917223,28079938,28504871,29464694,29893413,29878057,29878055,29882762,29445017 -maxRetries 3

Any thoughts are much appreciated

billgreenwald commented 6 years ago

From the log, do you know which pmid was running when the error happened?

nicolaycunha commented 6 years ago

below is the log output

Trying to fetch pmid 26633170
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 26633170 succeeded
Trying to fetch pmid 23682673
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 23682673 succeeded
Trying to fetch pmid 25040501
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 25040501 succeeded
Trying to fetch pmid 24628937
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 24628937 succeeded
Trying to fetch pmid 27174497
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 27174497 succeeded
Trying to fetch pmid 27547345
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 27547345 succeeded
Trying to fetch pmid 22610656
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 22610656 succeeded
Trying to fetch pmid 23858657
Reprint 23858657 cannot be fetched as pubmed does not have a link to its pdf.
Trying to fetch pmid 24998529
Reprint 24998529 cannot be fetched as pubmed does not have a link to its pdf.
Trying to fetch pmid 27859194
Reprint 27859194 cannot be fetched as pubmed does not have a link to its pdf.
Trying to fetch pmid 26991916
Reprint 26991916 cannot be fetched as pubmed does not have a link to its pdf.

Here is the PMID

Trying to fetch pmid 26742956
Traceback (most recent call last):
  File "fetch_pdfs.py", line 252, in <module>
    if type(e)==requests.ConnectionError and '104' in e[0][1][0]:
TypeError: argument of type 'int' is not iterable

billgreenwald commented 6 years ago

The error is handled fine on my end, but I added more defensive handling just in case. Can you give it a try? Also, which versions of Python and requests are you running?

Side note: I could fetch 24998529 and 26991916, so I am not sure why your run gives you that message. If you are familiar enough with Python to add a print statement or two in specific places, let me know and we can debug that on your end.

billgreenwald commented 6 years ago

If the new code doesn't work, I just added a .yml file that can be used with Anaconda to create an environment with the correct versions and packages needed to run the program.

nicolaycunha commented 6 years ago

Hi Bill,

I tried the new code and the .yml file with Anaconda (inside a Docker environment), but both error types persist. For instance, the PDFs for PMIDs 23682673, 22610656, 24628937 and 25040501 came out damaged. Please see the output below.


(base) root@8914b8bb01b3:/data# python fetch_pdfs.py -pmf PMID_all.txt -out PDF -maxRetries 3 
Output directory of agora_vai did not exist.  Created the directory.
Trying to fetch pmid 25211280
Trying genericCitationLabelled
** fetching reprint using the 'generic citation labelled' finder...
** fetching of reprint 25211280 succeeded
Trying to fetch pmid 26633170
Trying genericCitationLabelled
** fetching reprint using the 'generic citation labelled' finder...
** fetching of reprint 26633170 succeeded
Trying to fetch pmid 23682673
Trying genericCitationLabelled
** fetching reprint using the 'generic citation labelled' finder...
** fetching of reprint 23682673 succeeded
Trying to fetch pmid 25040501
Trying genericCitationLabelled
** fetching reprint using the 'generic citation labelled' finder...
** fetching of reprint 25040501 succeeded
Trying to fetch pmid 24628937
Trying genericCitationLabelled
** fetching reprint using the 'generic citation labelled' finder...
** fetching of reprint 24628937 succeeded
Trying to fetch pmid 27174497
Trying genericCitationLabelled
** fetching reprint using the 'generic citation labelled' finder...
** fetching of reprint 27174497 succeeded
Trying to fetch pmid 27859194
** fetching of reprint 27859194 failed from error ('Connection aborted.', BadStatusLine("''",))
Trying to fetch pmid 22610656
Trying genericCitationLabelled
** fetching reprint using the 'generic citation labelled' finder...
** fetching of reprint 22610656 succeeded
Trying to fetch pmid 23858657
 ** Reprint 23858657 cannot be fetched as pubmed does not have a link to its pdf.
Trying to fetch pmid 27547345
Trying genericCitationLabelled
** fetching reprint using the 'generic citation labelled' finder...
** fetching of reprint 27547345 succeeded
Trying to fetch pmid 24998529
Trying genericCitationLabelled
Trying pubmed_central
Trying science_direct
** fetching reprint using the 'science_direct' finder...
** fetching of reprint 24998529 succeeded
Trying to fetch pmid 26482654
** fetching of reprint 26482654 failed from error ('Connection aborted.', BadStatusLine("''",))
Trying to fetch pmid 26991916
Trying genericCitationLabelled
** fetching reprint using the 'generic citation labelled' finder...
** fetching of reprint 26991916 succeeded
Trying to fetch pmid 26742956
Traceback (most recent call last):
  File "fetch_pdfs.py", line 252, in <module>
    if '104' in e[0][1][0]:
IndexError: tuple index out of range
(base) root@8914b8bb01b3:/data#

billgreenwald commented 6 years ago

I just tried to download the PDFs that came out damaged, and they downloaded fine for me. Are you on a system that has access to the journals? I suspect that when the script tries to download a file without journal access, it stores a non-PDF response as the output, which then looks like a corrupted PDF.
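One way to check this hypothesis is to look at the first bytes of each saved file: a real PDF starts with the `%PDF-` header, while a page saved under a .pdf name does not. The following is a minimal Python 3 sketch, not part of fetch_pdfs.py; the names `looks_like_pdf` and `find_damaged_pdfs` are hypothetical.

```python
import os


def looks_like_pdf(data):
    """Return True if the bytes begin with the PDF header, ignoring leading whitespace."""
    return data[:1024].lstrip().startswith(b"%PDF-")


def find_damaged_pdfs(folder):
    """List the .pdf files in `folder` whose content does not start like a PDF."""
    damaged = []
    for name in sorted(os.listdir(folder)):
        if not name.lower().endswith(".pdf"):
            continue
        # Only the first bytes are needed to inspect the header.
        with open(os.path.join(folder, name), "rb") as handle:
            if not looks_like_pdf(handle.read(1024)):
                damaged.append(name)
    return damaged
```

Calling `find_damaged_pdfs("PDF")` on the output folder used in this thread would list the files worth re-fetching from a network with journal access.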

Separately, I added a quick check to address the "tuple index out of range" error you are getting. Let me know if that fixes it.

It looks like conda environments aren't platform agnostic, so not being able to install that package should be OK as long as you install the others, since it's just a dependency needed by the others. Perhaps running conda install libgcc could do it?

nicolaycunha commented 6 years ago

Hi Bill,

I changed networks and now the PDFs are coming through fine. However, after a few PDFs are downloaded, the error appears. I am using a file with the PMIDs, and when the error shows up I remove the problematic PMID, but then the error occurs with a different PMID.

Regarding the conda environment, I tried conda install libgcc, but it could not find the library.

Below are the error messages:

(base) root@8914b8bb01b3:/data# python fetch_pdfs.py -pmf PMID_all.txt -out PDF -maxRetries 3
Output directory of PDF did not exist. Created the directory.
Trying to fetch pmid 25211280
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 25211280 succeeded
Trying to fetch pmid 26633170
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 26633170 succeeded
Trying to fetch pmid 23682673
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 23682673 succeeded
Trying to fetch pmid 25040501
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 25040501 succeeded
Trying to fetch pmid 24628937
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 24628937 succeeded
Trying to fetch pmid 27174497
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 27174497 succeeded
Trying to fetch pmid 27859194
Reprint 27859194 cannot be fetched as pubmed does not have a link to its pdf.
Trying to fetch pmid 22610656
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 22610656 succeeded
Trying to fetch pmid 23858657
Reprint 23858657 cannot be fetched as pubmed does not have a link to its pdf.
Trying to fetch pmid 27547345
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 27547345 succeeded
Trying to fetch pmid 24998529
** fetching of reprint 24998529 failed from error HTTPSConnectionPool(host='linkinghub.elsevier.com', port=443): Read timed out. (read timeout=5)
Traceback (most recent call last):
  File "fetch_pdfs.py", line 252, in <module>
    if len(e) >=3 and '104' in e[0][1][0]:
TypeError: object of type 'ConnectionError' has no len()
(base) root@8914b8bb01b3:/data# python fetch_pdfs.py -pmf PMID_all.txt -out PDF -maxRetries 3
Trying to fetch pmid 25211280
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 25211280 succeeded
Trying to fetch pmid 26633170
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 26633170 succeeded
Trying to fetch pmid 23682673
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 23682673 succeeded
Trying to fetch pmid 25040501
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 25040501 succeeded
Trying to fetch pmid 24628937
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 24628937 succeeded
Trying to fetch pmid 27174497
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 27174497 succeeded
Trying to fetch pmid 27859194
Traceback (most recent call last):
  File "fetch_pdfs.py", line 252, in <module>
    if len(e) >=3 and '104' in e[0][1][0]:
TypeError: object of type 'ConnectionError' has no len()
(base) root@8914b8bb01b3:/data# python fetch_pdfs.py -pmf PMID_all.txt -out PDF -maxRetries 3
Trying to fetch pmid 25211280
Reprint #25211280 already downloaded and in folder; skipping.
Trying to fetch pmid 26633170
Reprint #26633170 already downloaded and in folder; skipping.
Trying to fetch pmid 23682673
Reprint #23682673 already downloaded and in folder; skipping.
Trying to fetch pmid 25040501
Reprint #25040501 already downloaded and in folder; skipping.
Trying to fetch pmid 24628937
Reprint #24628937 already downloaded and in folder; skipping.
Trying to fetch pmid 27174497
Reprint #27174497 already downloaded and in folder; skipping.
Trying to fetch pmid 27859194
Reprint 27859194 cannot be fetched as pubmed does not have a link to its pdf.
Trying to fetch pmid 22610656
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 22610656 succeeded
Trying to fetch pmid 23858657
Reprint 23858657 cannot be fetched as pubmed does not have a link to its pdf.
Trying to fetch pmid 27547345
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 27547345 succeeded
Trying to fetch pmid 24998529
Trying genericCitationLabelled
Trying pubmed_central
Trying science_direct
fetching reprint using the 'science_direct' finder...
fetching of reprint 24998529 succeeded
Trying to fetch pmid 26482654
Traceback (most recent call last):
  File "fetch_pdfs.py", line 252, in <module>
    if len(e) >=3 and '104' in e[0][1][0]:
TypeError: object of type 'ConnectionError' has no len()
(base) root@8914b8bb01b3:/data# python fetch_pdfs.py -pmf PMID_all.txt -out PDF -maxRetries 3
Trying to fetch pmid 25211280
Reprint #25211280 already downloaded and in folder; skipping.
Trying to fetch pmid 26633170
Reprint #26633170 already downloaded and in folder; skipping.
Trying to fetch pmid 23682673
Reprint #23682673 already downloaded and in folder; skipping.
Trying to fetch pmid 25040501
Reprint #25040501 already downloaded and in folder; skipping.
Trying to fetch pmid 24628937
Reprint #24628937 already downloaded and in folder; skipping.
Trying to fetch pmid 27174497
Reprint #27174497 already downloaded and in folder; skipping.
Trying to fetch pmid 27859194
Reprint 27859194 cannot be fetched as pubmed does not have a link to its pdf.
Trying to fetch pmid 22610656
Reprint #22610656 already downloaded and in folder; skipping.
Trying to fetch pmid 23858657
Traceback (most recent call last):
  File "fetch_pdfs.py", line 252, in <module>
    if len(e) >=3 and '104' in e[0][1][0]:
TypeError: object of type 'ConnectionError' has no len()
(base) root@8914b8bb01b3:/data# python fetch_pdfs.py -pmf PMID_all.txt -out PDF -maxRetries 3
Trying to fetch pmid 25211280
Reprint #25211280 already downloaded and in folder; skipping.
Trying to fetch pmid 26633170
Reprint #26633170 already downloaded and in folder; skipping.
Trying to fetch pmid 23682673
Reprint #23682673 already downloaded and in folder; skipping.
Trying to fetch pmid 25040501
Reprint #25040501 already downloaded and in folder; skipping.
Trying to fetch pmid 24628937
Reprint #24628937 already downloaded and in folder; skipping.
Trying to fetch pmid 27174497
Reprint #27174497 already downloaded and in folder; skipping.
Trying to fetch pmid 27859194
Reprint 27859194 cannot be fetched as pubmed does not have a link to its pdf.
Trying to fetch pmid 22610656
Reprint #22610656 already downloaded and in folder; skipping.
Trying to fetch pmid 23858657
Traceback (most recent call last):
  File "fetch_pdfs.py", line 252, in <module>
    if len(e) >=3 and '104' in e[0][1][0]:
TypeError: object of type 'ConnectionError' has no len()
(base) root@8914b8bb01b3:/data# python fetch_pdfs.py -pmf PMID_all.txt -out PDF -maxRetries 3
Trying to fetch pmid 25211280
Reprint #25211280 already downloaded and in folder; skipping.
Trying to fetch pmid 26633170
Reprint #26633170 already downloaded and in folder; skipping.
Trying to fetch pmid 23682673
Reprint #23682673 already downloaded and in folder; skipping.
Trying to fetch pmid 25040501
Reprint #25040501 already downloaded and in folder; skipping.
Trying to fetch pmid 24628937
Reprint #24628937 already downloaded and in folder; skipping.
Trying to fetch pmid 27174497
Reprint #27174497 already downloaded and in folder; skipping.
Trying to fetch pmid 27859194
Traceback (most recent call last):
  File "fetch_pdfs.py", line 252, in <module>
    if len(e) >=3 and '104' in e[0][1][0]:
TypeError: object of type 'ConnectionError' has no len()
(base) root@8914b8bb01b3:/data# python fetch_pdfs.py -pmf PMID_all.txt -out PDF -maxRetries 3
Trying to fetch pmid 25211280
Reprint #25211280 already downloaded and in folder; skipping.
Trying to fetch pmid 26633170
Reprint #26633170 already downloaded and in folder; skipping.
Trying to fetch pmid 23682673
Reprint #23682673 already downloaded and in folder; skipping.
Trying to fetch pmid 25040501
Reprint #25040501 already downloaded and in folder; skipping.
Trying to fetch pmid 24628937
Reprint #24628937 already downloaded and in folder; skipping.
Trying to fetch pmid 27174497
Reprint #27174497 already downloaded and in folder; skipping.
Trying to fetch pmid 27859194
Reprint 27859194 cannot be fetched as pubmed does not have a link to its pdf.
Trying to fetch pmid 22610656
Reprint #22610656 already downloaded and in folder; skipping.
Trying to fetch pmid 23858657
Reprint 23858657 cannot be fetched as pubmed does not have a link to its pdf.
Trying to fetch pmid 27547345
Reprint #27547345 already downloaded and in folder; skipping.
Trying to fetch pmid 24998529
Trying genericCitationLabelled
Trying pubmed_central
Trying science_direct
fetching reprint using the 'science_direct' finder...
fetching of reprint 24998529 succeeded
Trying to fetch pmid 26482654
Traceback (most recent call last):
  File "fetch_pdfs.py", line 252, in <module>
    if len(e) >=3 and '104' in e[0][1][0]:
TypeError: object of type 'ConnectionError' has no len()
(base) root@8914b8bb01b3:/data# python fetch_pdfs.py -pmf PMID_all.txt -out PDF -maxRetries 3
Trying to fetch pmid 25211280
Reprint #25211280 already downloaded and in folder; skipping.
Trying to fetch pmid 26633170
Reprint #26633170 already downloaded and in folder; skipping.
Trying to fetch pmid 23682673
Reprint #23682673 already downloaded and in folder; skipping.
Trying to fetch pmid 25040501
Reprint #25040501 already downloaded and in folder; skipping.
Trying to fetch pmid 24628937
Reprint #24628937 already downloaded and in folder; skipping.
Trying to fetch pmid 27174497
Reprint #27174497 already downloaded and in folder; skipping.
Trying to fetch pmid 27859194
Traceback (most recent call last):
  File "fetch_pdfs.py", line 252, in <module>
    if len(e) >=3 and '104' in e[0][1][0]:
TypeError: object of type 'ConnectionError' has no len()
(base) root@8914b8bb01b3:/data# python fetch_pdfs.py -pmf PMID_all.txt -out PDF -maxRetries 3
Trying to fetch pmid 25211280
Reprint #25211280 already downloaded and in folder; skipping.
Trying to fetch pmid 26633170
Reprint #26633170 already downloaded and in folder; skipping.
Trying to fetch pmid 23682673
Reprint #23682673 already downloaded and in folder; skipping.
Trying to fetch pmid 25040501
Reprint #25040501 already downloaded and in folder; skipping.
Trying to fetch pmid 24628937
Reprint #24628937 already downloaded and in folder; skipping.
Trying to fetch pmid 27174497
Reprint #27174497 already downloaded and in folder; skipping.
Trying to fetch pmid 22610656
Reprint #22610656 already downloaded and in folder; skipping.
Trying to fetch pmid 23858657
Reprint 23858657 cannot be fetched as pubmed does not have a link to its pdf.
Trying to fetch pmid 27547345
Reprint #27547345 already downloaded and in folder; skipping.
Trying to fetch pmid 26482654
Traceback (most recent call last):
  File "fetch_pdfs.py", line 252, in <module>
    if len(e) >=3 and '104' in e[0][1][0]:
TypeError: object of type 'ConnectionError' has no len()
(base) root@8914b8bb01b3:/data# python fetch_pdfs.py -pmf PMID_all.txt -out PDF -maxRetries 3
Trying to fetch pmid 25211280
Reprint #25211280 already downloaded and in folder; skipping.
Trying to fetch pmid 26633170
Reprint #26633170 already downloaded and in folder; skipping.
Trying to fetch pmid 23682673
Reprint #23682673 already downloaded and in folder; skipping.
Trying to fetch pmid 25040501
Reprint #25040501 already downloaded and in folder; skipping.
Trying to fetch pmid 24628937
Reprint #24628937 already downloaded and in folder; skipping.
Trying to fetch pmid 27174497
Reprint #27174497 already downloaded and in folder; skipping.
Trying to fetch pmid 22610656
Reprint #22610656 already downloaded and in folder; skipping.
Trying to fetch pmid 23858657
Traceback (most recent call last):
  File "fetch_pdfs.py", line 252, in <module>
    if len(e) >=3 and '104' in e[0][1][0]:
TypeError: object of type 'ConnectionError' has no len()
(base) root@8914b8bb01b3:/data# python fetch_pdfs.py -pmf PMID_all.txt -out PDF -maxRetries 3
Trying to fetch pmid 25211280
Reprint #25211280 already downloaded and in folder; skipping.
Trying to fetch pmid 26633170
Reprint #26633170 already downloaded and in folder; skipping.
Trying to fetch pmid 23682673
Reprint #23682673 already downloaded and in folder; skipping.
Trying to fetch pmid 25040501
Reprint #25040501 already downloaded and in folder; skipping.
Trying to fetch pmid 24628937
Reprint #24628937 already downloaded and in folder; skipping.
Trying to fetch pmid 27174497
Reprint #27174497 already downloaded and in folder; skipping.
Trying to fetch pmid 22610656
Reprint #22610656 already downloaded and in folder; skipping.
Trying to fetch pmid 23858657
Reprint 23858657 cannot be fetched as pubmed does not have a link to its pdf.
Trying to fetch pmid 27547345
Reprint #27547345 already downloaded and in folder; skipping.
Trying to fetch pmid 26482654
Traceback (most recent call last):
  File "fetch_pdfs.py", line 252, in <module>
    if len(e) >=3 and '104' in e[0][1][0]:
TypeError: object of type 'ConnectionError' has no len()
(base) root@8914b8bb01b3:/data# python fetch_pdfs.py -pmf PMID_all.txt -out PDF -maxRetries 3
Trying to fetch pmid 25211280
Reprint #25211280 already downloaded and in folder; skipping.
Trying to fetch pmid 26633170
Reprint #26633170 already downloaded and in folder; skipping.
Trying to fetch pmid 23682673
Reprint #23682673 already downloaded and in folder; skipping.
Trying to fetch pmid 25040501
Reprint #25040501 already downloaded and in folder; skipping.
Trying to fetch pmid 24628937
Reprint #24628937 already downloaded and in folder; skipping.
Trying to fetch pmid 27174497
Reprint #27174497 already downloaded and in folder; skipping.
Trying to fetch pmid 22610656
Reprint #22610656 already downloaded and in folder; skipping.
Trying to fetch pmid 23858657
Traceback (most recent call last):
  File "fetch_pdfs.py", line 252, in <module>
    if len(e) >=3 and '104' in e[0][1][0]:
TypeError: object of type 'ConnectionError' has no len()
(base) root@8914b8bb01b3:/data# python fetch_pdfs.py -pmf PMID_all.txt -out PDF -maxRetries 3
Trying to fetch pmid 25211280
Reprint #25211280 already downloaded and in folder; skipping.
Trying to fetch pmid 26633170
Reprint #26633170 already downloaded and in folder; skipping.
Trying to fetch pmid 23682673
Reprint #23682673 already downloaded and in folder; skipping.
Trying to fetch pmid 25040501
Reprint #25040501 already downloaded and in folder; skipping.
Trying to fetch pmid 24628937
Reprint #24628937 already downloaded and in folder; skipping.
Trying to fetch pmid 27174497
Reprint #27174497 already downloaded and in folder; skipping.
Trying to fetch pmid 22610656
Reprint #22610656 already downloaded and in folder; skipping.
Trying to fetch pmid 27547345
Reprint #27547345 already downloaded and in folder; skipping.
Trying to fetch pmid 26991916
Trying genericCitationLabelled
fetching reprint using the 'generic citation labelled' finder...
fetching of reprint 26991916 succeeded
Trying to fetch pmid 26742956
Traceback (most recent call last):
  File "fetch_pdfs.py", line 252, in <module>
    if len(e) >=3 and '104' in e[0][1][0]:
TypeError: object of type 'ConnectionError' has no len()

nicolaycunha commented 6 years ago

After removing some PMIDs, this one generated a different error and went into a loop that I had to interrupt via keyboard.

Trying to fetch pmid 26655157
Trying genericCitationLabelled
Trying pubmed_central
Trying science_direct
fetching of reprint 26655157 failed from error Invalid URL '': No schema supplied. Perhaps you meant http://?
Trying genericCitationLabelled
Trying pubmed_central
Trying science_direct
fetching of reprint 26655157 failed from error Invalid URL '': No schema supplied. Perhaps you meant http://?
Trying genericCitationLabelled
Trying pubmed_central
Trying science_direct
fetching of reprint 26655157 failed from error Invalid URL '': No schema supplied. Perhaps you meant http://?
Trying genericCitationLabelled
Trying pubmed_central
Trying science_direct
fetching of reprint 26655157 failed from error Invalid URL '': No schema supplied. Perhaps you meant http://?
Trying genericCitationLabelled
Trying pubmed_central
Trying science_direct
fetching of reprint 26655157 failed from error Invalid URL '': No schema supplied. Perhaps you meant http://?
Trying genericCitationLabelled
Trying pubmed_central
Trying science_direct
fetching of reprint 26655157 failed from error Invalid URL '': No schema supplied. Perhaps you meant http://?
Trying genericCitationLabelled
Trying pubmed_central
Trying science_direct
fetching of reprint 26655157 failed from error Invalid URL '': No schema supplied. Perhaps you meant http://?
Trying genericCitationLabelled
Trying pubmed_central
Trying science_direct
fetching of reprint 26655157 failed from error Invalid URL '': No schema supplied. Perhaps you meant http://?
Trying genericCitationLabelled
Trying pubmed_central
Trying science_direct
fetching of reprint 26655157 failed from error Invalid URL '': No schema supplied. Perhaps you meant http://?
Trying genericCitationLabelled
Trying pubmed_central
Trying science_direct
fetching of reprint 26655157 failed from error Invalid URL '': No schema supplied. Perhaps you meant http://?
Trying genericCitationLabelled
Trying pubmed_central
Trying science_direct
** fetching of reprint 26655157 failed from error Invalid URL '': No schema supplied. Perhaps you meant http://?

billgreenwald commented 6 years ago

Apologies; I handled the bug fix incorrectly. I have changed it and run some test cases, and I think it should work now (though I still can't replicate your error with the int type, so I am not 100% sure).
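For context, the three tracebacks in this thread ("argument of type 'int' is not iterable", "tuple index out of range", "has no len()") all come from indexing into the exception with e[0][1][0], and the shape of that object evidently differs from one failure to the next. Below is a hedged sketch of a check that does not depend on the shape. It assumes Python 3, it is not the fix that was committed, and `is_connection_reset` is a hypothetical helper name. Note that 104 is the Linux value of ECONNRESET, so the sketch compares against `errno.ECONNRESET` instead of the literal.

```python
import errno


def is_connection_reset(exc):
    """Return True if exc, or anything nested inside it, signals a connection reset."""
    seen = set()
    stack = [exc]
    while stack:
        item = stack.pop()
        if id(item) in seen:
            continue
        seen.add(id(item))
        if isinstance(item, BaseException):
            if isinstance(item, ConnectionResetError):
                return True
            if getattr(item, "errno", None) == errno.ECONNRESET:
                return True
            # Wrapped errors often sit in args or in the exception chain.
            stack.extend(item.args)
            for linked in (item.__cause__, item.__context__):
                if linked is not None:
                    stack.append(linked)
        elif isinstance(item, (tuple, list)):
            stack.extend(item)
    return False
```

Because the helper only walks `args` and the exception chain, it returns False for timeouts or bad status lines instead of raising a second error inside the handler.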

I fixed the infinite loop, so that shouldn't be an issue any more. Finally, I wrote a new scraper for uChicagoPress, to grab that particular pdf.
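For readers who hit a similar loop: the repeated "Invalid URL ''" failure reported above can never succeed on retry, so a fetch loop needs both a cap on attempts and an early exit for malformed URLs. The sketch below illustrates that idea under stated assumptions. It is not the change made in the repository, it assumes Python 3, and `has_http_scheme`, `fetch_with_retries` and the `fetch` callable are hypothetical names.

```python
from urllib.parse import urlsplit


def has_http_scheme(url):
    """Return True for URLs with an http(s) scheme and a host part."""
    parts = urlsplit(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def fetch_with_retries(fetch, url, max_retries=3):
    """Call fetch(url) at most max_retries times, then give up with an error."""
    if not has_http_scheme(url):
        # Retrying an empty or schemeless URL cannot help, so fail at once.
        raise ValueError("not a fetchable URL: %r" % (url,))
    last_error = None
    for _attempt in range(max_retries):
        try:
            return fetch(url)
        except OSError as exc:
            # requests' exceptions derive from IOError (OSError in Python 3),
            # so network failures raised by requests would also land here.
            last_error = exc
    raise RuntimeError(
        "gave up on %s after %d attempts" % (url, max_retries)
    ) from last_error
```

With `-maxRetries 3` as in the commands above, a PMID whose link cannot be resolved would be reported once and skipped instead of being retried indefinitely.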

nicolaycunha commented 6 years ago

Hi Bill, sorry for the slow response; I was not able to test the code over the last few days. I ran a short test and everything seems to be working fine now. Many thanks for this! I will close this issue now.