Open rossmounce opened 7 years ago
Are they all from EPMC? And does this affect all the files, or are some correct?
I have downloaded PMC4841245 and it gives a 38 MB PDF which doesn't open. So it looks like there is corruption somewhere.
The header shows it to be a PDF:
"fulltext.pdf" may be a binary file. See it anyway?
%PDF-1.5
%����
28 0 obj
<<
/Length 3925
/Filter /FlateDecode
>>
stream
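One way to probe a file like this is sketched below in Python. It rests on an assumption that nothing in this thread confirms: that the binary PDF content was decoded as text somewhere during download. A round trip like that typically replaces invalid bytes with U+FFFD, which is written back out as the bytes EF BF BD. The function name `inspect_pdf` is hypothetical, and a high count is only suggestive, since a healthy compressed stream can contain that byte sequence by chance.

```python
from pathlib import Path

# UTF-8 encoding of U+FFFD, the Unicode replacement character.
REPLACEMENT = b"\xef\xbf\xbd"


def inspect_pdf(path):
    """Return basic integrity hints for a downloaded PDF (hypothetical helper)."""
    data = Path(path).read_bytes()
    return {
        # Total size on disk, in bytes.
        "size": len(data),
        # A PDF file should begin with the "%PDF-" marker.
        "has_pdf_header": data.startswith(b"%PDF-"),
        # Many occurrences hint that binary data went through a text decoder.
        "replacement_count": data.count(REPLACEMENT),
        # A complete PDF normally carries "%%EOF" near the end of the file.
        "has_eof_marker": b"%%EOF" in data[-1024:],
    }
```

For example, `inspect_pdf("fulltext.pdf")` on the file above would show whether the replacement sequence appears throughout the body or only rarely.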
Are they all from EUPMC? Yes.
The header shows it to be a PDF: Yes.
Some of the PDFs open (for me with evince), and the correct number of pages is shown: some have 3 pages, some have, say, 27. But all the pages are white/blank. Ordinarily I would assume that this is something wrong with my local PDF viewing software, so I also tried viewing these getpapers-downloaded files in the cloud. The cloud software also "sees" them as blank pages, so I think the problem is real.
I have this problem on two independent machines too. Reproducible.
It's not just that specific query either.
Other EUPMC API queries (this one with just 3 open access hits) also give the same problem:
getpapers -q 'Gasteria AND FIRST_PDATE:[2015-01-01 TO 2016-08-20]' -o gasteria --pdf
The downloading of fulltext XML (--xml) and SI (--supp) is unaffected/working fine.
This bug also affects PDFs downloaded from the arXiv API. I tried both sample queries; both return corrupted PDFs, all the same size (~2.1 kB):
getpapers --api arxiv --query 'all:transcriptome' -o arxiv --pdf
getpapers --api arxiv --query 'au:"del maestro" AND ti:checkerboard' -o arxiv --pdf
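When every downloaded file has the same small size, it can help to look at what the files start with. The Python sketch below walks an output folder and reports the size and leading bytes of each PDF. It assumes the files are named `fulltext.pdf` and sit somewhere under the folder passed to `-o`; the function name `summarize_downloads` is hypothetical. A file that does not begin with `%PDF-` may be something else entirely, such as an error page, though that is a guess and not something established in this thread.

```python
from pathlib import Path


def summarize_downloads(output_dir, name="fulltext.pdf"):
    """List (path, size, looks_like_pdf, first_bytes) for each matching file."""
    rows = []
    # rglob searches output_dir and all subfolders for files with this name.
    for path in sorted(Path(output_dir).rglob(name)):
        with path.open("rb") as fh:
            head = fh.read(16)  # enough to see the magic bytes
        rows.append((str(path), path.stat().st_size, head.startswith(b"%PDF-"), head))
    return rows


if __name__ == "__main__":
    # "arxiv" matches the -o value used in the queries above.
    for row in summarize_downloads("arxiv"):
        print(row)
```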
Just to say, I also appear to be getting blank-page PDFs with getpapers on Windows 8.1. This problem is not confined to Linux installations.
For those who still have this issue: take a look at https://github.com/ContentMine/getpapers/issues/152 and the commit https://github.com/ContentMine/getpapers/commit/99b93d857470f7a9aeb344b4ede9273044bfef7e that resolved it - it may help resolve this bug too. Please give it a try and post your experiences.
Very bizarre. Getpapers appears to be downloading PDF files of the right size for me (they are not 0-byte files), but when I open them they are completely blank. The right number of pages, but every page is blank. Nor is it a problem with my local PDF viewing software: cloud PDF viewing services also show these PDF files as blank pages despite their MB file sizes.
I have zipped up the entire output project folder so you can inspect the files yourself (only 12 'hits' for the search): https://github.com/rossmounce/tmpfilestorage/raw/master/testaardvark.zip