ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

Corrupted / blank page PDF downloads #145

Open rossmounce opened 7 years ago

rossmounce commented 7 years ago

Very bizarre. getpapers appears to be downloading PDF files of the right size for me (they are not 0-byte files), but when I open them they are completely blank. The right number of pages, but every page is blank. Nor is it a problem with my local PDF viewing software: cloud PDF viewing services also show these PDF files as blank pages despite their MB file sizes.

I have zipped up the entire output project folder so you can inspect the files yourself (only 12 'hits' for the search): https://github.com/rossmounce/tmpfilestorage/raw/master/testaardvark.zip

ross@ross-envy:~/workspace/contentmine/teststuff$ node --version
v4.0.0
ross@ross-envy:~/workspace/contentmine/teststuff$ npm version
{ npm: '3.10.8',
  ares: '1.10.1-DEV',
  http_parser: '2.5.0',
  modules: '46',
  node: '4.0.0',
  openssl: '1.0.2d',
  uv: '1.7.3',
  v8: '4.5.103.30',
  zlib: '1.2.8' }
ross@ross-envy:~/workspace/contentmine/teststuff$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04 LTS
Release:    16.04
Codename:   xenial
ross@ross-envy:~/workspace/contentmine/teststuff$ getpapers -V
0.4.10
ross@ross-envy:~/workspace/contentmine/teststuff$ getpapers -q 'aardvark AND FIRST_PDATE:[2016-01-01 TO 2016-12-01]' -o testaardvark --pdf
info: Searching using eupmc API
info: Found 12 open access results
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
info: Downloading fulltext PDF files
Downloading files [=======================] 100% (12/12) [1.6s elapsed, eta 0.0]
info: All downloads succeeded!
ross@ross-envy:~/workspace/contentmine/teststuff$ tree testaardvark
testaardvark
├── eupmc_fulltext_html_urls.txt
├── eupmc_results.json
├── PMC4731086
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC4798954
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC4841245
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC4920337
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC4924314
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC4965448
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC4973251
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC4982594
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC5025827
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC5028775
│   ├── eupmc_result.json
│   └── fulltext.pdf
├── PMC5061548
│   ├── eupmc_result.json
│   └── fulltext.pdf
└── PMC5089389
    ├── eupmc_result.json
    └── fulltext.pdf

12 directories, 26 files

petermr commented 7 years ago

Are they all from EPMC? And are all of the files affected, or are some correct?

I have downloaded PMC4841245 and it gives a PDF of 38 MB which doesn't open, so it looks like there is corruption somewhere.

petermr commented 7 years ago

The header shows it to be a PDF:

"fulltext.pdf" may be a binary file.  See it anyway? 
%PDF-1.5
%����
28 0 obj
<<
/Length 3925      
/Filter /FlateDecode
>>
stream

rossmounce commented 7 years ago

Are they all from EUPMC? Yes.

The header shows it to be a PDF: Yes.

Some of the PDFs open (for me, with evince) and show the correct number of pages: some are 3 pages, some are, say, 27 pages. But all the pages are white/blank. Ordinarily I would assume that something is wrong with my local PDF viewing software, so I also tried viewing these getpapers-downloaded files in the cloud. The cloud software also "sees" them as blank pages, so I think the problem is real.
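For anyone who wants to check their own files at the byte level rather than in a viewer, here is a minimal Python sketch. It rests on a hypothesis that is not confirmed anywhere in this thread: that the binary PDF data was decoded as text at some point during download, which would leave UTF-8 replacement-character sequences (bytes EF BF BD) scattered through the file while keeping the page structure intact. The function names `inspect_pdf_bytes` and `inspect_pdf_file` are made up for illustration and are not part of getpapers.

```python
from pathlib import Path

# UTF-8 encoding of U+FFFD, the character a decoder substitutes for invalid bytes.
REPLACEMENT = "\ufffd".encode("utf-8")  # b"\xef\xbf\xbd"


def inspect_pdf_bytes(data: bytes) -> dict:
    """Report simple byte-level signs of a damaged PDF (hypothetical helper)."""
    return {
        "size": len(data),
        # A PDF normally starts with "%PDF-".
        "has_pdf_header": data.startswith(b"%PDF-"),
        # The end-of-file marker is normally found near the end of the file.
        "has_eof_marker": b"%%EOF" in data[-1024:],
        # Assumption: a large count suggests binary data went through a text
        # decoder. A handful can occur by chance in compressed streams.
        "replacement_chars": data.count(REPLACEMENT),
    }


def inspect_pdf_file(path: str) -> dict:
    """Read a file from disk and run the same checks on it."""
    return inspect_pdf_bytes(Path(path).read_bytes())
```

Calling `inspect_pdf_file("testaardvark/PMC4841245/fulltext.pdf")` would then show whether the header is intact and how many replacement sequences the file contains. A high count would support the text-decoding hypothesis; a count near zero would point elsewhere.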

I have this problem on two independent machines too. Reproducible.

It's not just that specific query either.

Other EUPMC API queries (this one with just 3 open access hits) also give the same problem:

getpapers -q 'Gasteria AND FIRST_PDATE:[2015-01-01 TO 2016-08-20]' -o gasteria --pdf

The downloading of fulltext XML (--xml) and SI (--supp) is unaffected/working fine.

This bug also affects PDFs downloaded from the arXiv API. I tried both sample queries, and both return corrupted PDFs, all the same size (~2.1 kB):

getpapers --api arxiv --query 'all:transcriptome' -o arxiv --pdf
getpapers --api arxiv --query 'au:"del maestro" AND ti:checkerboard' -o arxiv --pdf 
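Since the arXiv downloads all come out at the same small size, it may be worth checking whether they are PDFs at all. The Python sketch below walks a getpapers output directory and reports the size and first bytes of each download. It assumes the files are named `fulltext.pdf`, as in the EUPMC output tree above; whether the arXiv output uses the same name is an assumption, so the name is a parameter. `summarize_downloads` and `looks_like_pdf` are hypothetical helper names, not getpapers functions.

```python
from pathlib import Path


def summarize_downloads(output_dir: str, filename: str = "fulltext.pdf") -> list:
    """Return (relative path, size in bytes, first 16 bytes) per downloaded file."""
    root = Path(output_dir)
    rows = []
    # rglob searches every subdirectory, e.g. one folder per paper ID.
    for pdf in sorted(root.rglob(filename)):
        with pdf.open("rb") as handle:
            head = handle.read(16)
        rows.append((str(pdf.relative_to(root)), pdf.stat().st_size, head))
    return rows


def looks_like_pdf(head: bytes) -> bool:
    """A PDF normally begins with the bytes '%PDF-'."""
    return head.startswith(b"%PDF-")
```

Running `summarize_downloads("arxiv")` and applying `looks_like_pdf` to the third element of each row would show whether the ~2.1 kB files begin with a PDF header or with something else, such as an HTML page.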

rossmounce commented 7 years ago

Just to say, I also appear to be getting blank-page PDFs with getpapers on Windows 8.1. This problem is not confined to Linux installations.

sedimentation-fault commented 7 years ago

For those who still have this issue: take a look at https://github.com/ContentMine/getpapers/issues/152 and the commit https://github.com/ContentMine/getpapers/commit/99b93d857470f7a9aeb344b4ede9273044bfef7e that resolved it - it may help resolve this bug too. Please give it a try and post your experiences.