ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

PDF corrupt when downloaded with getpapers, O.K. when downloaded directly #152

Closed sedimentation-fault closed 7 years ago

sedimentation-fault commented 7 years ago

I am totally new to getpapers, so this may be just a misunderstanding on my part of some of the inner workings of the ContentMine toolchain (even though it seems to be pretty straightforward...).

Here is the problem: when I use getpapers to download a paper, say with the command:

getpapers --query 'PMCID:"PMC5293196" JOURNAL:"PLOS ONE"' --outdir network-analysis -p

I get:

info: Searching using eupmc API
info: Found 1 open access results
warn: This version of getpapers wasn't built with this version of the EuPMC api in mind
warn: getpapers EuPMCVersion: 4.5.3.2 vs. 5.0.1 reported by api
Retrieving results [==============================] 100% (eta 0.0s)
info: Done collecting results
info: Saving result metadata
info: Full EUPMC result metadata written to eupmc_results.json
info: Individual EUPMC result metadata records written
info: Extracting fulltext HTML URL list (may not be available for all articles)
info: Fulltext HTML URL list written to eupmc_fulltext_html_urls.txt
info: Downloading fulltext PDF files
Downloading files [==============================] 100% (1/1) [0.0s elapsed, eta 0.0]
info: All downloads succeeded!

which looks pretty good to me (the warnings seem to be harmless in my case). Everything is there and the PDF seems to be in

PMC5293196/fulltext.pdf

However, when I try to open it in acroread, the reader tells me that the file is corrupt, that some font is missing etc. - and the paper is practically unreadable!

If I try to download it directly from the URL

http://europepmc.org/articles/PMC5293196?pdf=render

which is found in eupmc_results.json as the download URL for this ID, all is OK and the paper is displayed with absolutely no errors...

There is indeed a difference in the MD5 sums of the two versions:

md5sum PMC5293196/fulltext.pdf 
b815dff20f24ddbb035b4bfc3743d7f5  PMC5293196/fulltext.pdf
md5sum PMC5293196.pdf
5771a39765664bedb8ad8815b7980921  PMC5293196.pdf

The latter (PMC5293196.pdf) is the 'good' one, downloaded directly from http://europepmc.org/articles/PMC5293196?pdf=render with curl, the former (PMC5293196/fulltext.pdf) is the 'bad' one downloaded by getpapers.

It does not seem to be a locale issue, because the same happens to a user with

LANG=en_US
LC_CTYPE="en_US"
LC_NUMERIC="en_US"
LC_TIME="en_US"
LC_COLLATE=C

and to a user with a non-US locale.

I have nodejs 4.4.6 and getpapers is installed with

npm install --global getpapers
npm WARN deprecated node-uuid@1.4.7: use uuid module instead
npm WARN deprecated tough-cookie@2.2.2: ReDoS vulnerability parsing Set-Cookie https://nodesecurity.io/advisories/130
/usr/bin/getpapers -> /usr/lib/node_modules/getpapers/bin/getpapers.js
getpapers@0.4.12 /usr/lib/node_modules/getpapers
├── progress@1.1.8
├── version_compare@0.0.3
├── commander@2.7.1 (graceful-readlink@1.0.1)
├── chalk@1.0.0 (escape-string-regexp@1.0.5, ansi-styles@2.2.1, supports-color@1.3.1, strip-ansi@2.0.1, has-ansi@1.0.3)
├── mkdirp@0.5.1 (minimist@0.0.8)
├── sanitize-filename@1.6.1 (truncate-utf8-bytes@1.0.2)
├── got@2.9.2 (lowercase-keys@1.0.0, timed-out@2.0.0, prepend-http@1.0.4, object-assign@2.1.1, is-stream@1.1.0, infinity-agent@2.0.3, statuses@1.3.1, nested-error-stacks@1.0.2, read-all-stream@2.2.0, duplexify@3.5.0)
├── matched@0.4.4 (fs-exists-sync@0.1.0, is-valid-glob@0.3.0, arr-union@3.1.0, async-array-reduce@0.2.1, extend-shallow@2.0.1, has-glob@0.1.1, glob@7.1.1, lazy-cache@2.0.2, resolve-dir@0.1.1)
├── winston@1.0.2 (cycle@1.0.3, stack-trace@0.0.9, eyes@0.1.8, isstream@0.1.2, async@1.0.0, pkginfo@0.3.1, colors@1.0.3)
├── restler@3.4.0 (yaml@0.2.3, qs@1.2.0, iconv-lite@0.2.11, xml2js@0.4.0)
├── lodash@3.10.1
├── xml2js@0.4.17 (sax@1.2.2, xmlbuilder@4.2.1)
├── requestretry@1.12.0 (extend@3.0.0, when@3.7.8, request@2.80.0, lodash@4.17.4)
└── crossref@0.1.2 (got@5.1.0, request@2.65.0)

What is going on here? Can you please shed some light?

sedimentation-fault commented 7 years ago

I have some new information for you:

First, this bug looks like a duplicate of

https://github.com/ContentMine/getpapers/issues/145

At least, as far as I can see, the symptoms are the same: corrupt PDFs.

Second, I have done some analysis of the contents of the 'good' and 'bad' PDFs of my original post above. The PDFs differ on characters outside the ASCII character set. This is an encoding issue and is known under the name of Mojibake (pronounced "moji-bah-keh"), see:

https://en.wikipedia.org/wiki/Mojibake#English

I say this because, for example, the "standard garbage bytes" that have to be near the beginning of each PDF according to the PDF specs, namely:

%âãÏÓ

appear correctly in the 'good' PDF, but are shown as:

%����

in the 'bad' one. The same is true for lines like

(±)Tj

in the 'good' PDF that appear as

(�)Tj

in the 'bad' one.

I suspect that some tool inside the ContentMine toolchain enforces an 'unnatural' encoding. This encoding transforms non-ASCII characters and we get corruption.

Theoretically, if we knew what encoding is imposed by ContentMine and what is the original one, we could pass the encodings to iconv and transform the PDFs back to original. For example, if we knew that getpapers enforces "codepage 850" to a file that is actually UTF-8, then we could do

iconv -f CP850 -t UTF-8 corrupt.pdf > utf-8.pdf

and utf-8.pdf would be the original one (at least so I think). I have tried this with the following script:

#! /bin/bash

# Accepts two arguments:
# 
# - the filename of the PDF with the wrong encoding
# - the filename of the PDF with the correct encoding
# 
# Assuming that the correct encoding is UTF-8, the script
# tries all valid iconv encodings and compares the MD5 sums.
# If a MD5 sum matches the MD5 of the PDF with the correct 
# encoding, the script reports the source encoding.
# The purpose of this exercise in futility is to find out
# which encoding caused corruption in file_to_test.

file_to_test="$1"
md5_correct="$(md5sum "$2" | awk '{print $1}')"

b="$(basename "$file_to_test" .pdf)"

for a in $(iconv --list | sed 's/\/\/$//' | sort); do
  echo -n -e "$a\t"
  a_sanitized="$(echo "$a" | sed -e 's/[^0-9a-zA-Z]/-/g; s/--*/-/g')"
  iconv -f "$a" -t UTF-8 "$file_to_test" > "${b}-$a_sanitized.pdf" 2>/dev/null
  if [ -e "${b}-$a_sanitized.pdf" ]; then
    md5="$(md5sum "${b}-$a_sanitized.pdf" | awk '{print $1}')"
    echo -e "$md5\t$md5_correct"
    if [ "$md5" = "$md5_correct" ]; then
      echo "### FOUND THE CULPRIT! IT IS ENCODING $a !"
      break
    else
      rm "${b}-$a_sanitized.pdf"
    fi
  fi
done

but it did not find anything. I think I'm close, but some detail escapes me...

sedimentation-fault commented 7 years ago

BTW, I installed nodejs 6.9.4, uninstalled getpapers and re-installed it. The warnings went away, but the above problem persists. It therefore does not have to do with any version incompatibilities.

petermr commented 7 years ago

Thanks for these issues. I can't help directly (@blahah wrote most of the code). I agree it looks like an encoding issue. I wonder if it's useful to find an Open Access PDF on the publisher's site and try that. That could show whether the encoding was changed on the EPMC site or whether it was more general.

sedimentation-fault commented 7 years ago

Is there any progress on this issue? It prevents me from using getpapers, as a significant proportion of PDFs is corrupt. Could you reproduce it? Why is no one assigned to it?

blahah commented 7 years ago

@sedimentation-fault sorry that this is blocking you. The reason that nobody is assigned is that I would probably be the right person, but am swamped with work right now and not funded to work on this. I promise to get to it as soon as I have time. Maybe @tarrow could take a look?

tarrow commented 7 years ago

I'll take a look at this

tarrow commented 7 years ago

A cursory check with your ID suggests that the problem is EuropePMC serving something odd. If I try to replicate with your command, my fulltext.pdf actually contains (at the start):

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>502 Proxy Error</title>
</head><body>
<h1>Proxy Error</h1>
<p>The proxy server received an invalid
response from an upstream server.<br />
The proxy server could not handle the request <em><a href="/backend/ptpmcrender.${EXT_RENDER}">GET&nbsp;/backend/ptpmcrender.${EXT_RENDER}</a></em>.<p>
Reason: <strong>DNS lookup failure for: ${ppmc_host}</strong></p></p>

Obviously this is different from your pdf (with poor encoding).

I'll dig a little deeper; it may be that they have changed the location we should be getting PDFs from and the old one is failing (and sometimes giving poorly encoded PDFs). It may also be that they have a problem on their end and we are looking in the right place.

tarrow commented 7 years ago

Seems like we're looking in the correct place for the download url.

If you go to: http://www.ebi.ac.uk/europepmc/webservices/rest/search/query=PMCID%3A%22PMC5293196%22%20JOURNAL%3A%22PLOS%20ONE%22%20OPEN_ACCESS%3Ay&resulttype=core&pageSize=1000&cursorMark=*

You'll see the raw JSON query response. If you search around for "fullTextUrl" you'll find the link is supposedly: http://europepmc.org/articles/PMC5293196?pdf=render

If you try and load that then you'll get the 502 proxy error I posted above. I also get this if I curl the link.

As you can see, I can't exactly replicate your bug; I would imagine the previous problems were a result of something on EuPMC's end - perhaps a proxy server mangling the encoding for some User-Agents (I should check what ours is for the downloader).

We don't do anything special to download, there is no part of getpapers that may enforce an 'unnatural encoding'. It should be just as it comes off the wire.

Unfortunately the problem is entirely on EuropePMC's end. I can suggest you contact EuropePMC; I'll send them an email as well. In the meantime (hopefully) downloading XML seems to work fine. This is probably more useful for you anyway than PDF.

Sorry I can't be of more help. Let me know if there is anything we can do to help.

tarrow commented 7 years ago

@sedimentation-fault If you have a chance could you confirm if this is now fixed for you?

It seems to work for me with the example query you gave. I can successfully read the pdf in okular.

petermr commented 7 years ago

Thanks to all of you, Just to confirm that EPMC have more than once changed their API without warning. We are wholly dependent on what they provided and do not get consulted before the event.


sedimentation-fault commented 7 years ago

@tarrow , thanks for looking at this. I opened both PDFs in okular, the 'good' and the 'bad' one, as described in my original post above. Yes, okular displays both of them - BUT: there is a huge difference in display quality and fonts. A direct comparison of both files in okular shows that the PLOS logo is missing in the header, the font is different and many characters are simply displayed as garbage in the bad file.

The reason is the differences between the two files that I pointed out above. You can replicate this bug by downloading the PDF (from that 'render' URL you gave) once with curl and once with getpapers, then comparing MD5 sums. You will see that, even if getpapers downloads from the right URL with the right API, the file it stores on disk has the wrong MD5 sum.

For this reason, I think this bug should be left open.

Can you please point me to the files in the getpapers package that are directly involved in the download? I might try to trace the problem down - although there is no promise of actually locating it...

sedimentation-fault commented 7 years ago

The problem is in this code block

    var get = got(url, options, function(err, data, res) {
      dlprogress.tick();
      if (err) {
        if (err.code === 'ETIMEDOUT' || err.code === 'ESOCKETTIMEDOUT') {
          log.warn('Download timed out for URL ' + url);
        }
        if (!res) {
          failed.push(url);
        } else if ((res.statusCode == 404) && !(fourohfour === null)) {
          fourohfour();
        } else {
          failed.push(url);
        }
        cb();
      } else {
        fs.writeFile(base + rename, data, cb);
      }
    });

in file

/usr/lib/node_modules/getpapers/lib/eupmc.js

In simple words, you use the got module to download the PDF, passing the options in the options object. One of the options you may pass is encoding. It has the default value of utf8, unless you explicitly set it to something else.

For downloads of binary files, like PDFs, the encoding option must be explicitly set to null, otherwise the downloaded file will be corrupt. See

https://github.com/google/google-api-nodejs-client/issues/618

The link talks about the node.js request module, but since got is a lightweight replacement for request, the options object in got should have the same function as in request. In the eupmc.js file, encoding is set to null.
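The effect of leaving encoding at its default can be demonstrated without any download at all. The following sketch (plain Node, no external modules) decodes the PDF "binary comment" bytes quoted earlier in this thread as UTF-8 - which is what a downloader effectively does when encoding is not set to null - and shows the bytes are irreversibly replaced:

```javascript
// The "%âãÏÓ" binary-comment bytes from the start of the 'good' PDF.
const original = Buffer.from([0x25, 0xe2, 0xe3, 0xcf, 0xd3]);

// What happens when `encoding` is left at its default: the body is
// decoded as a UTF-8 string before being written to disk.
const asString = original.toString('utf8');
const written = Buffer.from(asString, 'utf8');

// The invalid UTF-8 bytes become U+FFFD (the '�' seen in the 'bad'
// PDF), each of which re-encodes to three bytes - so the data is both
// bloated and irreversibly lost.
console.log(asString.includes('\uFFFD'));      // true
console.log(written.length > original.length); // true
console.log(original.equals(written));         // false
```

This matches the symptoms reported above: "%âãÏÓ" showing up as "%����", and a larger file size for the corrupt download.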

The curious thing is that, even if I change encoding to utf8, the downloaded file has the same (wrong) MD5 sum. I would expect that changing encoding would change the downloaded file (to be sure, I do delete the PDF, before I re-download it). Could it be that the options object is not passed at all to the got module? Is it a problem with the "callback function" cb? Or the fs.writeFile function?

Why does got download the file with an MD5 sum of

b815dff20f24ddbb035b4bfc3743d7f5

when the correct MD5 sum is

5771a39765664bedb8ad8815b7980921

for PMC5293196?

blahah commented 7 years ago

Great sleuthing, @sedimentation-fault ! I'll try to take a look tomorrow now that you've narrowed it down.

sedimentation-fault commented 7 years ago

PMC5293196.pdf fulltext.pdf

I attach the two versions ("good" and "bad") for PMC5293196.

PMC5293196.pdf: the good one, i.e. the one you get with curl. fulltext.pdf: the bad one, i.e. the one you get with getpapers.

If you open the two files in a pure text editor (say, in vi), you will notice that the differences are in parts like stream objects, i.e. in the binary parts. This explains why the PLOS icon does not appear in okular with the bad file. It is something that has to do with the encoding of the binary blobs of a PDF file.

Besides, the bad file is almost double in size, indicating that its binary parts have been encoded in a "bloating" way - base64?

I think I'm pretty close...Now it's your turn, dear @blahah ! :)

tarrow commented 7 years ago

I think you're really close to the problem, but I think it actually exists in download.js. I'm not sure why 'got' is still being used in eupmc.js (and only for supplementary info). Must be an oversight by me.

I think the solution probably comes in the same way: pass encoding: null as an option to request in download.js.

I'm currently travelling. If the wifi holds I'll commit a patch.


sedimentation-fault commented 7 years ago

@tarrow ,

you are right, the PDFs are downloaded in

/usr/lib/node_modules/getpapers/lib/download.js

I've checked it with a debug message - I added a line with a HELLO message:

function downloadURL(urlObj) {
  var url = urlObj.url;
  var id = urlObj.id;
  var type = urlObj.type;
  var rename = sanitize(urlObj.rename);
  var base = sanitize(id) + '/';
  log.debug('HELLO from requestretry module in download.js! :-)');
  log.debug('Creating directory: ' + base);
  mkdirp.sync(base);

and it showed up.

You fill the options object with the right options, but you don't pass it to the requestretry module. You have:

rq = requestretry.get({url: url, fullResponse: false} );

I replaced it with:

rq = requestretry.get({url: url, options, fullResponse: false} );

but the problem remains...

I start to suspect the callbacks that handle the download...

sedimentation-fault commented 7 years ago

I started writing a long account of what I think is (part of) the solution, when my browser decided to quit the job in the middle of writing... :-(

Here is my new attempt:

By "suspecting the callbacks" above, I meant the following: options seem to be correct and correctly passed to requestretry module - therefore, it cannot be them. If it is not the options, then it is the data and the way it is handled. This is done in the following code:

  function fileWriteCB(err) {
    if (err) throw error
    done()
  }

  function handleDownload(data) {
      fs.writeFile(base + rename, data, fileWriteCB);
      nextUrlTask(urlQueue);
    }

...

rq.then(handleDownload)

In my understanding, this means, in plain English, that the requestretry module uses the handleDownload callback to pass the downloaded data to the fileWriteCB callback, which in turn writes it to disk as base/rename. The problem must be that the data object is passed as a string between the callbacks, and in the process (pass-by-value?) it is implicitly converted to some other encoding (base64?).

Welcome to node.js callback hell!

I have seen numerous examples of how to download a simple HTML page with node.js/http/request/requestretry and most of them seem to suggest that one should use either streams, as in The Node.js Request Module:

let fileStream = fs.createWriteStream('node.png');  
request('https://nodejs.org/static/images/logos/nodejs-new-white-pantone.png').pipe(fileStream);

or buffers, as in the following code from Nodejs http retry on timeout or error (which uses the http module):

function httpGet(url, callback) {
    var retry = function(e) {
        console.log("Got error: " + e.message);
        httpGet(url, callback); //retry
    }

    var req = http.get(url, function(res) {
        var body = new Buffer(0);
        res.on('data', function (chunk) {
            body = Buffer.concat([body, chunk]);
        });
        res.on('end', function () {
            if(this.complete) callback(body);
            else retry({message: "Incomplete response"});
        });
    }).on('error', retry)
    .setTimeout(20000, function(thing){
        this.socket.destroy();
    });
}

In both cases, there are advantages against the simple passing of data between callbacks inside an object. In the case of streams:

...using streams for large amounts of data (like files) can help reduce your app's memory footprint and response time. To make this easier to use, each of the request methods can pipe their output to another stream. (from: The Node.js Request Module)

and in the case of buffers you can easily transform the data to a string (with the buffer's toString() function) and back.

So here is my explanation (didn't check the precise encoding details, but the overall picture is what is important to me right now):

In the case of binary downloads (as all PDF downloads are), the server transforms the file contents to base64 before sending them over the Internet, so you have to transform them back (to utf8?) before you write them to disk. That's where streams and buffers come in handy. base64 transforms (more precisely: bloats) the non-ASCII parts of a string, leaving ASCII untouched - this is what we observe in the comparison of the 'good' and 'bad' PDFs above. Currently, you pass the data 'as is' to the callbacks. It is passed by value, therefore transformed to a (utf8?) string and written to disk. This implicit transformation may be the right one - but in cases like the one in this bug it is not.

I'll leave it to you to sort out the exact details. :-)

sedimentation-fault commented 7 years ago

In the meantime, here is a workaround for the impatient:

Reading Using Node.js to download files, it became clear to me that one could use exec to just run curl as a subprocess of node - and let it do the dirty work!

Indeed, it works! Here's how:

In file

/usr/lib/node_modules/getpapers/lib/download.js

comment the requestretry download code as follows:

  // Commented. Causes PDFs to be downloaded corrupt, see bug #152:
  // https://github.com/ContentMine/getpapers/issues/152
  // 
  // // rq = requestretry.get({url: url, fullResponse: false} );
  // rq = requestretry.get({url: url, options, fullResponse: false} );
  // rq.then(handleDownload)
  // rq.catch(throwErr)
  // 

and add, immediately thereafter:

  // Alternative method: use 'exec' to run 'curl -o ...'
  // Compose the curl command
  var mycurl = 'curl -o \'' + base + rename + '\' \'' + url + '\'';
  log.debug('Executing: ' + mycurl);
  // execute curl using child_process' exec function
  var child = exec(mycurl, function(err, stdout, stderr) {
      if (err) throw err;
      else log.debug(rename + ' downloaded to ' + base);
  });
  nextUrlTask(urlQueue);

That should also be the end of the function - there is only the closing curly bracket } after that.

Works like a charm! The downloaded PDF has now the correct MD5 sum! :-)

The only drawback of the current implementation is that one loses curl's interactive output (e.g. the progress bar), but I am sure the experts can devise a callback that catches it and logs it to the console or the logger device. ;-) We need a callback that also catches standard output (stdout), not only standard error (stderr), in POSIX speak.

To allow for retries and such, you may execute some wrapper to curl, instead of curl directly:

var mycurl = 'some-curl-wrapper -some-other-options -o \'' + base + rename + '\' \'' + url + '\'';

Enjoy! :-)

sedimentation-fault commented 7 years ago

The above workaround, in conjunction with the preceding observations, proves the following assertions:

sedimentation-fault commented 7 years ago

I forgot to mention: the above workaround will work 'as is' on Linux. One may have to tweak the 'compose the curl command' part to make it work on other operating systems.

tarrow commented 7 years ago

Hi,

Sorry, travel has prevented me responding sooner

Very interesting hack-around. Actually, I believe you were very close to solving the problem about 3 comments up. Have a look at the latest commit; I think this is a fix. Sorry I didn't post it sooner.

It was as you suggested; set encoding to null. See: http://stackoverflow.com/questions/14855015/getting-binary-content-in-node-js-using-request

I get the "good" md5 for the papers with this. More work needs to be done to be sure this doesn't break unicode xml downloads though.

petermr commented 7 years ago

Thanks to everyone who is contributing fixes and insights - it's really valuable. It's a difficult problem because we don't know in detail what is emitted by EPMC. Hopefully this will also help when we scrape other publishers.


sedimentation-fault commented 7 years ago

@petermr , EUPMC may have other insufficiencies that would fully justify bashing them - but this is not one of those. As tarrow can certainly explain to you in full detail, this is not EUPMC-specific - it affects all PDF downloads, no matter from which source. I am sure it will help with all publishers who serve binary content (of which PDF is only an instance). If you read the comments in the link given by tarrow two posts above, you will see that this is a very subtle bug that has made life hell to quite some people.

But it's gone now and that's really great news - treat yourself and tarrow to a cigar and a glass of wine (or, whatever your favourite beverage happens to be)! :+1:

@tarrow , thank you very much for the commit - that fixed it for sure! :+1:

One note however: I have noted that just doing

npm update --global getpapers

would NOT upgrade getpapers to include your commit - I had to apply the patch manually.

More work needs to be done to be sure this doesn't break unicode xml downloads though.

The standard procedure would be to check the HTTP response headers: filter either the 'Content-Disposition' header, to get a clue about the file type, or the 'Content-Type' header, looking for 'application/pdf', or both - then set encoding to null only for the content types that need it.

A trick I have sometimes used is to send a request for the HTTP headers only, examine them, then send the actual GET a second time with all headers/settings set to the right values. With curl, you can fetch the headers alone by passing the -I option (which sends a HEAD request).
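A hypothetical helper along these lines - the function name and the list of binary media types are illustrative guesses, not anything that exists in getpapers:

```javascript
// Decide from the Content-Type header whether the response body must
// be kept as a raw Buffer (encoding: null) rather than decoded to a
// string. The type list here is an illustrative guess, not exhaustive.
function shouldKeepBinary(headers) {
  const type = (headers['content-type'] || '').toLowerCase();
  return type.startsWith('application/pdf') ||
         type.startsWith('application/octet-stream') ||
         type.startsWith('image/');
}

console.log(shouldKeepBinary({ 'content-type': 'application/pdf' }));         // true
console.log(shouldKeepBinary({ 'content-type': 'text/xml; charset=UTF-8' })); // false
console.log(shouldKeepBinary({}));                                            // false
```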

And before I forget: you should check with the original poster of https://github.com/ContentMine/getpapers/issues/145 whether your commit fixes that bug too - I strongly suspect so. You can thus assign both bugs to the commit and change the status of both to CLOSED/FIXED. ;-)

sedimentation-fault commented 7 years ago

But - again - I did forget to ask something:

Why on earth did neither the old original line

rq = requestretry.get({url: url, fullResponse: false} );

nor my own attempt

rq = requestretry.get({url: url, options, fullResponse: false} );

manage to pass the options object (which already has encoding: null) to requestretry? Why is a form like

rq = requestretry.get({url: url, fullResponse: false, headers: {'User-Agent': config.userAgent}, encoding: null });

(as in the commit) required to make it work?

blahah commented 7 years ago

rq = requestretry.get({url: url, options, fullResponse: false} ); isn't valid JavaScript (except in ES6, maybe).

There's one argument to get in the examples you gave: the options object. You want to merge your options with what's passed to request. This should work:

rq = requestretry.get(Object.assign({url: url, fullResponse: false}, options));
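The difference between the two forms is easy to see in isolation (the url and options values here are stand-ins, not getpapers code):

```javascript
const url = 'http://example.org/fulltext.pdf'; // placeholder URL
const options = { encoding: null };

// ES6 shorthand: `options` becomes a *nested* key named "options", so
// a request library never sees `encoding` at the top level - there is
// no error, the option is just silently ignored.
const wrong = { url: url, options, fullResponse: false };

// Object.assign merges the option into the top level, where the
// request library actually looks for it.
const right = Object.assign({ url: url, fullResponse: false }, options);

console.log('encoding' in wrong); // false
console.log('encoding' in right); // true
console.log(right.encoding);      // null
```

This also explains why the earlier attempt downloaded a file without complaint but still produced a corrupt PDF.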

sedimentation-fault commented 7 years ago

@blahah , Indeed, it works! :+1: Something new to learn, thank you! :-)

I derived my version from requestretry documentation, where it says:

request.get(url, options [, callback]) - same as request(options [, callback]), defaults options.method to GET.

From that I gathered that the form

requestretry.get(url, options, something)

should be a valid one. Obviously, for something='fullResponse: false' it is NOT (and I admit that 'something' like that does not look like a callback ;-)).

Curiously, I did not get an error - it just failed in a subtle way: it downloaded the PDF but failed to pick up the options object, resulting in a corrupt PDF. For someone like me who is used to getting compiler warnings for hair-splitting 'issues', this is more than weird...

@tarrow, I suggest you use blahah's version. It looks cleaner to me to have an options object where all options are collected and pass that one to requestretry, instead of giving it extra options for user agent, encoding etc. each time it is called. Other than that, both versions resolve the bug for me.

Thank you all. :-)