ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

Resume (-r) does not work the same with arxiv, as with EUPMC API #158

Open sedimentation-fault opened 7 years ago

sedimentation-fault commented 7 years ago

Description

While appending '-r' to a getpapers command prevents re-downloading PDFs that are already present for queries against the EUPMC API, the same is NOT the case with the arxiv API: it not only retries the metadata query, but also re-downloads each and every PDF, from the first to the last.

This is very annoying if you have 24000 PDFs to download, of which a few thousand have already been downloaded in previous attempts.

How to reproduce

Run

getpapers --api 'arxiv' --query 'cat:math.DG' --outdir math.DG -p

Let it download some number of PDFs, then interrupt it and retry with

getpapers --api 'arxiv' --query 'cat:math.DG' --outdir math.DG -p -r

Side effects

An annoying side effect of this is that the web server will reply with a

416 Range Not Satisfiable

error if the downloader (be it the standard requestretry from the master code, or a curl wrapper from a custom fork, as in https://github.com/ContentMine/getpapers/issues/157) attempts a resumable download, i.e. tries to resume from where it stopped. In this case that means resuming from the end of an already completely downloaded file, a byte range whose start is certainly not satisfiable. If the error is not caught, the script stops at the first file that is already present, with the above 416 error.

This side effect may not occur in the master code (or it may appear as a 416 caught error, I didn't check).
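To make the 416 case concrete, here is a minimal sketch (not getpapers code; all names are hypothetical) of how a downloader could classify the server's response to a Range request so that a 416 on an already-complete file is treated as "skip" rather than a fatal error:

```javascript
// Hypothetical helper: decide what to do with the response to a Range request.
function classifyRangeResponse(statusCode, localSize, totalSize) {
  if (statusCode === 416 && localSize >= totalSize) {
    return 'already-complete'; // nothing left to fetch; skip this file
  }
  if (statusCode === 206) return 'resume';  // server honoured the byte range
  if (statusCode === 200) return 'restart'; // server ignored the Range header
  return 'error';
}

console.log(classifyRangeResponse(416, 1024, 1024)); // already-complete
console.log(classifyRangeResponse(206, 512, 1024));  // resume
```

With a check like this, a fully downloaded file simply short-circuits instead of crashing the whole run.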

blahah commented 7 years ago

@sedimentation-fault at this point I think it's worth saying that you've discovered enough bugs that it's clear we need to rewrite some fundamentals. We knew about some of these internally for some time, but haven't had the resources to resolve them. If you could refrain from reporting new issues for now that would be helpful - the existing reports have enough detail that we have a good set of goals to test against when rewriting. At this point more issues just gives us more admin to do afterwards. We really appreciate the detailed analysis so far though :)

sedimentation-fault commented 7 years ago

The good thing is, all this discussion helps us to close bugs too, like https://github.com/ContentMine/getpapers/issues/157 :wink:

But, no problem. I understand - I will keep quiet for a while... :smile: KUTGW! :+1:

sedimentation-fault commented 7 years ago

Let me break the silence and tell you the solution - there is no point in leaving you to grope in the dark. :smile:

Solution

This bug is actually two bugs:

Do not redownload papers if they are already there

Reason

I thought this had to do with the '-r' option, but it does not. download.js already checks whether a file with the same name exists and does not re-download it if one is found. BUT: the check fails for arxiv papers because their IDs contain slashes, and the IDs are not sanitized in the function that performs the check.

Solution

In download.js, replace

  var rename = urlObj.rename;
  var base = id + '/';

with

  var rename = sanitize(urlObj.rename);
  var base = sanitize(id) + '/';
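To see why the unsanitized ID breaks the existence check, here is a small sketch. The inline `sanitizeId` below is only a stand-in for the sanitize-filename package that getpapers uses (its exact behaviour may differ); the point is the slash in old-style arXiv IDs:

```javascript
// Stand-in for the sanitize-filename package (hypothetical, for illustration).
const sanitizeId = (s) => s.replace(/[\/\\:*?"<>|]/g, '_');

const id = 'math.DG/0307245'; // old-style arXiv ID containing a slash

// Unsanitized: the slash creates an extra directory level, so the path the
// existence check looks at never matches the file that was actually written.
var base = id + '/';              // as in the old download.js
var fixed = sanitizeId(id) + '/'; // as in the fix above

console.log(base + 'fulltext.pdf');  // math.DG/0307245/fulltext.pdf
console.log(fixed + 'fulltext.pdf'); // math.DG_0307245/fulltext.pdf
```

Since the check and the write must agree on the path, sanitizing in both places makes the skip logic work for arxiv IDs too.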

Do not retry the query if the user passed the '-r' option

Solution

In arxiv.js, replace

arxiv.pageQuery();

with

  if (arxiv.opts.restart) {
    fs.readFile('arxiv_results.json', (err, data) => {
      if (err && err.code === 'ENOENT') {
        log.error('No existing download to restart');
        process.exit(1);
      } else if (err) {
        throw err;
      } else {
        log.info('Restarting previous download');
        arxiv.allresults = JSON.parse(data);
        arxiv.addDlTasks();
      }
    });
  } else {
    arxiv.pageQuery();
  }

Still in arxiv.js, add:

  arxiv.addDlTasks()

}

ArXiv.prototype.addDlTasks = function() {

  arxiv = this

after

   var filename = chalk.blue('arxiv_results.json')
   log.info('Full ArXiv result metadata written to ' + filename);

For your convenience, I attach two patches. Inspect them before you apply them - the patch for download.js also contains commented-out and debugging code (I found it informative in this form, YMMV).

patches.zip