**sedimentation-fault** opened this issue 7 years ago
@sedimentation-fault at this point I think it's worth saying that you've discovered enough bugs that it's clear we need to rewrite some fundamentals. We knew about some of these internally for some time, but haven't had the resources to resolve them. If you could refrain from reporting new issues for now that would be helpful - the existing reports have enough detail that we have a good set of goals to test against when rewriting. At this point more issues just gives us more admin to do afterwards. We really appreciate the detailed analysis so far though :)
The good thing is, all this discussion helps us to close bugs too, like https://github.com/ContentMine/getpapers/issues/157 :wink:
But, no problem. I understand - I will keep quiet for a while... :smile: KUTGW! :+1:
Let me break the silence and tell you the solution - there is no point in leaving you groping in the dark. :smile:
This bug is actually two bugs in one:
I thought this had to do with the '-r' option, but it does not. download.js already checks whether a file with the same name exists and does not re-download it if one is there. BUT: that check fails for arXiv papers because their IDs contain slashes - and the IDs are not sanitized in the function that performs the check.
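To illustrate the effect (with a hypothetical stand-in for the real `sanitize()` helper - the exact replacement rules here are illustrative, not getpapers' actual implementation): a legacy arXiv ID contains a slash, so using the raw ID as a directory name produces a nested path that the "already downloaded?" check never inspects.

```javascript
// Hypothetical stand-in for a sanitize() helper: strip characters that
// are illegal or problematic in a single path segment.
function sanitize(name) {
  return String(name).replace(/[\/\\?%*:|"<>]/g, '_');
}

// Legacy arXiv IDs contain a slash:
const id = 'hep-th/9901001';
console.log(id + '/');           // 'hep-th/9901001/'  -> two directory levels
console.log(sanitize(id) + '/'); // 'hep-th_9901001/'  -> one directory level
```

With the unsanitized ID, the existence check looks for `hep-th/9901001/...` while a sanitized download path was written elsewhere, so every such file is fetched again.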
In download.js, replace

```js
var rename = urlObj.rename;
var base = id + '/';
```

with

```js
var rename = sanitize(urlObj.rename);
var base = sanitize(id) + '/';
```
In arxiv.js, replace

```js
arxiv.pageQuery();
```

with

```js
if (arxiv.opts.restart) {
  fs.readFile('arxiv_results.json', (err, data) => {
    if (err && err.code === 'ENOENT') {
      log.error('No existing download to restart');
      process.exit(1);
    } else if (err) {
      throw err;
    } else {
      log.info('Restarting previous download');
      arxiv.allresults = JSON.parse(data);
      arxiv.addDlTasks();
    }
  });
} else {
  arxiv.pageQuery();
}
```
Still in arxiv.js, add

```js
  arxiv.addDlTasks()
}

ArXiv.prototype.addDlTasks = function () {
  arxiv = this
```

after

```js
var filename = chalk.blue('arxiv_results.json')
log.info('Full ArXiv result metadata written to ' + filename);
```

(i.e. the rest of the original function body becomes a new `addDlTasks` method, which the restart path can also call).
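Put together, this refactor splits the old monolithic callback in two: writing the metadata stays where it was, and task creation becomes a method both code paths can call. A simplified sketch (all names other than `addDlTasks` are hypothetical, and the real task queueing is stubbed out):

```javascript
function ArXiv(opts) {
  this.opts = opts;
  this.allresults = [];
}

// Fresh-query path: after writing arxiv_results.json (omitted here),
// hand off to the shared task builder.
ArXiv.prototype.handleResults = function () {
  return this.addDlTasks();
};

// Shared by both the fresh-query and --restart paths.
ArXiv.prototype.addDlTasks = function () {
  return this.allresults.map(function (r) {
    return 'download ' + r.id; // stand-in for queueing a real download task
  });
};

const arxiv = new ArXiv({ restart: true });
arxiv.allresults = [{ id: 'hep-th/9901001' }];
console.log(arxiv.addDlTasks()); // [ 'download hep-th/9901001' ]
```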
For your convenience, I attach two patches. Inspect them before you apply them - the patch for download.js also contains commented-out and debugging code (I found it informative in this form, YMMV).
**Description**
While appending '-r' to a getpapers command prevents re-downloading PDFs that are already present for queries against the EUPMC API, the same is NOT true of the arxiv API: it not only repeats the metadata query, but also re-downloads each and every PDF, from the first to the last.
This is very annoying if you have 24000 PDFs to download, of which a few thousand have already been downloaded in previous attempts.
**How to reproduce**
Run

```sh
getpapers --api 'arxiv' --query 'cat:math.DG' --outdir math.DG -p
```

Let it download some number of PDFs, then interrupt it and retry with

```sh
getpapers --api 'arxiv' --query 'cat:math.DG' --outdir math.DG -p -r
```
**Side effects**
An annoying side effect is that the web server replies with a

```
416 Range Not Satisfiable
```

error if the downloader (be it the standard requestretry from the master code, or a curl wrapper from a custom fork, as in https://github.com/ContentMine/getpapers/issues/157) attempts a resumable download, i.e. tries to resume from where it stopped - in this case from the end of an already fully downloaded file, a byte range whose start is certainly not satisfiable. If the error is not caught, the script stops at the first file that is already present, with the above 416 error.
This side effect may not occur in the master code (or it may appear as a caught 416 error; I didn't check).
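If one does want a resumable downloader to tolerate already-complete files, the usual approach is to treat a 416 on a Range request as success rather than failure. A minimal sketch of the status handling only (not getpapers' actual code; the function name is hypothetical):

```javascript
// Classify the server's answer to a resumed download that sent a
// "Range: bytes=<localFileSize>-" header.
function classifyResumeResponse(statusCode) {
  if (statusCode === 206) return 'resumed';          // partial content follows
  if (statusCode === 200) return 'restarted';        // server ignored the Range
  if (statusCode === 416) return 'already-complete'; // requested range starts past EOF
  return 'error';
}

console.log(classifyResumeResponse(416)); // 'already-complete'
```

Mapping 416 to "already complete" is what lets a resuming client skip finished files instead of crashing on the first one.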