solstag opened 8 years ago
Running a second time without '-x' worked despite the warnings. I still don't understand what the warnings mean in practical terms. I'll try running again with '-x' to see whether the error was something occasional.
Ok, two observations...
I find this kinda weird, since just removing '-x' makes it work; however, the error apparently takes place before downloading the full texts.
When you remove the -x, are you also removing the -a? This is an area that we may not handle very well.
Just adding or removing '-x' by itself shouldn't have any impact on the metadata download. I would suggest this may be correlation != causation, i.e. bad luck that it worked without it but not with it.
If you have one successful download of all the metadata in data/globalhealth, you could run getpapers -x -o data/globalhealth -r to try to use that metadata and perform just the XML downloading.
In general, the warnings mean that there is a problem with the internet connection between you and EuPMC. They're there so that people don't just assume very long wait times are due to us/them being slow. So far my experience (in the last few weeks) of running it on a stable wired connection is that I don't get any of the warnings. We have seen them when people run on a laptop over WiFi.
I was not removing the '-a' when removing the '-x'.
I only run things from a university's datacenter connection in Paris, so connectivity issues are unlikely.
Using '-r' seems to work, except that it complains if I don't provide a query string, which is odd if it isn't replaying the query.
ah, the latter thing is definitely a problem in the code. I'll fix that.
The warnings, unfortunately, only come as a result of a connectivity issue. Or at least I can't figure out where else they could come from. I do have a theory, though, that the last warning followed by the failure might be because we keep looking for a page of results when there are none left to come, for example because EuPMC overreported the number of results.
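The overreporting theory can be sketched concretely: if a client computes the number of pages to fetch from the server's reported hit count, but the server actually holds fewer records, the final page request comes back empty and the client warns while waiting for results that will never arrive. A minimal simulation (hypothetical numbers and a stand-in fetchPage, not getpapers' actual code):

```javascript
// Sketch of the overreporting theory: the server claims more results
// than it can really deliver, so the client requests one page too many.
// fetchPage simulates a server holding fewer records than it reported.

const PAGE_SIZE = 25;
const reportedHitCount = 95; // what the server's first response claims
const actualRecords = 75;    // what the server can actually return

function fetchPage(pageIndex) {
  const start = pageIndex * PAGE_SIZE;
  const end = Math.min(start + PAGE_SIZE, actualRecords);
  if (start >= actualRecords) return []; // nothing left to serve
  return Array.from({ length: end - start }, (_, i) => start + i);
}

const expectedPages = Math.ceil(reportedHitCount / PAGE_SIZE); // 4 pages
let downloaded = 0;
for (let page = 0; page < expectedPages; page++) {
  const records = fetchPage(page);
  if (records.length === 0) {
    // A client that trusts the reported hit count would warn and retry
    // here, waiting for results that will never come.
    console.log(`warning: page ${page} came back empty; ` +
      `${reportedHitCount - downloaded} reported results will never arrive`);
    break;
  }
  downloaded += records.length;
}
console.log(`downloaded ${downloaded} of reported ${reportedHitCount}`);
```

With these made-up numbers, the client expects four pages but the fourth is empty, matching the pattern of a final warning followed by failure.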
Good news: I've now run
node node_modules/getpapers/bin/getpapers.js -q '(KW:"global health") AND (PUB_TYPE:"Review" OR PUB_TYPE:"review-article" OR PUB_TYPE:"Meta-Analysis")' -o data/globalhealthnew -xa
and it worked all the way to the end, including downloading files.
It emitted warnings as usual and reported a few duplicate entries.
I'm positive there is no connectivity issue on my end (unless it's on EuPMC's side, which also seems unlikely). It might be that EuPMC reacts badly to big requests; maybe they have memory issues and their process dies unexpectedly, leaving us with timed-out or malformed responses.
I suppose that is a possibility. Alternatively, there could be some filtering between you and EuPMC (at either end) that misinterprets our repeated requests as abuse of some sort and throttles or drops the packets.
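Whatever the cause turns out to be (throttling, dropped packets, or a server process dying mid-request), the usual client-side mitigation is the same: retry the failed request with exponential backoff instead of giving up on the first timeout. A generic sketch, not getpapers' actual implementation, where `request` is any function returning a promise:

```javascript
// Generic retry-with-exponential-backoff wrapper for a flaky request.
// `request` is a hypothetical stand-in for a single EuPMC page request.

function delay(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function withRetries(request, maxAttempts = 5, baseDelayMs = 1000) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await request();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err; // out of attempts
      const wait = baseDelayMs * 2 ** attempt;    // 1s, 2s, 4s, ...
      console.log(`warning: request failed (${err.message}), retrying in ${wait}ms`);
      await delay(wait);
    }
  }
}
```

If a middlebox is dropping a burst of identical requests as suspected abuse, a backed-off retry usually gets through once the burst subsides, which is why this pattern helps regardless of which of the two theories is right.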