ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

Warnings, error and nothing gets saved #139

Open solstag opened 8 years ago

solstag commented 8 years ago
$ node node_modules/getpapers/bin/getpapers.js -V
0.4.10

$ node node_modules/getpapers/bin/getpapers.js -q '(KW:"global health") AND (PUB_TYPE:"Review" OR PUB_TYPE:"review-article" OR PUB_TYPE:"Meta-Analysis")' -o data/globalhealth -x -a
info: Searching using eupmc API
warn: We had to retry the last request 2 times.
info: Found 10733 results
Retrieving results [===---------------------------] 9% (eta 0.0s)warn: We had to retry the last request 2 times.
Retrieving results [======------------------------] 19% (eta 409.1s)warn: We had to retry the last request 2 times.
Retrieving results [========----------------------] 28% (eta 498.2s)warn: We had to retry the last request 2 times.
Retrieving results [===========-------------------] 37% (eta 497.5s)warn: We had to retry the last request 2 times.
Retrieving results [==============----------------] 47% (eta 462.9s)warn: We had to retry the last request 2 times.
Retrieving results [=================-------------] 56% (eta 416.8s)warn: We had to retry the last request 3 times.
Retrieving results [====================----------] 65% (eta 387.0s)warn: We had to retry the last request 3 times.
Retrieving results [======================--------] 75% (eta 301.6s)warn: We had to retry the last request 2 times.
Retrieving results [=========================-----] 84% (eta 193.3s)warn: We had to retry the last request 3 times.
Retrieving results [============================--] 93% (eta 86.3s)warn: We had to retry the last request 3 times.
Retrieving results [==============================] 100% (eta 0.2s)warn: We had to retry the last request 50 times.
error: Malformed or empty response from EuropePMC. Try running again. Perhaps your query is wrong.

$ ls data/globalhealth/

$
solstag commented 8 years ago

Running a second time without the '-x' worked despite the warnings. I still don't understand what the warnings mean in practical terms. I will try running again with '-x' to see whether the error was a one-off.

solstag commented 8 years ago

Ok, two observations...

  1. The error does not seem occasional: I tried again with the same results.
  2. The error is contingent on the size of the request: I limited the query in time to get 1/10 of the volume and it just worked.

I find this kinda weird: just removing '-x' makes it work, even though the error apparently takes place before the full texts are downloaded.
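
The second observation suggests a workaround: split the query into smaller time slices and run each one separately. Below is a minimal sketch, assuming the EuropePMC PUB_YEAR search field; the helper names, the year range and the per-year output directories are hypothetical. It only builds the command lines and does not run getpapers.

```python
import shlex

BASE_QUERY = ('(KW:"global health") AND (PUB_TYPE:"Review" OR '
              'PUB_TYPE:"review-article" OR PUB_TYPE:"Meta-Analysis")')


def split_query_by_year(query, first_year, last_year):
    """Return (year, sub_query) pairs, each restricting the query to one year.

    Assumption: EuropePMC accepts PUB_YEAR:<year> as a search field.
    """
    return [(year, f"({query}) AND PUB_YEAR:{year}")
            for year in range(first_year, last_year + 1)]


def build_commands(query, first_year, last_year, outdir):
    """Build one getpapers command line per year (hypothetical directory layout)."""
    commands = []
    for year, sub_query in split_query_by_year(query, first_year, last_year):
        args = ["getpapers", "-q", sub_query,
                "-o", f"{outdir}/{year}", "-x", "-a"]
        # shlex.join quotes the query so it survives a copy-paste into a shell.
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    # The year range is illustrative only.
    for command in build_commands(BASE_QUERY, 2014, 2016, "data/globalhealth"):
        print(command)
```

Each printed line can be run by hand, so a slice that fails can be retried without repeating the ones that already succeeded.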

tarrow commented 8 years ago

When you remove the -x are you also removing the -a? This is a possible area that we don't handle very well.

Just adding or removing '-x' by itself shouldn't have any impact on the metadata download. I would suggest this is correlation rather than causation, i.e. bad luck that it worked without '-x' but not with it.

You could run getpapers -x -o data/globalhealth -r if you have one successful download of all the metadata in data/globalhealth to try and use that to perform just the xml downloading.

In general the warnings mean that there is a problem with the internet connection between you and EuPMC. They are there so that people don't just assume very long wait times are due to us/them being slow. So far my experience (in the last few weeks) of running it on a stable wired connection is that I don't get any of the warnings. We have seen them when people run on a laptop over WiFi.
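
To make the meaning of the warning concrete, here is a rough sketch of a retry wrapper that would print such a message. This is not the getpapers code: fetch_with_retries and its parameters are hypothetical, and the limit of 50 is only a guess based on the last warning in the log above.

```python
def fetch_with_retries(fetch, max_retries=50, log=print):
    """Call fetch() until it succeeds or max_retries retries have been used.

    fetch is a hypothetical zero-argument callable that raises
    ConnectionError when the request fails or times out.
    """
    retries = 0
    while True:
        try:
            result = fetch()
        except ConnectionError:
            if retries >= max_retries:
                raise  # give up: every allowed retry has been used
            retries += 1
            continue
        if retries:
            # The request succeeded in the end, but only after some failures.
            log(f"warn: We had to retry the last request {retries} times.")
        return result
```

Under this reading, a warning with a small count means the page was eventually retrieved, so nothing is lost. Only running out of retries is fatal.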

solstag commented 8 years ago

I was not removing the '-a' when removing the '-x'.

I only run things from a university's datacenter connection in Paris, so connectivity issues are unlikely.

Using '-r' seems to work, except that it complains if I don't provide a query string, which is kinda weird if it ain't replaying the query.

tarrow commented 8 years ago

Ah, the latter is definitely a problem in the code. I'll fix that.

The warnings, unfortunately, only come as a result of a connectivity issue. Or at least I can't figure out where else they come from. I do have a theory, though, that the last warning and the subsequent failure might happen because we keep asking for a page of results when there are none left, for example because EuPMC overreported the number of results.
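
The theory can be illustrated with a small paging loop. This is a sketch under stated assumptions, not the actual getpapers implementation: fetch_page, reported_total and page_size are hypothetical names. The loop stops as soon as a page comes back empty instead of requesting pages until the reported total is reached.

```python
def fetch_all(fetch_page, reported_total, page_size):
    """Collect results page by page, stopping early on an empty page.

    fetch_page(page, page_size) is a hypothetical callable returning a list
    of results; reported_total is the hit count announced by the server.
    """
    results = []
    page = 0
    while len(results) < reported_total:
        items = fetch_page(page, page_size)
        if not items:
            # Nothing more to fetch, even though reported_total says otherwise.
            break
        results.extend(items)
        page += 1
    return results


if __name__ == "__main__":
    records = list(range(25))  # the server really has 25 records ...

    def fake_fetch_page(page, page_size):
        start = page * page_size
        return records[start:start + page_size]

    # ... but reports 30, so a loop relying only on the count would never finish.
    print(len(fetch_all(fake_fetch_page, reported_total=30, page_size=10)))
```

Without the empty-page check, the final request would be repeated until the retry limit is hit, which would match a last warning followed by an error.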

solstag commented 8 years ago

Good news is I've now run

node node_modules/getpapers/bin/getpapers.js -q '(KW:"global health") AND (PUB_TYPE:"Review" OR PUB_TYPE:"review-article" OR PUB_TYPE:"Meta-Analysis")' -o data/globalhealthnew -xa

and it worked all the way to the end, including downloading files.

It emitted warnings as usual, and reported a few duplicate entries.

I'm positive there is no connectivity issue (unless it's on EuPMC's side, which also seems unlikely). It might be that EuPMC reacts badly to big requests, maybe they have memory issues and their process dies unexpectedly, leaving us with timed out or malformed responses.

tarrow commented 8 years ago

I suppose that is a possibility; alternatively it could be that there is some filtering between you and EuPMC (at either end) that misinterprets our repeated requests as abuse of some sort and throttles/drops the packets.