ContentMine / getpapers

Get metadata, fulltexts or fulltext URLs of papers matching a search query
MIT License

"Malformed response from arXiv API - no data in feed" woes... #179

Open sedimentation-fault opened 5 years ago

sedimentation-fault commented 5 years ago

I have been having a hard time getting my queries to complete lately. They get stuck in almost infinite loops of messages like:

Malformed response from arXiv API - no data in feed
Malformed response from arXiv API - no data in feed
Malformed response from arXiv API - no data in feed
...

The queries actually return far fewer than 50000 results, the supposed limit of arXiv's API: they return anywhere between 3000 and 12000 results. Here is an example:

category='math.AG'
start_date='20170101'
end_date='20190827'
getpapers --api 'arxiv' --query "cat:$category AND lastUpdatedDate:[${start_date}* TO ${end_date}*] " --outdir "$category" -p -l debug

In this issue (not strictly a 'bug') I document my attempts to get past those showstoppers. Here's what I did:

Set page size to 1000

I experimented with page sizes from 200 to 2000:

I thus settled on a page size of 1000 in getpapers/lib/arxiv.js:

arxiv.pagesize = 1000

Set a higher delay between retries

I experimented with various delays too. The default of 3 seconds hammers the server too fast, while 30 seconds is too much sleeping. 15 or 20 seconds seems to be O.K., so I have set

arxiv.page_delay = 20000

in getpapers/lib/arxiv.js.
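
To illustrate how the page size and the retry delay work together, here is a minimal sketch of a paged fetch loop that sleeps before retrying an empty page. It is not the code in getpapers/lib/arxiv.js: `fetchAllPages`, `fetchPage`, `sleep` and the `maxRetries` cap are hypothetical names I made up for illustration, and only the values 1000 and 20000 come from the settings above.

```javascript
// Hypothetical sketch, NOT the actual code in getpapers/lib/arxiv.js.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// fetchPage(start, pageSize) is assumed to resolve to an array of entries;
// an empty array stands for the "no data in feed" case.
async function fetchAllPages(fetchPage, total, opts = {}) {
  const pageSize = opts.pageSize ?? 1000;    // same value as arxiv.pagesize above
  const pageDelay = opts.pageDelay ?? 20000; // same value as arxiv.page_delay above, in ms
  const maxRetries = opts.maxRetries ?? 5;   // assumed cap so the loop cannot spin forever
  const results = [];

  for (let start = 0; start < total; start += pageSize) {
    let entries = [];
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      entries = await fetchPage(start, pageSize);
      if (entries.length > 0) break;                    // got data, move to the next page
      if (attempt < maxRetries) await sleep(pageDelay); // wait before retrying this page
    }
    if (entries.length === 0) {
      throw new Error(`No data in feed for start=${start} after ${maxRetries} retries`);
    }
    results.push(...entries);
  }
  return results;
}
```

With a cap on retries, a page that never returns data ends in an error instead of an endless stream of "Malformed response" messages.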

Do not urlencode the whole query URL, only the parts that need it

See https://github.com/ContentMine/getpapers/issues/178 for this.
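
As a rough illustration of what "only the parts that need it" can mean, the sketch below encodes just the value of `search_query` and leaves the URL's own `?`, `&` and `=` untouched. `buildArxivUrl` is a hypothetical helper, not a function from getpapers, and the actual change discussed in issue 178 may differ; the endpoint and the `search_query`, `start` and `max_results` parameters are those of arXiv's public API.

```javascript
// Hypothetical helper for illustration, not taken from getpapers.
const ARXIV_API = 'http://export.arxiv.org/api/query';

function buildArxivUrl(query, start, maxResults) {
  // Encode only the query value. The '?', '&' and '=' that give the
  // URL its structure must stay literal, so they are added unencoded.
  return ARXIV_API +
    '?search_query=' + encodeURIComponent(query) +
    '&start=' + start +
    '&max_results=' + maxResults;
}

const query = 'cat:math.AG AND lastUpdatedDate:[20170101* TO 20190827*]';
console.log(buildArxivUrl(query, 0, 1000));
```

Running the whole URL through `encodeURIComponent` instead would also escape the separators, so the server would no longer see distinct parameters.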

Correct bug where the results feed is not empty - but not full either...

See https://github.com/ContentMine/getpapers/issues/177 for details.

Last but not least (and I will repeat myself on this): do yourself a favour and spoof your User Agent in getpapers/lib/config.js:

config.userAgent = 'Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0'

With the above changes in place, things have been getting better for me - and I hope the same for you too! :-)