I have been having a hard time getting my queries through lately - they get stuck in near-infinite loops of messages like:
Malformed response from arXiv API - no data in feed
Malformed response from arXiv API - no data in feed
Malformed response from arXiv API - no data in feed
...
The queries actually return far fewer than 50000 results, the supposed limit of arXiv's API - they land anywhere between 3000 and 12000 results. Here is an example:
category='math.AG'; start_date='20170101'; end_date='20190827'; getpapers --api 'arxiv' --query "cat:$category AND lastUpdatedDate:[${start_date}* TO ${end_date}*] " --outdir "$category" -p -l debug
In this issue (not strictly a 'bug') I document my attempts to get past those showstoppers. Here's what I did:
Set page size to 1000
I experimented with page sizes from 200 to 2000:
At 200, it takes ages to get all 10000+ results, and the much larger number of requests needed to fetch them raises the risk of entering the above-mentioned infinite loop of death.
At 2000, you get many responses that contain far fewer than 2000 results - yet the feed is not completely empty, so this currently goes undetected. See https://github.com/ContentMine/getpapers/issues/177 for a description of this bug and a solution.
At 500, it still takes too long to get them all.
At 1000, you get more results at once, you finish faster, and you send fewer queries - and the risk of entering the infinite loop of death is no higher than with just 500. Plus: you don't automatically get just 200 results back, as seems to be the case with 2000...
I thus settled on a page size of 1000 in getpapers/lib/arxiv.js:
arxiv.pagesize = 1000
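For reference, here is a minimal sketch of what a page size of 1000 means at the level of the raw arXiv API (the endpoint and the search_query/start/max_results parameters are the ones arXiv documents; the fetchPage helper is mine for illustration only, not getpapers code, and it assumes Node 18+ with global fetch):

const PAGE_SIZE = 1000;  // mirrors arxiv.pagesize above
const BASE = 'http://export.arxiv.org/api/query';

async function fetchPage(query, start) {
  // search_query, start and max_results are the documented arXiv API parameters
  const url = BASE + '?search_query=' + encodeURIComponent(query) +
              '&start=' + start + '&max_results=' + PAGE_SIZE;
  const res = await fetch(url);
  return res.text();  // an Atom XML feed; the entries get parsed downstream
}

// walk pages 0, 1000, 2000, ... until the feed runs out of entries
fetchPage('cat:math.AG AND lastUpdatedDate:[20170101* TO 20190827*]', 0)
  .then(xml => console.log(xml.length, 'bytes in the first page'));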
Set a higher delay between retries
I experimented with various delays too: the default 3 seconds hammers the API much too fast, while 30 seconds wastes too much time sleeping. 15 or 20 seconds seem to be OK, so I have set
arxiv.page_delay = 20000
in getpapers/lib/arxiv.js
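To illustrate where that delay goes (a sketch of the pattern only, not the actual getpapers retry logic - the sleep and fetchAllPages helpers are mine):

const PAGE_DELAY = 20000;  // ms, mirrors arxiv.page_delay above
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function fetchAllPages(fetchPage, totalResults, pageSize) {
  const pages = [];
  for (let start = 0; start < totalResults; start += pageSize) {
    pages.push(await fetchPage(start));  // one request per page of results
    await sleep(PAGE_DELAY);             // breathe before hitting the API again
  }
  return pages;
}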
Do not urlencode the whole query URL, only the parts that need it
See https://github.com/ContentMine/getpapers/issues/178 for this.
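The gist, as I read that issue: encode only the parts that need it (the search_query value), never the already-assembled URL. A quick sketch with Node's standard encodeURIComponent:

const query = 'cat:math.AG AND lastUpdatedDate:[20170101* TO 20190827*]';

// Wrong: encoding the assembled URL also mangles the '?', '=' and '&' separators
const broken = encodeURIComponent('http://export.arxiv.org/api/query?search_query=' + query);

// Right: encode only the query value, then assemble the URL around it
const ok = 'http://export.arxiv.org/api/query?search_query=' + encodeURIComponent(query);

console.log(ok);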
Correct the bug where the results feed is not empty - but not full either...
See https://github.com/ContentMine/getpapers/issues/177 for details.
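The short version, as I understand it: a page is broken not only when it contains zero entries, but also when it contains fewer entries than were requested (except for the last page), so such pages should be retried. A minimal sketch of that check (the function name and the simplified entry counting are mine, not the getpapers code):

function isShortPage(entries, start, pageSize, totalResults) {
  // only the last page is allowed to be short; every other page must be full
  const expected = Math.min(pageSize, totalResults - start);
  return entries.length < expected;
}

// e.g. a page with 200 entries where 1000 were requested out of 12000 total:
console.log(isShortPage(new Array(200), 0, 1000, 12000));  // true -> retry this page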
Last but not least... (I will repeat myself on this): do yourself a favour and spoof your User-Agent in getpapers/lib/config.js:
config.userAgent = 'Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0'
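That string simply ends up in the outgoing request headers; for illustration, the equivalent with a plain HTTP request looks like this (a sketch, not the getpapers internals; assumes Node 18+ with global fetch):

fetch('http://export.arxiv.org/api/query?search_query=cat:math.AG&max_results=1', {
  headers: {
    // a browser-like User-Agent, same string as config.userAgent above
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0'
  }
}).then(res => console.log(res.status));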
With the above changes in place, things have been getting better for me - and I hope the same for you too! :-)