"Malformed response from arXiv API - no data in feed" woes...

I have been having a hard time getting my queries to complete lately: they get stuck in almost infinite loops of messages like:

Malformed response from arXiv API - no data in feed
Malformed response from arXiv API - no data in feed
Malformed response from arXiv API - no data in feed
...

The queries actually return far fewer than 50000 results, the supposed limit of arXiv's API: they range anywhere between 3000 and 12000 results. Here is an example:

category='math.AG'
start_date='20170101'
end_date='20190827'
getpapers --api 'arxiv' --query "cat:$category AND lastUpdatedDate:[${start_date}* TO ${end_date}*] " --outdir "$category" -p -l debug

In this issue (not strictly a 'bug') I document my attempts to get past those showstoppers. Here's what I did:

Set page size to 1000

I experimented with page sizes from 200 to 2000:

At 200, it takes ages to fetch all 10000+ results, and the many extra queries this requires raise the risk of entering the above-mentioned infinite loop of death.
At 2000, many responses contain far fewer than 2000 results, yet the feed is not completely empty, so the shortfall is currently not detected. See https://github.com/ContentMine/getpapers/issues/177 for a description of this bug and a solution.
At 500, it still takes too long to fetch them all.
At 1000, you get more results per request, you finish faster and you send fewer queries, while the risk of entering the infinite loop of death is no higher than at 500. Plus: you don't automatically get just 200 results back, as seems to be the case with 2000...

I thus settled for a page size of 1000 in getpapers/lib/arxiv.js:

arxiv.pagesize = 1000
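
To put rough numbers on the trade-off above, here is a small sketch that counts the requests needed at each page size. The function name requestsNeeded and the figure of 10000 results are assumptions chosen for illustration (10000 sits inside the 3000 to 12000 range mentioned earlier); none of this is getpapers code.

```javascript
// Hypothetical helper: how many requests does it take to page through
// `totalResults` entries when each request returns up to `pageSize` entries?
function requestsNeeded(totalResults, pageSize) {
  return Math.ceil(totalResults / pageSize)
}

// Assumed round figure for illustration only.
const totalResults = 10000

// The page sizes tried above.
for (const pageSize of [200, 500, 1000, 2000]) {
  console.log(pageSize + ' per page: ' +
    requestsNeeded(totalResults, pageSize) + ' requests')
}
```

Every request is another chance to hit an empty feed, which is why the smallest page sizes fare worst here.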

Set a higher delay between retries

I experimented with various delays too: the default of 3 seconds hammers the server far too fast, while 30 seconds spends too much time sleeping. 15 or 20 seconds seems to be O.K., so I have set

arxiv.page_delay = 20000

in getpapers/lib/arxiv.js
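
For readers who want to see the idea in isolation, below is a minimal sketch of a retry loop that waits between attempts and gives up after a fixed number of tries instead of looping forever. It is not how getpapers implements retries: fetchWithRetry, fetchPage, maxRetries and the injectable sleep are all hypothetical names, and the cap on attempts is my own addition.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
function sleep(ms) {
  return new Promise(function (resolve) { setTimeout(resolve, ms) })
}

// `fetchPage` is an assumed function that returns a promise for an array
// of feed entries; an empty array stands for "no data in feed".
async function fetchWithRetry(fetchPage, options) {
  const delayMs = options.delayMs       // e.g. 20000, as chosen above
  const maxRetries = options.maxRetries // cap, so the loop cannot run forever
  const wait = options.sleep || sleep   // injectable, so the delay can be stubbed

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const entries = await fetchPage()
    if (entries.length > 0) return entries
    // Only sleep if another attempt will follow.
    if (attempt < maxRetries) await wait(delayMs)
  }
  throw new Error('No data in feed after ' + (maxRetries + 1) + ' attempts')
}
```

A call might look like fetchWithRetry(fetchPage, { delayMs: 20000, maxRetries: 5 }), where both values are up to you.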

Do not urlencode the whole query URL, only the parts that need it

See https://github.com/ContentMine/getpapers/issues/178 for this.
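
As a general illustration of the principle (not a copy of the change discussed in that issue), the sketch below encodes only the value of search_query and leaves the URL's own separators untouched. buildQueryUrl is a hypothetical name; the endpoint and the search_query, start and max_results parameters are the ones documented for the arXiv API.

```javascript
// Public arXiv API endpoint.
const base = 'http://export.arxiv.org/api/query'

// Hypothetical helper: build a query URL, encoding only the search expression.
// The '?', '=' and '&' that structure the URL are left as they are.
function buildQueryUrl(searchQuery, start, maxResults) {
  return base +
    '?search_query=' + encodeURIComponent(searchQuery) +
    '&start=' + start +
    '&max_results=' + maxResults
}

const url = buildQueryUrl(
  'cat:math.AG AND lastUpdatedDate:[20170101* TO 20190827*]', 0, 1000)
console.log(url)
```

Running encodeURIComponent over the whole URL instead would also escape the ':' and '/' of the scheme and the '?' and '&' separators, so the result would no longer be a usable URL.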

Fix the bug where the results feed is not empty - but not full either...

See https://github.com/ContentMine/getpapers/issues/177 for details.
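
To make the "not empty, but not full" case concrete, here is one possible way to detect it; the actual fix proposed in that issue may differ. The sketch assumes you know the total result count (the arXiv feed reports it as opensearch:totalResults) and the offset of the page you asked for. classifyPage and its parameter names are hypothetical.

```javascript
// Hypothetical helper: decide whether a page came back empty, partial or
// complete, given how many entries it should have contained.
function classifyPage(entryCount, start, pageSize, totalResults) {
  // The last page may legitimately hold fewer than `pageSize` entries.
  const expected = Math.max(0, Math.min(pageSize, totalResults - start))
  if (entryCount === 0 && expected > 0) return 'empty'
  if (entryCount < expected) return 'partial'
  return 'complete'
}

// A 2000-entry page that came back with only 200 entries.
console.log(classifyPage(200, 0, 2000, 10000))
```

A 'partial' page could then be retried in the same way as an 'empty' one, rather than being accepted silently.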

Last but not least (and I will repeat myself on this): do yourself a favour and spoof your User-Agent in getpapers/lib/config.js:

config.userAgent = 'Mozilla/5.0 (X11; Linux x86_64; rv:58.0) Gecko/20100101 Firefox/58.0'

With the above changes in place, things have been getting better for me - and I hope the same for you too! :-)
