Open miguelwon opened 6 years ago
Using cdx-server with matchType=prefix to try to get all the results for a given domain is not a good idea. In general, it is not a good idea to use our APIs to retrieve very extensive lists of results.
Hi,
Is there any update on this issue? The output from the CDX API is still inconsistent. I have limited the time period to avoid an extensive list of results, and even so the problems remain. Using the example given in the API documentation, I get results when using
or
but for 2017 and 2018 no results are returned:
Temporal filters using the full date format, e.g. "20150107000000", are very slow: over 30 seconds.
If no filter is applied, the CDX API starts sending results immediately.
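To compare the filtered and unfiltered cases side by side, the request URLs can be built programmatically. The following is only a minimal sketch: the endpoint and parameter names are taken from the request quoted in this thread, while `build_cdx_url` and its arguments are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

# Endpoint as it appears in the request quoted in this thread.
CDX_ENDPOINT = "http://arquivo.pt/wayback/cdx"


def build_cdx_url(url, start=None, end=None, mime="text/html"):
    """Build a prefix-match CDX query (hypothetical helper, not part of any API).

    `start` and `end` are timestamp strings such as "201010010000";
    leaving both as None produces a request without a temporal filter.
    """
    params = {"url": url, "matchType": "prefix"}
    if start is not None:
        params["from"] = start
    if end is not None:
        params["to"] = end
    params["filter"] = "=mime:" + mime
    params["fl"] = "url,timestamp,filename,status"
    params["output"] = "json"
    # Keep "/", ":", "=" and "," unescaped so the URL reads like the one in the thread.
    return CDX_ENDPOINT + "?" + urlencode(params, safe="/:=,")
```

Calling `build_cdx_url("dn.pt/")` gives the unfiltered request, and passing `start` and `end` adds the `from` and `to` parameters, so the two response times can be measured with the same client code.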
The script is done. Next step will be to use the new patching
I'm trying to extract all "dn.pt" URLs within a given time interval. Although it works in some cases, in many others the output of a request is not consistent. For example, the following request:
http://arquivo.pt/wayback/cdx?url=dn.pt/&matchType=prefix&from=201010010000&to=201011010000&filter==mime:text/html&fl=url,timestamp,filename,status&output=json
Apparently some chunks of the output contain invalid characters. For example, using Python with requests (urllib2 results in the same error):
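One way to narrow down which records carry the invalid characters is to decode the response line by line instead of all at once. This is only a sketch, assuming that `output=json` yields one JSON object per line (an assumption about the response format, not something confirmed in this thread); `parse_cdx_lines` is a hypothetical helper name.

```python
import json


def parse_cdx_lines(raw_lines):
    """Decode an iterable of raw byte lines, collecting the ones that fail.

    Assumes one JSON object per line (an assumption about the CDX output).
    Returns (records, bad), where bad holds (line_number, raw_bytes, error).
    """
    records, bad = [], []
    for lineno, raw in enumerate(raw_lines, start=1):
        if not raw.strip():
            continue  # skip empty keep-alive or trailing lines
        try:
            records.append(json.loads(raw.decode("utf-8")))
        except (UnicodeDecodeError, json.JSONDecodeError) as exc:
            # Keep the raw bytes so the offending record can be reported.
            bad.append((lineno, raw, str(exc)))
    return records, bad


# With requests, the byte lines could come from a streamed response, e.g.:
#   resp = requests.get(url, stream=True)
#   records, bad = parse_cdx_lines(resp.iter_lines())
```

The `bad` list then shows the line numbers and raw bytes of the records that could not be decoded, which may make the report easier to reproduce.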