arquivo / pwa-technologies

Arquivo.pt's main goal is the preservation of, and access to, web content that is no longer available online. During the development of the PWA IR (information retrieval) system we faced limitations in search speed, quality of results, scalability and usability. To cope with this, we modified the archive-access project (http://archive-access.sourceforge.net/) to support our web-archive IR requirements. The code of NutchWAX, Nutch and Wayback was adapted to meet these requirements. Several optimizations were added, such as simplifications in the way document versions are searched, and several bottlenecks were resolved. The PWA search engine is a public service at http://arquivo.pt and a research platform for web archiving. Like its predecessor Nutch, it runs on Hadoop clusters for distributed computing following the map-reduce paradigm. Its major features include fast full-text search, URL search, phrase search, faceted search (date, format, site), and sorting by relevance and date. The PWA search engine is highly scalable, and its architecture is flexible enough to allow different configurations to be deployed for different needs. It currently serves an archive collection of 180 million documents, dating from 1996 to 2010, searchable by full text.
http://www.arquivo.pt
GNU General Public License v3.0

Inconsistent json output from CDX server API #401

Open miguelwon opened 6 years ago

miguelwon commented 6 years ago

I'm trying to extract all "dn.pt" URLs within a given time interval. It works in some cases, but in many others the output of a request is not consistent. For example, the following request:

http://arquivo.pt/wayback/cdx?url=dn.pt/&matchType=prefix&from=201010010000&to=201011010000&filter==mime:text/html&fl=url,timestamp,filename,status&output=json

Apparently some chunks contain invalid characters. For example, using Python with requests (urllib2 results in the same error):

>>> import requests
>>> req = requests.get('http://arquivo.pt/wayback/cdx?url=dn.pt/&matchType=prefix&from=201010010000&to=201011010000&filter==mime:text/html&fl=url,timestamp,filename,status&output=json')

Traceback (most recent call last):
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/urllib3/response.py", line 543, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/requests/models.py", line 745, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/urllib3/response.py", line 432, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/urllib3/response.py", line 626, in read_chunked
    self._original_response.close()
  File "/Users/miguelwon/anaconda3/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/urllib3/response.py", line 320, in _error_catcher
    raise ProtocolError('Connection broken: %r' % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 658, in send
    r.content
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/requests/models.py", line 823, in content
    self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/requests/models.py", line 748, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
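A possible client-side workaround for the traceback above is to stream the response line by line and keep whatever records arrived before the connection broke, instead of letting `requests` buffer the whole body and discard everything on a `ChunkedEncodingError`. This is only a sketch; the helper names are invented here and nothing below is part of the Arquivo.pt API:

```python
import json

import requests


def parse_cdx_json_lines(lines):
    """Parse CDX records returned one JSON object per line,
    skipping lines that are empty or truncated/invalid JSON."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            # likely the garbled line at the point the connection broke
            continue
    return records


def fetch_cdx(url):
    """Stream a CDX response, keeping the records received so far
    even if the server closes the connection mid-transfer."""
    lines = []
    try:
        with requests.get(url, stream=True, timeout=60) as resp:
            for raw in resp.iter_lines(decode_unicode=True):
                lines.append(raw)
    except requests.exceptions.ChunkedEncodingError:
        pass  # keep the partial result instead of losing everything
    return parse_cdx_json_lines(lines)
```

This does not fix the server-side inconsistency, but the records that were already transferred are no longer lost when the chunked stream ends abruptly.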
Fernando-Melo commented 4 years ago

Using the CDX server with matchType=prefix to try to get all the results from a given domain is not a good idea. In general, it is not a good idea to use our APIs to retrieve very extensive lists of results.
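One way to follow this advice while still covering a long interval is to split it into short windows and issue one narrow CDX query per window, each with its own `from`/`to` pair. A minimal sketch, assuming the 14-digit CDX timestamp format; the helper name and the one-day window size are arbitrary choices:

```python
from datetime import datetime, timedelta


def cdx_windows(start, end, days=1):
    """Split a [start, end] interval into consecutive windows,
    yielding (from, to) pairs in the 14-digit CDX timestamp format."""
    fmt = "%Y%m%d%H%M%S"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    step = timedelta(days=days)
    while t0 < t1:
        t_next = min(t0 + step, t1)
        yield t0.strftime(fmt), t_next.strftime(fmt)
        t0 = t_next
```

Each yielded pair then goes into a separate request, e.g. `...&from=20101001000000&to=20101002000000&...`, so no single query asks the server for a month of prefix matches at once.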

miguelwon commented 3 years ago

Hi,

Is there any update on this issue? The output from the CDX API is still inconsistent. I have limited the time period, trying to avoid an extensive list of results, and even so the problems remain. Using the example given in the API documentation, I get results with

https://arquivo.pt/wayback/cdx?url=sapo.pt/noticias/*&from=20150101000000&to=20150107000000&fl=urlkey&output=json

or

https://arquivo.pt/wayback/cdx?url=sapo.pt/noticias/*&from=20160101000000&to=20160107000000&fl=urlkey&output=json

but for 2017 and 2018 no results are returned:

https://arquivo.pt/wayback/cdx?url=sapo.pt/noticias/*&from=20170101000000&to=20170107000000&fl=urlkey&output=json

https://arquivo.pt/wayback/cdx?url=sapo.pt/noticias/*&from=20180101000000&to=20180107000000&fl=urlkey&output=json

amourao commented 3 years ago

Temporal filters using the full date format ("20150107000000") are very slow: over 30 seconds.

If no filter is applied, the CDX API starts sending results immediately.

PedroG1515 commented 2 years ago

The script is done. Next step will be to use the new patching