arquivo / pwa-technologies

Arquivo.pt's main goal is the preservation of, and access to, web content that is no longer available online. During the development of the PWA IR (information retrieval) system we faced limitations in search speed, quality of results, scalability and usability. To cope with this, we modified the archive-access project (http://archive-access.sourceforge.net/) to support our web-archive IR requirements. The code of NutchWAX, Nutch and Wayback was adapted to meet these requirements. Several optimizations were added, such as simplifications in the way document versions are searched, and several bottlenecks were resolved. The PWA search engine is a public service at http://arquivo.pt and a research platform for web archiving. Like its predecessor Nutch, it runs on Hadoop clusters for distributed computing following the map-reduce paradigm. Its major features include fast full-text search, URL search, phrase search, faceted search (date, format, site), and sorting by relevance and date. The PWA search engine is highly scalable, and its architecture is flexible enough to allow different configurations to be deployed for different needs. It currently serves an archive collection of 180 million documents, dating from 1996 to 2010, searchable by full text.
http://www.arquivo.pt
GNU General Public License v3.0

Inconsistent json output from CDX server API #401

Open miguelwon opened 6 years ago

miguelwon commented 6 years ago

I'm trying to extract all "dn.pt" URLs within a given time interval. It works in some cases, but in many others the output of a request is not consistent. For example, the following request:

http://arquivo.pt/wayback/cdx?url=dn.pt/&matchType=prefix&from=201010010000&to=201011010000&filter==mime:text/html&fl=url,timestamp,filename,status&output=json

Apparently some chunks contain invalid characters. For example, using Python with requests (urllib2 results in the same error):

>>> import requests
>>> req = requests.get('http://arquivo.pt/wayback/cdx?url=dn.pt/&matchType=prefix&from=201010010000&to=201011010000&filter==mime:text/html&fl=url,timestamp,filename,status&output=json')

Traceback (most recent call last):
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/urllib3/response.py", line 543, in _update_chunk_length
    self.chunk_left = int(line, 16)
ValueError: invalid literal for int() with base 16: b''

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/requests/models.py", line 745, in generate
    for chunk in self.raw.stream(chunk_size, decode_content=True):
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/urllib3/response.py", line 432, in stream
    for line in self.read_chunked(amt, decode_content=decode_content):
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/urllib3/response.py", line 626, in read_chunked
    self._original_response.close()
  File "/Users/miguelwon/anaconda3/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/urllib3/response.py", line 320, in _error_catcher
    raise ProtocolError('Connection broken: %r' % e, e)
urllib3.exceptions.ProtocolError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/requests/api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/requests/api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 508, in request
    resp = self.send(prep, **send_kwargs)
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/requests/sessions.py", line 658, in send
    r.content
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/requests/models.py", line 823, in content
    self._content = bytes().join(self.iter_content(CONTENT_CHUNK_SIZE)) or bytes()
  File "/Users/miguelwon/anaconda3/lib/python3.6/site-packages/requests/models.py", line 748, in generate
    raise ChunkedEncodingError(e)
requests.exceptions.ChunkedEncodingError: ('Connection broken: IncompleteRead(0 bytes read)', IncompleteRead(0 bytes read))
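A possible client-side workaround for the traceback above is to stream the response line by line and keep whatever records arrived before the connection broke, instead of letting `requests` buffer the whole body and discard everything on a `ChunkedEncodingError`. This is only a sketch; the helper names are invented here and nothing below is part of the Arquivo.pt API:

```python
import json

import requests


def parse_cdx_json_lines(lines):
    """Parse CDX records returned one JSON object per line,
    skipping lines that are empty or truncated/invalid JSON."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            # likely the garbled line at the point the connection broke
            continue
    return records


def fetch_cdx(url):
    """Stream a CDX response, keeping the records received so far
    even if the server closes the connection mid-transfer."""
    lines = []
    try:
        with requests.get(url, stream=True, timeout=60) as resp:
            for raw in resp.iter_lines(decode_unicode=True):
                lines.append(raw)
    except requests.exceptions.ChunkedEncodingError:
        pass  # keep the partial result instead of losing everything
    return parse_cdx_json_lines(lines)
```

This does not fix the server-side inconsistency, but the records that were already transferred are no longer lost when the chunked stream ends abruptly.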
Fernando-Melo commented 4 years ago

Using the CDX server with matchType=prefix to try to get all the results from a given domain is not a good idea. In general, it is not a good idea to use our APIs to retrieve very extensive lists of results.
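One way to follow this advice while still covering a long interval is to split it into short windows and issue one narrow CDX query per window, each with its own `from`/`to` pair. A minimal sketch, assuming the 14-digit CDX timestamp format; the helper name and the one-day window size are arbitrary choices:

```python
from datetime import datetime, timedelta


def cdx_windows(start, end, days=1):
    """Split a [start, end] interval into consecutive windows,
    yielding (from, to) pairs in the 14-digit CDX timestamp format."""
    fmt = "%Y%m%d%H%M%S"
    t0 = datetime.strptime(start, fmt)
    t1 = datetime.strptime(end, fmt)
    step = timedelta(days=days)
    while t0 < t1:
        t_next = min(t0 + step, t1)
        yield t0.strftime(fmt), t_next.strftime(fmt)
        t0 = t_next
```

Each yielded pair then goes into a separate request, e.g. `...&from=20101001000000&to=20101002000000&...`, so no single query asks the server for a month of prefix matches at once.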

miguelwon commented 3 years ago

Hi,

Is there any update on this issue? The output from the CDX API is still inconsistent. I have limited the time period, trying to avoid an extensive list of results, and even so the problems remain. Using the example given in the API documentation, I get results with

https://arquivo.pt/wayback/cdx?url=sapo.pt/noticias/*&from=20150101000000&to=20150107000000&fl=urlkey&output=json

or

https://arquivo.pt/wayback/cdx?url=sapo.pt/noticias/*&from=20160101000000&to=20160107000000&fl=urlkey&output=json

but for 2017 and 2018 no results are returned:

https://arquivo.pt/wayback/cdx?url=sapo.pt/noticias/*&from=20170101000000&to=20170107000000&fl=urlkey&output=json

https://arquivo.pt/wayback/cdx?url=sapo.pt/noticias/*&from=20180101000000&to=20180107000000&fl=urlkey&output=json

amourao commented 3 years ago

Temporal filters using the full date format ("20150107000000") are very slow: over 30 seconds.

If no filter is applied, the CDX API starts sending results immediately.

PedroG1515 commented 2 years ago

The script is done. Next step will be to use the new patching