commoncrawl / cc-index-server

Common Crawl Index Server
http://index.commoncrawl.org/

Allow fl= parameter to request partially absent fields #9

Open sebastian-nagel opened 4 years ago

sebastian-nagel commented 4 years ago

If a field requested via the fl parameter is missing from one of the records, query processing aborts with an exception and the result list is truncated:

Traceback (most recent call last):
  File "/var/venv/lib/python3.5/site-packages/pywb/cdx/cdxobject.py", line 186, in to_text
    result = ' '.join(str(self[x]) for x in fields) + '\n'
  File "/var/venv/lib/python3.5/site-packages/pywb/cdx/cdxobject.py", line 186, in <genexpr>
    result = ' '.join(str(self[x]) for x in fields) + '\n'
KeyError: 'languages'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/var/venv/lib/python3.5/site-packages/pywb/framework/wbrequestresponse.py", line 221, in encode
    for obj in stream:
  File "/var/venv/lib/python3.5/site-packages/pywb/cdx/cdxops.py", line 53, in cdx_to_text
    yield cdx.to_text(fields)
  File "/var/venv/lib/python3.5/site-packages/pywb/cdx/cdxobject.py", line 190, in to_text
    raise CDXException(msg)
pywb.cdx.cdxobject.CDXException: Invalid field "'languages'" found in fields= argument

The absence of a field should be handled. Ideally fl=url,languages and fl=url should return the same number of results with no/empty values for the missing fields.
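
A minimal sketch of the desired behavior, not the PyWB code itself: the traceback above fails on `self[x]`, so a lookup with a default would keep the record in the output. The standalone `to_text` function, the sample records, and the `-` placeholder for absent values are assumptions made for illustration.

```python
def to_text(record, fields, missing="-"):
    """Join the requested fields of one record into a space-separated line.

    Fields absent from the record are replaced by `missing` (a hypothetical
    placeholder) instead of raising KeyError.
    """
    return " ".join(str(record.get(name, missing)) for name in fields) + "\n"


# Hypothetical records: the second one has no 'languages' field.
records = [
    {"url": "http://example.com/", "languages": "eng"},
    {"url": "http://example.org/"},
]

# Both field lists now yield one line per record.
with_languages = [to_text(r, ["url", "languages"]) for r in records]
url_only = [to_text(r, ["url"]) for r in records]
```

With this approach the number of output lines no longer depends on which fields are requested; only the placeholder marks where a value was absent.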

Currently, the URL index is still based on PyWB 0.33.2. PyWB 2.3.0 also crashes when a requested field does not exist in a record (there the parameter is named fields, see #8) and output=text is used:

  File ".../pywb/warcserver/index/cdxobject.py", line 186, in to_text
    result = ' '.join(str(self[x]) for x in fields) + '\n'
  File ".../pywb/warcserver/index/cdxobject.py", line 186, in <genexpr>
    result = ' '.join(str(self[x]) for x in fields) + '\n'
KeyError: 'languages'

sebastian-nagel commented 3 years ago

Ok, this will work with PyWB 2.5.0 (see webrecorder/pywb@92e459b in cdxobject.py).