chatnoir-eu / chatnoir-resiliparse

A robust web archive analytics toolkit
https://resiliparse.chatnoir.eu
Apache License 2.0
80 stars 11 forks source link

Fix HTTP status code parsing (reason phrase may contain spaces) #3

Closed Querela closed 3 years ago

Querela commented 3 years ago

The field WarcRecord.http_headers could include the HTTP status code or it could be provided as an extra attribute to WarcRecord.

When reading a record it is not easily visible what status code a response had. For example, if I would like to only filter 301 redirection content, I'm not able to do this, as far as I can see. (Or just filter 200 responses for further processing.) The other HTTP headers are parsed but not the HTTP status line which has a simple format, e. g. HTTP/1.X XXX Description, that could be integrated to the existing HTTP header parsing. I also found no simple way like .reader to access the HTTP communication.

Example:

>>> record.headers
{'WARC-Type': 'response', 'WARC-Target-URI': 'http://vgperson.com/robots.txt', 'WARC-Date': '2021-08-09T13:25:55Z', 'WARC-Payload-Digest': 'sha1:OLD2B4B3YRYMAUJAQNATRPULWOOXO3YP', 'WARC-IP-Address': '85.214.122.46', 'WARC-Record-ID': '<urn:uuid:5eeb5cea-38e6-4904-a4f9-077a162bc0d6>', 'Content-Type': 'application/http; msgtype=response', 'Content-Length': '454'}
>>> record.http_headers
{'Date': 'Mon, 09 Aug 2021 13:25:53 GMT', 'Server': 'Apache', 'Location': 'https://vgperson.com/robots.txt', 'Content-Length': '239', 'Connection': 'close', 'Content-Type': 'text/html; charset=iso-8859-1'}
>>> content = record.reader.read()
>>> assert len(content) == record.content_length  # content only includes the real content, no access to HTTP stuff
>>> print(content)
b'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">\n<html><head>\n<title>301 Moved Permanently</title>\n</head><body>\n<h1>Moved Permanently</h1>\n<p>The document has moved <a href="https://vgperson.com/robots.txt">here</a>.</p>\n</body></html>\n'

HTTP communication:

WARC/1.0
WARC-Type: response
WARC-Target-URI: http://vgperson.com/robots.txt
WARC-Date: 2021-08-09T13:25:55Z
WARC-Payload-Digest: sha1:OLD2B4B3YRYMAUJAQNATRPULWOOXO3YP
WARC-IP-Address: 85.214.122.46
WARC-Record-ID: <urn:uuid:5eeb5cea-38e6-4904-a4f9-077a162bc0d6>
Content-Type: application/http; msgtype=response
Content-Length: 454

HTTP/1.1 301 Moved Permanently
Date: Mon, 09 Aug 2021 13:25:53 GMT
Server: Apache
Location: https://vgperson.com/robots.txt
Content-Length: 239
Connection: close
Content-Type: text/html; charset=iso-8859-1

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>301 Moved Permanently</title>
</head><body>
<h1>Moved Permanently</h1>
<p>The document has moved <a href="https://vgperson.com/robots.txt">here</a>.</p>
</body></html>
Querela commented 3 years ago

Ok. I found the record.http_headers.status_line attribute but the .status_code does not work? Same example as above:

>>> record.headers.status_code  # expected
>>> record.headers.status_line
'WARC/1.0'
>>> record.http_headers.status_code  # unexpected
>>> record.http_headers.status_line
'HTTP/1.1 301 Moved Permanently'

I believe the following line should be changed: https://github.com/chatnoir-eu/chatnoir-resiliparse/blob/87d2ef40982891783075738714e09a12a5e2d184/fastwarc/warc.pyx#L205

# split at most 2 times, so the possible third part contains the whole reason phrase
s = self._status_line.split(b' ', 2)

It seems as if Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF does not mean the SP can't appear in then Reason-Phrase, as is later specified with Reason-Phrase = *<TEXT, excluding CR, LF> in section 6.1.1.

Querela commented 3 years ago

I think I oversaw something. And this might have already existed before the updated status code parsing:

>>> record.http_headers.status_line
'HTTP/1.1 200 OK'
>>> record.http_headers.status_code
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "fastwarc/warc.pyx", line 208, in fastwarc.warc.WarcHeaderMap.status_code.__get__
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
>>>

https://github.com/chatnoir-eu/chatnoir-resiliparse/blob/dc37ee7e0eb73f1eb31c9cd708b42350a773fdcc/fastwarc/warc.pyx#L208

should be changed to int(s[1]), as s is still a list. I'm somewhat unsure how this came to be, as the parsing of the status line should be correct but I would still not have gotten the correct status code integer, so my pull request was also only half-finished.