Closed by Querela 3 years ago
Ok. I found the record.http_headers.status_line attribute, but .status_code does not work?
Same example as above:
>>> record.headers.status_code # expected
>>> record.headers.status_line
'WARC/1.0'
>>> record.http_headers.status_code # unexpected
>>> record.http_headers.status_line
'HTTP/1.1 301 Moved Permanently'
I believe the following line should be changed: https://github.com/chatnoir-eu/chatnoir-resiliparse/blob/87d2ef40982891783075738714e09a12a5e2d184/fastwarc/warc.pyx#L205
# split at most 2 times, so the possible third part contains the whole reason phrase
s = self._status_line.split(b' ', 2)
It seems that Status-Line = HTTP-Version SP Status-Code SP Reason-Phrase CRLF does not mean that SP can't appear in the Reason-Phrase, since the latter is specified as Reason-Phrase = *<TEXT, excluding CR, LF> in section 6.1.1 of RFC 2616.
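The effect of the maxsplit argument can be checked in plain Python. This is a minimal sketch, independent of FastWARC; the helper name split_status_line is made up for illustration and is not part of the library:

```python
def split_status_line(status_line: bytes) -> list:
    """Split a status line into HTTP-Version, Status-Code and Reason-Phrase.

    Hypothetical helper for illustration only, not FastWARC API.
    """
    # At most 2 splits: everything after the second space stays in one piece,
    # so a reason phrase containing spaces is not torn apart.
    return status_line.split(b' ', 2)


line = b'HTTP/1.1 301 Moved Permanently'

# Unbounded split breaks the reason phrase into separate words.
print(line.split(b' '))
# -> [b'HTTP/1.1', b'301', b'Moved', b'Permanently']

# Bounded split keeps the reason phrase intact as the third element.
print(split_status_line(line))
# -> [b'HTTP/1.1', b'301', b'Moved Permanently']
```

In both cases the status code is the second element, so the bounded split mainly matters if the reason phrase is needed as a whole.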
I think I overlooked something. And this might have already existed before the updated status code parsing:
>>> record.http_headers.status_line
'HTTP/1.1 200 OK'
>>> record.http_headers.status_code
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "fastwarc/warc.pyx", line 208, in fastwarc.warc.WarcHeaderMap.status_code.__get__
TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
>>>
The int() call at line 208 should be changed to int(s[1]), as s is still a list.
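The failure and the proposed fix can be reproduced outside of FastWARC. The following is only a sketch of the idea: status_code_from_line is a hypothetical stand-in for the property getter, and returning None for lines without a numeric code (such as 'WARC/1.0') is an assumption based on the behaviour shown earlier in this thread, not a statement about the actual implementation.

```python
from typing import Optional


def status_code_from_line(status_line: bytes) -> Optional[int]:
    """Return the integer status code, or None if there is no numeric code.

    Hypothetical helper, not the actual FastWARC implementation.
    """
    s = status_line.split(b' ', 2)
    # s is a list such as [b'HTTP/1.1', b'200', b'OK']. Passing the list
    # itself to int() raises TypeError, so take the second element instead.
    if len(s) < 2 or not s[1].isdigit():
        # Assumption: lines like b'WARC/1.0' simply have no status code.
        return None
    return int(s[1])


print(status_code_from_line(b'HTTP/1.1 200 OK'))                 # -> 200
print(status_code_from_line(b'HTTP/1.1 301 Moved Permanently'))  # -> 301
print(status_code_from_line(b'WARC/1.0'))                        # -> None
```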
I'm somewhat unsure how this came about: the status line is split correctly, but I would still not have gotten the status code as an integer, so my pull request was also only half-finished.
The field WarcRecord.http_headers could include the HTTP status code, or it could be provided as an extra attribute of WarcRecord.
When reading a record, it is not easy to see what status code a response had. For example, if I want to keep only 301 redirection content, I'm not able to do this, as far as I can see. (Or keep only 200 responses for further processing.) The other HTTP headers are parsed, but not the HTTP status line, which has a simple format, e.g. HTTP/1.X XXX Description, that could be integrated into the existing HTTP header parsing. I also found no simple way, like .reader, to access the HTTP communication.
Example:
HTTP communication: