chfoo / warcat

Tool and library for handling Web ARChive (WARC) files.
GNU General Public License v3.0

'utf-8' codec can't decode byte invalid continuation byte #12

Closed fanchyna closed 7 years ago

fanchyna commented 8 years ago

I've installed warcat on my server under Python 3.4. Calling warc.load() on a WARC file gives me the following error message:

>> warc.load("/gstorage01/external-data/internet-archive/archive.org/download/archiveteam_pdf_20160412083746/pdf_20160412083746.megawarc.warc.gz")
Content block length changed from 92850 to 92849
Content block length changed from 150326 to 150325
Content block length changed from 156258 to 156257
Content block length changed from 129362 to 129361
Content block length changed from 156196 to 156195
Content block length changed from 129336 to 129335
Content block length changed from 147763 to 147762
Content block length changed from 129338 to 129337
Content block length changed from 129350 to 129349
Content block length changed from 156195 to 156194
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python3.4/site-packages/warcat/model/warc.py", line 32, in load
    self.read_file_object(f)
  File "/usr/lib/python3.4/site-packages/warcat/model/warc.py", line 39, in read_file_object
    record, has_more = self.read_record(file_object)
  File "/usr/lib/python3.4/site-packages/warcat/model/warc.py", line 75, in read_record
    check_block_length=check_block_length)
  File "/usr/lib/python3.4/site-packages/warcat/model/record.py", line 68, in load
    content_type)
  File "/usr/lib/python3.4/site-packages/warcat/model/block.py", line 21, in load
    field_cls=HTTPHeader)
  File "/usr/lib/python3.4/site-packages/warcat/model/block.py", line 92, in load
    fields = field_cls.parse(file_obj.read(field_length).decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe1 in position 26: invalid continuation byte

The data is available from the Internet Archive website, where anyone can download it. The file is about 130 GB, but I don't think the size matters. The key question is how a codec error can happen here.

jeffcasavant commented 8 years ago

I think I'm having the same issue. This is a WARC file that was built using the Internet Archive's warc library.

[jeff warc]$ warcat split my.warc.gz 
Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.5/site-packages/warcat/__main__.py", line 154, in <module>
    main()
  File "/usr/lib/python3.5/site-packages/warcat/__main__.py", line 70, in main
    command_info[1](args)
  File "/usr/lib/python3.5/site-packages/warcat/__main__.py", line 126, in split_command
    tool.process()
  File "/usr/lib/python3.5/site-packages/warcat/tool.py", line 95, in process
    check_block_length=self.check_block_length)
  File "/usr/lib/python3.5/site-packages/warcat/model/warc.py", line 75, in read_record
    check_block_length=check_block_length)
  File "/usr/lib/python3.5/site-packages/warcat/model/record.py", line 68, in load
    content_type)
  File "/usr/lib/python3.5/site-packages/warcat/model/block.py", line 21, in load
    field_cls=HTTPHeader)
  File "/usr/lib/python3.5/site-packages/warcat/model/block.py", line 92, in load
    fields = field_cls.parse(file_obj.read(field_length).decode())
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 712: invalid start byte
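
Both tracebacks end at the same line in block.py, where the HTTP header bytes are decoded with a bare `.decode()`, which defaults to UTF-8. Servers are not required to send UTF-8 in headers, so a single non-ASCII byte from another encoding is enough to trigger the error. The sketch below reproduces that failure and shows one possible workaround. It is not warcat's actual fix: the sample header bytes and the `decode_header_bytes` helper are hypothetical, and falling back to ISO-8859-1 is an assumption about what the original server sent.

```python
def decode_header_bytes(raw: bytes) -> str:
    """Decode header bytes as UTF-8, falling back to ISO-8859-1.

    Hypothetical helper for illustration; not part of warcat.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte value to a code point, so this
        # fallback cannot raise, although it may produce the wrong
        # characters if the real encoding was something else.
        return raw.decode("iso-8859-1")


# Made-up header: 0xe1 is "á" in ISO-8859-1. In UTF-8 the same byte
# announces a three-byte sequence, and the ASCII "r" after it is not a
# valid continuation byte.
raw = b"Content-Disposition: attachment; filename=\xe1rbol.pdf\r\n"

try:
    raw.decode()  # same call as in the traceback, UTF-8 by default
except UnicodeDecodeError as error:
    print("strict UTF-8 decode failed:", error.reason)

print(decode_header_bytes(raw).strip())
```

If keeping the exact characters does not matter, `raw.decode("utf-8", errors="replace")` is another option that never raises; it substitutes U+FFFD for each undecodable byte instead of guessing an encoding.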