chfoo / warcat

Tool and library for handling Web ARChive (WARC) files.
GNU General Public License v3.0

http.client.IncompleteRead crash during extract #6

Closed chfoo closed 10 years ago

chfoo commented 10 years ago
Traceback (most recent call last):
  File "/0/home/waxy/usr/local/lib/python3.4/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/0/home/waxy/usr/local/lib/python3.4/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/__main__.py", line 154, in <module>
    main()
  File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/__main__.py", line 70, in main
    command_info[1](args)
  File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/__main__.py", line 131, in extract_command
    tool.process()
  File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/tool.py", line 112, in process
    raise e
  File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/tool.py", line 106, in process
    self.action(record)
  File "/0/home/waxy/usr/local/lib/python3.4/site-packages/warcat/tool.py", line 229, in action
    shutil.copyfileobj(response, f)
  File "/0/home/waxy/usr/local/lib/python3.4/shutil.py", line 66, in copyfileobj
    buf = fsrc.read(length)
  File "/0/home/waxy/usr/local/lib/python3.4/http/client.py", line 500, in read
    return super(HTTPResponse, self).read(amt)
  File "/0/home/waxy/usr/local/lib/python3.4/http/client.py", line 529, in readinto
    return self._readinto_chunked(b)
  File "/0/home/waxy/usr/local/lib/python3.4/http/client.py", line 621, in _readinto_chunked
    n = self._safe_readinto(mvb)
  File "/0/home/waxy/usr/local/lib/python3.4/http/client.py", line 680, in _safe_readinto
    raise IncompleteRead(bytes(mvb[0:total_bytes]), len(b))
http.client.IncompleteRead: IncompleteRead(7052 bytes read, 16384 more expected)
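The traceback shows `shutil.copyfileobj(response, f)` failing partway through a chunked HTTP body that ends early. One possible way to tolerate such a record is sketched below. It is not warcat's actual code: `copy_tolerating_truncation`, `make_response`, and `_FakeSocket` are hypothetical names introduced for illustration, and the sketch assumes `response` is an `http.client.HTTPResponse`, as the traceback suggests.

```python
import http.client
import io


class _FakeSocket:
    """Hypothetical stand-in for a socket so HTTPResponse can parse raw bytes."""

    def __init__(self, data):
        self._file = io.BytesIO(data)

    def makefile(self, *args, **kwargs):
        return self._file


def make_response(raw):
    """Build an HTTPResponse from the raw bytes of a stored HTTP response."""
    response = http.client.HTTPResponse(_FakeSocket(raw))
    response.begin()  # parse the status line and headers
    return response


def copy_tolerating_truncation(response, out, chunk_size=16384):
    """Copy the response body to `out`.

    Returns True if the body was read completely. If the body is truncated,
    writes whatever bytes the exception reports and returns False instead of
    letting IncompleteRead propagate.
    """
    try:
        while True:
            buf = response.read(chunk_size)
            if not buf:
                return True
            out.write(buf)
    except http.client.IncompleteRead as error:
        # `partial` holds the bytes http.client managed to collect before
        # the stream ended; how much that covers may vary by Python version.
        out.write(error.partial)
        return False
```

A caller could log a warning and move on to the next record when the function returns `False`, rather than aborting the whole extraction. Note that the bytes of the chunk that was cut off may not be included in `error.partial`, so the saved file can be shorter than what the archive physically contains.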
waxpancake commented 10 years ago

A little more information to reproduce this crash: I was running warcat on a 25 GB megawarc with this command:

python3 -m warcat extract ~/archives/incoming/upcoming_20130420095943.megawarc.warc.gz --output-dir expanded/ --verbose --progress

It crashes right after extracting this file:

INFO:warcat.tool:Extracted <urn:uuid:6eebc1d1-cdda-4e1a-b499-184e9681f1e6> to expanded/upcoming.yahoo.com/event/2715307/LA/New-Orleans/The-Louisiana-State-Museum-Jazz-Collection/Louisiana-State-Museum/_index_da39a3

Hope that helps.