ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0

ValueError: Field missing colon. #455

Open JustAnotherArchivist opened 4 years ago

JustAnotherArchivist commented 4 years ago

ArchiveBot job cvzb7ihnhmp19g5bkylnowrr3 (Python 3.6.10, wpull 2.0.3) crashed a few months ago with this traceback:

2020-01-26 21:49:49,449 - wpull.application.app - ERROR - Fatal exception.
Traceback (most recent call last):
  File "/home/archivebot/.pyenv/versions/3.6.10/envs/archivebot-3.6.10/lib/python3.6/site-packages/wpull/application/app.py", line 157, in run
    yield from pipeline.process()
  File "/home/archivebot/.pyenv/versions/3.6.10/envs/archivebot-3.6.10/lib/python3.6/site-packages/wpull/pipeline/pipeline.py", line 194, in process
    yield from self._process_one_worker()
  File "/home/archivebot/.pyenv/versions/3.6.10/envs/archivebot-3.6.10/lib/python3.6/site-packages/wpull/pipeline/pipeline.py", line 215, in _process_one_worker
    task.result()
  File "/home/archivebot/.pyenv/versions/3.6.10/envs/archivebot-3.6.10/lib/python3.6/site-packages/wpull/pipeline/pipeline.py", line 119, in process
    item = yield from self.process_one(_worker_id=worker_id)
  File "/home/archivebot/.pyenv/versions/3.6.10/envs/archivebot-3.6.10/lib/python3.6/site-packages/wpull/pipeline/pipeline.py", line 103, in process_one
    yield from task.process(item)
  File "/home/archivebot/.pyenv/versions/3.6.10/envs/archivebot-3.6.10/lib/python3.6/site-packages/wpull/application/tasks/download.py", line 492, in process
    yield from session.app_session.factory['Processor'].process(session)
  File "/home/archivebot/.pyenv/versions/3.6.10/envs/archivebot-3.6.10/lib/python3.6/site-packages/wpull/processor/delegate.py", line 29, in process
    return (yield from processor.process(item_session))
  File "/home/archivebot/.pyenv/versions/3.6.10/envs/archivebot-3.6.10/lib/python3.6/site-packages/wpull/processor/web.py", line 92, in process
    return (yield from session.process())
  File "/home/archivebot/.pyenv/versions/3.6.10/envs/archivebot-3.6.10/lib/python3.6/site-packages/wpull/processor/web.py", line 186, in process
    yield from self._process_loop()
  File "/home/archivebot/.pyenv/versions/3.6.10/envs/archivebot-3.6.10/lib/python3.6/site-packages/wpull/processor/web.py", line 245, in _process_loop
    exit_early, wait_time = yield from self._fetch_one(cast(Request, self._item_session.request))
  File "/home/archivebot/.pyenv/versions/3.6.10/envs/archivebot-3.6.10/lib/python3.6/site-packages/wpull/processor/web.py", line 287, in _fetch_one
    duration_timeout=self._fetch_rule.duration_timeout
  File "/home/archivebot/.pyenv/versions/3.6.10/envs/archivebot-3.6.10/lib/python3.6/site-packages/wpull/protocol/http/web.py", line 131, in download
    self._current_session.download(file, duration_timeout=duration_timeout)
  File "/home/archivebot/.pyenv/versions/3.6.10/envs/archivebot-3.6.10/lib/python3.6/site-packages/wpull/protocol/http/client.py", line 154, in download
    yield from asyncio.wait_for(read_future, timeout=duration_timeout)
  File "/home/archivebot/.pyenv/versions/3.6.10/lib/python3.6/asyncio/tasks.py", line 358, in wait_for
    return fut.result()
  File "/home/archivebot/.pyenv/versions/3.6.10/envs/archivebot-3.6.10/lib/python3.6/site-packages/wpull/protocol/abstract/stream.py", line 17, in wrapper
    return (yield from func(self, *args, **kwargs))
  File "/home/archivebot/.pyenv/versions/3.6.10/envs/archivebot-3.6.10/lib/python3.6/site-packages/wpull/protocol/http/stream.py", line 200, in read_body
    yield from self._read_body_by_chunk(response, file, raw=raw)
  File "/home/archivebot/.pyenv/versions/3.6.10/envs/archivebot-3.6.10/lib/python3.6/site-packages/wpull/protocol/http/stream.py", line 365, in _read_body_by_chunk
    response.fields.parse(trailer_data)
  File "/home/archivebot/.pyenv/versions/3.6.10/envs/archivebot-3.6.10/lib/python3.6/site-packages/wpull/namevalue.py", line 52, in parse
    raise ValueError('Field missing colon.')
ValueError: Field missing colon.

This must have been caused by one of these URLs:

$ sqlite3 wpull.db 'SELECT queued_urls.*, url_strings.* FROM queued_urls JOIN url_strings ON url_string_id = url_strings.id WHERE status = "in_progress"'
88471|88471|60386|1|in_progress|0|6|1|media|0||||88471|http://www.tj-summerdavos.cn/2016/english/images/dws_Enhg_1404.jpg
88472|88472|60386|1|in_progress|0|6|1|media|0||||88472|http://www.tj-summerdavos.cn/2016/english/images/dws_En_038.jpg
88473|88473|60386|1|in_progress|0|6|1|media|0||||88473|http://www.tj-summerdavos.cn/2016/english/images/grad_left.png
88474|88474|60386|1|in_progress|0|6|1||0||||88474|http://www.enorth.com.cn/sys/online_calc.js?ver=1
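
The same lookup can be scripted. This is only a rough sketch using Python's standard sqlite3 module; it reuses the query above as-is and assumes, based on the output shown, that the URL string is the last column of each row. The function name in_progress_urls is made up for illustration.

```python
import sqlite3

# Same join as the sqlite3 command above, with the status passed as a parameter.
QUERY = """
SELECT queued_urls.*, url_strings.*
FROM queued_urls
JOIN url_strings ON url_string_id = url_strings.id
WHERE status = ?
"""


def in_progress_urls(connection):
    """Return the URL of every row whose status is 'in_progress'.

    Assumption: the URL string is the last column of the joined row,
    as in the output shown above.
    """
    rows = connection.execute(QUERY, ("in_progress",)).fetchall()
    return [row[-1] for row in rows]
```

Calling in_progress_urls(sqlite3.connect('wpull.db')) against the job's database should list the same four URLs as above.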

I'm unable to reproduce it, but it's been a few months since the crash. The records in the WARC don't seem suspicious, but the offending response was likely never written to WARC anyway; the only remaining tmp-warcsesresp file is empty.

The traceback suggests that something made wpull think there would be trailers: _read_body_by_chunk passed trailer data to response.fields.parse, which rejected a line without a colon. It should be possible to construct a test case for this. Either way, the error should be caught and handled rather than crashing the entire process.
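
As a starting point for such a test case, here is a minimal sketch that does not use wpull itself. It builds a chunked HTTP response whose trailer section contains a line without a colon, then shows the kind of handling that would avoid the crash. All names here (build_chunked_response, parse_fields_strict, parse_trailer_safely) are hypothetical; parse_fields_strict is only a stand-in that mimics the error message from the traceback, not wpull's actual parser.

```python
def build_chunked_response(trailer_lines):
    """Return raw bytes of a chunked HTTP/1.1 response with the given trailer lines."""
    head = (
        b"HTTP/1.1 200 OK\r\n"
        b"Transfer-Encoding: chunked\r\n"
        b"\r\n"
    )
    # One 5-byte chunk followed by the zero-length last chunk.
    body = b"5\r\nhello\r\n" + b"0\r\n"
    trailer = b"".join(line + b"\r\n" for line in trailer_lines)
    # A blank line terminates the trailer section.
    return head + body + trailer + b"\r\n"


def parse_fields_strict(data):
    """Stand-in strict parser: raises ValueError on a line without a colon."""
    fields = {}
    for line in data.decode("latin-1").splitlines():
        if not line:
            continue
        name, sep, value = line.partition(":")
        if not sep:
            raise ValueError("Field missing colon.")
        fields[name.strip()] = value.strip()
    return fields


def parse_trailer_safely(data):
    """Parse trailer data, returning (fields, error) instead of raising."""
    try:
        return parse_fields_strict(data), None
    except ValueError as error:
        # Malformed trailers are reported to the caller, not fatal.
        return {}, error


# A response whose trailer is not a valid "Name: value" field.
raw = build_chunked_response([b"this line has no colon"])
# Everything after the zero-length last chunk is trailer data.
trailer_data = raw.partition(b"\r\n0\r\n")[2]
fields, error = parse_trailer_safely(trailer_data)
```

Serving the bytes from build_chunked_response from a local test server would be one way to check whether wpull crashes on such a response; whether this matches what the original server sent is unknown.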