ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0
553 stars 77 forks source link

Crash while attempting to write a failed FTP request to WARC #366

Open JustAnotherArchivist opened 7 years ago

JustAnotherArchivist commented 7 years ago

ArchiveBot job 41itk9y0cfjrqlvs9p7oom8qf on pipeline:bebdb3defd488e849cc1736124defc7b crashed with the following trace:

150 ftp://ftp.me.gov.ar/curriform/nap/1ero_natura.pdf
ERROR Fetching ‘ftp://ftp.me.gov.ar/curriform/nap/1ero_natura.pdf’ encountered an error: Connection closed.
ERROR Fatal exception.
Traceback (most recent call last):
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/network/connection.py", line 299, in run_network_operation
    data = yield from asyncio.wait_for(task, wait_timeout)
  File "/usr/lib/python3.5/asyncio/tasks.py", line 393, in wait_for
    return fut.result()
  File "/usr/lib/python3.5/asyncio/futures.py", line 292, in result
    raise self._exception
  File "/usr/lib/python3.5/asyncio/tasks.py", line 241, in _step
    result = coro.throw(exc)
  File "/usr/lib/python3.5/asyncio/streams.py", line 75, in open_connection
    lambda: protocol, host, port, **kwds)
  File "/usr/lib/python3.5/asyncio/base_events.py", line 765, in create_connection
    raise exceptions[0]
  File "/usr/lib/python3.5/asyncio/base_events.py", line 752, in create_connection
    yield from self.sock_connect(sock, address)
  File "/usr/lib/python3.5/asyncio/selector_events.py", line 421, in sock_connect
    return (yield from fut)
  File "/usr/lib/python3.5/asyncio/futures.py", line 379, in __iter__
    yield self  # This tells Task to wait for completion.
  File "/usr/lib/python3.5/asyncio/tasks.py", line 297, in _wakeup
    future.result()
  File "/usr/lib/python3.5/asyncio/futures.py", line 292, in result
    raise self._exception
  File "/usr/lib/python3.5/asyncio/selector_events.py", line 451, in _sock_connect_cb
    raise OSError(err, 'Connect call failed %s' % (address,))
ConnectionRefusedError: [Errno 111] Connect call failed ('168.83.90.10', 21)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/processor/ftp.py", line 265, in _fetch
    response = yield from session.start(request)
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/protocol/ftp/client.py", line 132, in start
    yield from self._prepare_fetch(request, response)
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/protocol/ftp/client.py", line 216, in _prepare_fetch
    yield from self._control_stream.reconnect()
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/protocol/ftp/stream.py", line 100, in reconnect
    yield from self._connection.connect()
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/network/pool.py", line 382, in connect
    yield from self._primary_connection.connect()
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/network/connection.py", line 194, in connect
    name='Connect')
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/network/connection.py", line 327, in run_network_operation
    error.errno, os.strerror(error.errno)) from error
wpull.errors.ConnectionRefused: [Errno 111] Connection refused

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/application/app.py", line 157, in run
    yield from pipeline.process()
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/pipeline/pipeline.py", line 194, in process
    yield from self._process_one_worker()
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/pipeline/pipeline.py", line 215, in _process_one_worker
    task.result()
  File "/usr/lib/python3.5/asyncio/futures.py", line 292, in result
    raise self._exception
  File "/usr/lib/python3.5/asyncio/tasks.py", line 239, in _step
    result = coro.send(None)
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/pipeline/pipeline.py", line 119, in process
    item = yield from self.process_one(_worker_id=worker_id)
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/pipeline/pipeline.py", line 103, in process_one
    yield from task.process(item)
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/application/tasks/download.py", line 492, in process
    yield from session.app_session.factory['Processor'].process(session)
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/processor/delegate.py", line 29, in process
    return (yield from processor.process(item_session))
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/processor/ftp.py", line 100, in process
    return (yield from session.process())
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/processor/ftp.py", line 151, in process
    wait_time = yield from self._fetch(request, is_file)
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/processor/ftp.py", line 292, in _fetch
    response.body, duration_timeout=duration_timeout)
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/protocol/abstract/client.py", line 53, in __exit__
    self.recycle()
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/protocol/ftp/client.py", line 410, in recycle
    connection_closed=self._control_connection.closed()
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/application/hook.py", line 134, in notify
    callback(*args, **kwargs)
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/warc/recorder.py", line 665, in end_control
    hostname, port = self._request_hostname_port()
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/warc/recorder.py", line 714, in _request_hostname_port
    hostname = self._request.address[0]
AttributeError: 'NoneType' object has no attribute 'address'
CRITICAL Sorry, Wpull unexpectedly crashed.
CRITICAL Please report this problem to the authors at Wpull's issue tracker so it may be fixed. If you know how to program, maybe help us fix it? Thank you for helping us help you help us all.

To me, this looks like wpull is trying to write an unsuccessful request (Connection refused) to WARC for some reason. Note that this was a job on an FTP server; perhaps there's a bug in the FTP processor? I've seen refused connections before on HTTP jobs not causing any problems, so it's clearly not a general issue with the handling of those network errors.

(Since this is an ArchiveBot job, I don't know the details of the underlying system, including OS, Python, and wpull version numbers.)

JustAnotherArchivist commented 7 years ago

ArchiveBot job 30rcn1ipdnylllxuvfj1ibw7t on pipeline:cd22d63492adbb142a8db717f7235f26 – also an FTP job – failed with a similar error, but triggered by a timed out DNS request:

125 ftp://ftp3.conae.gov.ar/productos/INPE/INPE_FOG_201706230900.jpg
ERROR Fetching ‘ftp://ftp3.conae.gov.ar/productos/INPE/INPE_FOG_201706230900.jpg’ encountered an error: Read timed out.
ERROR Fatal exception.
Traceback (most recent call last):
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/network/dns.py", line 255, in _query_dns
    answer = yield from event_loop.run_in_executor(None, query)
  File "/usr/lib/python3.4/asyncio/futures.py", line 388, in __iter__
    yield self  # This tells Task to wait for completion.
  File "/usr/lib/python3.4/asyncio/tasks.py", line 286, in _wakeup
    value = future.result()
  File "/usr/lib/python3.4/asyncio/futures.py", line 277, in result
    raise self._exception
  File "/usr/lib/python3.4/concurrent/futures/thread.py", line 54, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/archivebot/.local/lib/python3.4/site-packages/dns/resolver.py", line 1041, in query
    timeout = self._compute_timeout(start)
  File "/home/archivebot/.local/lib/python3.4/site-packages/dns/resolver.py", line 858, in _compute_timeout
    raise Timeout(timeout=duration)
dns.exception.Timeout: The DNS operation timed out after 30.006741285324097 seconds

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/processor/ftp.py", line 265, in _fetch
    response = yield from session.start(request)
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/protocol/ftp/client.py", line 132, in start
    yield from self._prepare_fetch(request, response)
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/protocol/ftp/client.py", line 216, in _prepare_fetch
    yield from self._control_stream.reconnect()
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/protocol/ftp/stream.py", line 100, in reconnect
    yield from self._connection.connect()
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/network/pool.py", line 375, in connect
    result = yield from self._resolver.resolve(self._address[0])
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/network/dns.py", line 206, in resolve
    answer = yield from self._query_dns(host, family)
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/network/dns.py", line 267, in _query_dns
    ) from error
wpull.errors.NetworkError: DNS resolution error: The DNS operation timed out after 30.006741285324097 seconds

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/application/app.py", line 157, in run
    yield from pipeline.process()
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/pipeline/pipeline.py", line 194, in process
    yield from self._process_one_worker()
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/pipeline/pipeline.py", line 215, in _process_one_worker
    task.result()
  File "/usr/lib/python3.4/asyncio/futures.py", line 277, in result
    raise self._exception
  File "/usr/lib/python3.4/asyncio/tasks.py", line 233, in _step
    result = coro.throw(exc)
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/pipeline/pipeline.py", line 119, in process
    item = yield from self.process_one(_worker_id=worker_id)
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/pipeline/pipeline.py", line 103, in process_one
    yield from task.process(item)
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/application/tasks/download.py", line 492, in process
    yield from session.app_session.factory['Processor'].process(session)
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/processor/delegate.py", line 29, in process
    return (yield from processor.process(item_session))
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/processor/ftp.py", line 100, in process
    return (yield from session.process())
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/processor/ftp.py", line 151, in process
    wait_time = yield from self._fetch(request, is_file)
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/processor/ftp.py", line 292, in _fetch
    response.body, duration_timeout=duration_timeout)
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/protocol/abstract/client.py", line 53, in __exit__
    self.recycle()
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/protocol/ftp/client.py", line 410, in recycle
    connection_closed=self._control_connection.closed()
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/application/hook.py", line 134, in notify
    callback(*args, **kwargs)
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/warc/recorder.py", line 665, in end_control
    hostname, port = self._request_hostname_port()
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/warc/recorder.py", line 714, in _request_hostname_port
    hostname = self._request.address[0]
AttributeError: 'NoneType' object has no attribute 'address'
CRITICAL Sorry, Wpull unexpectedly crashed.
CRITICAL Please report this problem to the authors at Wpull's issue tracker so it may be fixed. If you know how to program, maybe help us fix it? Thank you for helping us help you help us all.
Pokechu22 commented 1 year ago

The basic issue here is that in FTPWARCRecorderSession, self._request is None because begin_control was never called but end_control was called (from recycle). This is easy to reproduce by trying to connect to a nonexistent FTP server, e.g. wpull ftp://www.example.com/ --warc-file www.example.com.

Pokechu22 commented 1 year ago

This patch seems to fix it:

diff --git a/wpull/warc/recorder.py b/wpull/warc/recorder.py
index 3d8c3dc..342bb0f 100644
--- a/wpull/warc/recorder.py
+++ b/wpull/warc/recorder.py
@@ -662,6 +662,9 @@ class FTPWARCRecorderSession(BaseWARCRecorderSession):
         )

     def end_control(self, response: FTPResponse, connection_closed=False):
+        if self._request is None:
+            return
+
         hostname, port = self._request_hostname_port()

         if connection_closed:

but it's fairly jank (especially since there is/should be a valid request in recycle()).