ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0

ValueError: Invalid IPv6 URL #405

Open JustAnotherArchivist opened 5 years ago

JustAnotherArchivist commented 5 years ago

ArchiveBot job 8fs0q596mf8jdoqi5vx8d2jpe on pipeline:6435d72104c55fe0a5113186c6ea64d8 just crashed with the following traceback:

ERROR Fatal exception.
Traceback (most recent call last):
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/application/app.py", line 157, in run
    yield from pipeline.process()
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/pipeline/pipeline.py", line 194, in process
    yield from self._process_one_worker()
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/pipeline/pipeline.py", line 215, in _process_one_worker
    task.result()
  File "/usr/lib/python3.5/asyncio/futures.py", line 274, in result
    raise self._exception
  File "/usr/lib/python3.5/asyncio/tasks.py", line 239, in _step
    result = coro.send(None)
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/pipeline/pipeline.py", line 119, in process
    item = yield from self.process_one(_worker_id=worker_id)
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/pipeline/pipeline.py", line 103, in process_one
    yield from task.process(item)
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/application/tasks/download.py", line 492, in process
    yield from session.app_session.factory['Processor'].process(session)
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/processor/delegate.py", line 29, in process
    return (yield from processor.process(item_session))
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/processor/web.py", line 92, in process
    return (yield from session.process())
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/processor/web.py", line 182, in process
    self._new_initial_request()
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/protocol/http/web.py", line 304, in session
    cookie_jar=self._cookie_jar,
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/protocol/http/web.py", line 54, in __init__
    self._add_cookies(self._next_request)
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/protocol/http/web.py", line 195, in _add_cookies
    request, self._get_cookie_referrer_host()
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/cookiewrapper.py", line 79, in add_cookie_header
    new_request = convert_http_request(request, referrer_host)
  File "/home/archivebot/.local/lib/python3.5/site-packages/wpull/cookiewrapper.py", line 25, in convert_http_request
    origin_req_host=referrer_host,
  File "/usr/lib/python3.5/urllib/request.py", line 278, in __init__
    origin_req_host = request_host(self)
  File "/usr/lib/python3.5/urllib/request.py", line 256, in request_host
    host = urlparse(url)[1]
  File "/usr/lib/python3.5/urllib/parse.py", line 295, in urlparse
    splitresult = urlsplit(url, scheme, allow_fragments)
  File "/usr/lib/python3.5/urllib/parse.py", line 345, in urlsplit
    raise ValueError("Invalid IPv6 URL")
ValueError: Invalid IPv6 URL
CRITICAL Sorry, Wpull unexpectedly crashed.
CRITICAL Please report this problem to the authors at Wpull's issue tracker so it may be fixed. If you know how to program, maybe help us fix it? Thank you for helping us help you help us all.

There are a few previous issues with the same error (#77, #121, #197), all of which were fixed.

This URL caused the crash:

http://[mailto:CSS-Scripts@phusewiki.org/?Subject=Join%20Working%20Group%20CSS-Scripts%20(at)%20phusewiki.org%5D

18750094817 commented 4 years ago

How can this problem be solved?

JustAnotherArchivist commented 4 years ago

@18750094817 There is currently no workaround for this. It will likely have to be fixed the same way as the older issues I linked, by catching the exception, logging a warning, and ignoring the URL.
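
The catch, warn, and ignore approach described above could look roughly like the following. This is only a sketch, not wpull's actual code: the helper name `split_or_skip` and the logger are hypothetical, and where such a guard would belong in wpull (for example around the cookie handling seen in the traceback) is left open.

```python
import logging
from urllib.parse import urlsplit

# Hypothetical module-level logger; wpull's real logging setup may differ.
_logger = logging.getLogger(__name__)


def split_or_skip(url):
    """Return urlsplit(url), or None if the URL cannot be parsed.

    Hypothetical helper for illustration only.
    """
    try:
        return urlsplit(url)
    except ValueError as error:
        # urlsplit raises ValueError("Invalid IPv6 URL") for URLs like the
        # one reported in this issue; log it and let the caller skip the URL.
        _logger.warning("Skipping unparseable URL %r: %s", url, error)
        return None


if __name__ == "__main__":
    bad = "http://[mailto:CSS-Scripts@phusewiki.org/?Subject=Join"
    for candidate in (bad, "http://example.org/page"):
        parts = split_or_skip(candidate)
        if parts is None:
            continue  # ignore the URL instead of crashing the whole job
        print(parts.netloc)
```

A caller would treat a `None` result as "do not fetch this URL" and move on to the next item.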

JustAnotherArchivist commented 1 month ago

Job cu727tfuzmoxoetu7bsjpux3o crashed today with the same traceback but a weirder URL:

http://us%40er:p[ass@example.org-expected76-uri_class/

This shouldn't trigger IPv6 parsing to begin with, so it may be an upstream bug as well (unless it has been fixed since Python 3.6).
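
To check how a given Python version behaves, both URLs from this issue can be passed directly to `urllib.parse.urlsplit`. The sketch below does that. My understanding, which is an assumption based on reading CPython's source rather than anything established in this thread, is that the error is raised whenever the network location contains a `[` without a matching `]` (or the reverse), wherever the bracket sits. That would explain why a bracket in the password part is enough. The helper name `raises_value_error` is made up for this example.

```python
from urllib.parse import urlsplit


def raises_value_error(url):
    """Return True if urlsplit rejects the URL with a ValueError."""
    try:
        urlsplit(url)
    except ValueError:
        return True
    return False


# The two URLs reported in this issue.
urls = [
    "http://[mailto:CSS-Scripts@phusewiki.org/?Subject=Join%20Working"
    "%20Group%20CSS-Scripts%20(at)%20phusewiki.org%5D",
    "http://us%40er:p[ass@example.org-expected76-uri_class/",
]

if __name__ == "__main__":
    for url in urls:
        print(raises_value_error(url), url)
    # A well-formed bracketed IPv6 host, for comparison.
    print(raises_value_error("http://[::1]/"), "http://[::1]/")
```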