PromyLOPh / crocoite

Web archiving using Google Chrome
https://6xq.net/crocoite/
MIT License

behavior: Ignore invalid URLs when extracting #18

Closed PromyLOPh closed 5 years ago

PromyLOPh commented 5 years ago

Invalid URLs should be skipped when extracting links. Otherwise a single malformed URL makes the whole grab fail:

Traceback (most recent call last):
  File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/yarl/__init__.py", line 659, in _encode_host
    ip = ip_address(ip)
  File "/usr/lib64/python3.6/ipaddress.py", line 54, in ip_address
    address)
ValueError: 'neue_preise_f' does not appear to be an IPv4 or IPv6 address

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/yarl/__init__.py", line 662, in _encode_host
    host = idna.encode(host, uts46=True).decode("ascii")
  File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/idna/core.py", line 358, in encode
    s = alabel(label)
  File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/idna/core.py", line 270, in alabel
    ulabel(label)
  File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/idna/core.py", line 304, in ulabel
    check_label(label)
  File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/idna/core.py", line 261, in check_label
    raise InvalidCodepoint('Codepoint {0} at position {1} of {2} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
idna.core.InvalidCodepoint: Codepoint U+005F at position 5 of 'neue_preise_f%c3%bcr_zahnimplantate_k%c3%b6nnten_sie_%c3%bcberraschen' not allowed

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/data/home/chromebot/crocoite-sandbox/lib64/python3.6/encodings/idna.py", line 167, in encode
    raise UnicodeError("label too long")
UnicodeError: label too long

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/data/home/chromebot/crocoite/crocoite/cli.py", line 102, in single
    loop.run_until_complete(run)
  File "/usr/lib64/python3.6/asyncio/base_events.py", line 468, in run_until_complete
    return future.result()
  File "/data/home/chromebot/crocoite/crocoite/controller.py", line 223, in run
    async for item in b.onfinish ():
  File "/data/home/chromebot/crocoite/crocoite/behavior.py", line 351, in onfinish
    yield ExtractLinksEvent (list (set (map (URL, result['result']['value']))))
  File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/yarl/__init__.py", line 168, in __new__
    val.username, val.password, host, port, encode=True
  File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/yarl/__init__.py", line 676, in _make_netloc
    ret = cls._encode_host(host)
  File "/data/home/chromebot/crocoite-sandbox/lib/python3.6/site-packages/yarl/__init__.py", line 664, in _encode_host
    host = host.encode("idna").decode("ascii")
UnicodeError: encoding with 'idna' codec failed (UnicodeError: label too long)
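
A minimal sketch of the behaviour the title asks for, not necessarily how crocoite fixed it. It assumes the extracted links arrive as a list of strings, as `result['result']['value']` does in the traceback, and that the URL constructor signals a bad URL with `ValueError` (`UnicodeError`, the final exception above, is a subclass of it). The names `parse_strict` and `valid_urls` are hypothetical. `parse_strict` is only a stdlib stand-in for a strict constructor such as `yarl.URL`; it reproduces the "label too long" failure through Python's built-in `idna` codec. The `http://` scheme on the bad URL is also an assumption, since the traceback shows only the host.

```python
from urllib.parse import urlsplit


def parse_strict(raw):
    """Hypothetical stand-in for a strict URL constructor such as yarl.URL.

    Raises UnicodeError (a ValueError subclass) if the host cannot be
    IDNA-encoded, e.g. because a label is longer than 63 characters.
    """
    host = urlsplit(raw).hostname or ""
    host.encode("idna")  # raises UnicodeError for an over-long label
    return raw


def valid_urls(raw_urls, parse=parse_strict):
    """Return the set of URLs that parse, silently dropping invalid ones."""
    urls = set()
    for raw in raw_urls:
        try:
            urls.add(parse(raw))
        except ValueError:
            # One malformed link must not abort the whole grab.
            continue
    return urls


good = "https://6xq.net/crocoite/"
# Host taken from the traceback above; the scheme is assumed.
bad = ("http://neue_preise_f%c3%bcr_zahnimplantate_k%c3%b6nnten_sie_"
       "%c3%bcberraschen/")
links = valid_urls([good, bad, good])
```

In `behavior.py` the same idea would replace the bare `map (URL, ...)` call with a loop of this kind, passing the real constructor as `parse`. Logging each skipped URL instead of dropping it silently would make such pages easier to debug.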