ArchiveTeam / wpull

Wget-compatible web downloader and crawler.
GNU General Public License v3.0
555 stars 77 forks source link

Crash while resolving a hostname #365

Open JustAnotherArchivist opened 7 years ago

JustAnotherArchivist commented 7 years ago

Earlier today, an ArchiveBot job of mine (ID bsrj06s1dk5fv0nepd9ke1clw) crashed in the DNS handling with the following stacktrace:

ERROR Fatal exception.
Traceback (most recent call last):
  File "/home/archivebot/.local/lib/python3.4/site-packages/dns/inet.py", line 104, in is_multicast
    first = ord(dns.ipv4.inet_aton(text)[0])
TypeError: ord() expected string of length 1, but int found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/archivebot/.local/lib/python3.4/site-packages/dns/inet.py", line 108, in is_multicast
    first = ord(dns.ipv6.inet_aton(text)[0])
  File "/home/archivebot/.local/lib/python3.4/site-packages/dns/ipv6.py", line 153, in inet_aton
    raise dns.exception.SyntaxError
dns.exception.SyntaxError: Text input is malformed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/application/app.py", line 157, in run
    yield from pipeline.process()
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/pipeline/pipeline.py", line 194, in process
    yield from self._process_one_worker()
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/pipeline/pipeline.py", line 215, in _process_one_worker
    task.result()
  File "/usr/local/lib/python3.4/asyncio/futures.py", line 277, in result
    raise self._exception
  File "/usr/local/lib/python3.4/asyncio/tasks.py", line 232, in _step
    result = coro.throw(exc)
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/pipeline/pipeline.py", line 119, in process
    item = yield from self.process_one(_worker_id=worker_id)
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/pipeline/pipeline.py", line 103, in process_one
    yield from task.process(item)
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/application/tasks/download.py", line 492, in process
    yield from session.app_session.factory['Processor'].process(session)
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/processor/delegate.py", line 29, in process
    return (yield from processor.process(item_session))
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/processor/web.py", line 92, in process
    return (yield from session.process())
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/processor/web.py", line 186, in process
    yield from self._process_loop()
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/processor/web.py", line 245, in _process_loop
    exit_early, wait_time = yield from self._fetch_one(cast(Request, self._item_session.request))
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/processor/web.py", line 268, in _fetch_one
    response = yield from self._web_client_session.start()
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/protocol/http/web.py", line 107, in start
    response = yield from session.start(request)
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/protocol/http/client.py", line 87, in start
    yield from self._stream.reconnect()
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/protocol/http/stream.py", line 438, in reconnect
    yield from self._connection.connect()
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/network/pool.py", line 375, in connect
    result = yield from self._resolver.resolve(self._address[0])
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/network/dns.py", line 206, in resolve
    answer = yield from self._query_dns(host, family)
  File "/home/archivebot/.local/lib/python3.4/site-packages/wpull/network/dns.py", line 255, in _query_dns
    answer = yield from event_loop.run_in_executor(None, query)
  File "/usr/local/lib/python3.4/asyncio/futures.py", line 388, in __iter__
    yield self  # This tells Task to wait for completion.
  File "/usr/local/lib/python3.4/asyncio/tasks.py", line 285, in _wakeup
    value = future.result()
  File "/usr/local/lib/python3.4/asyncio/futures.py", line 277, in result
    raise self._exception
  File "/usr/local/lib/python3.4/concurrent/futures/thread.py", line 54, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/archivebot/.local/lib/python3.4/site-packages/dns/resolver.py", line 962, in query
    source_port=source_port)
  File "/home/archivebot/.local/lib/python3.4/site-packages/dns/query.py", line 243, in udp
    (dns.inet.is_multicast(where) and
  File "/home/archivebot/.local/lib/python3.4/site-packages/dns/inet.py", line 111, in is_multicast
    raise ValueError
ValueError
CRITICAL Sorry, Wpull unexpectedly crashed.
CRITICAL Please report this problem to the authors at Wpull's issue tracker so it may be fixed. If you know how to program, maybe help us fix it? Thank you for helping us help you help us all.

Unfortunately, I don't know the exact domain name or other circumstances which caused this crash as ArchiveBot's dashboard isn't exactly verbose. Perhaps the person running the corresponding pipeline could provide more details.

Asparagirl commented 7 years ago

What pipeline was it on?

JustAnotherArchivist commented 7 years ago

@Asparagirl Pipeline b77f66d2bba045105106122324b1b164

JustAnotherArchivist commented 7 years ago

Cadbury gave me the database of this job. Let me know if you want it (141 MiB compressed). I don't see anything suspicious in there:

> sqlite3 wpull.db 'SELECT queued_urls.*, url_strings.* FROM queued_urls JOIN url_strings ON queued_urls.url_string_id = url_strings.id;' | grep -C 5 '|in_progress|'
947930|947930|285306|1|done|1|5|1|media|0||200||947930|http://img.wennermedia.com/featured-promo-320/sessions-0b482422-9dbd-4cde-b464-33871d4c80c8.jpg
947931|947931|285306|1|done|1|5|1|media|0||200||947931|http://img.wennermedia.com/featured-promo-724/trumpcare-republicans-dont-want-you-to-see-a3cfb2d9-c520-4619-9a71-c7f93db7afa2.jpg
947932|947932|285306|1|done|1|5|1|media|0||200||947932|http://img.wennermedia.com/featured-promo-620/trumpcare-republicans-dont-want-you-to-see-a3cfb2d9-c520-4619-9a71-c7f93db7afa2.jpg
947933|947933|285306|1|done|1|5|1|media|0||200||947933|http://img.wennermedia.com/featured-promo-320/gettyimages-693753868-db35b383-097d-4a21-b2d3-3ce87767d94f.jpg
947934|947934|285306|1|done|1|5|1|media|0||200||947934|http://img.wennermedia.com/featured-promo-724/gettyimages-677867828-011000ae-d3d3-4282-a3c0-90e65948c103.jpg
947935|947935|285306|1|in_progress|0|5|1|media|0||||947935|http://img.wennermedia.com/featured-promo-620/gettyimages-694983590-08cae9bc-3238-4ead-a145-2b7db5aadc9d.jpg
947936|947936|285306|1|in_progress|0|5|1|media|0||||947936|http://img.wennermedia.com/featured-promo-320/rs-nsa-dae36da1-5b2a-412b-94c0-626a189aa691.jpg
947937|947937|285306|1|in_progress|0|5|1|media|0||||947937|http://img.wennermedia.com/featured-promo-724/gettyimages-693771966-16c8c921-aa31-4ac3-a2c1-f44bb8b1497d.jpg
947938|947938|285306|1|in_progress|0|5|1|media|0||||947938|http://img.wennermedia.com/homepage-horizontal-620/rachel-maddow-interview-rolling-stone-c863d19f-a37a-4028-88b7-6126445073fc.jpg
947939|947939|285306|1|todo|0|5|1|media|0||||947939|http://img.wennermedia.com/featured-promo-724/rtx3964z-4d14db17-d3bd-444e-a8d2-772bf39a9b2b.jpg
947940|947940|285306|1|todo|0|5|1|media|0||||947940|http://img.wennermedia.com/featured-promo-620/rtx3964z-4d14db17-d3bd-444e-a8d2-772bf39a9b2b.jpg
947941|947941|285306|1|todo|0|5|1|media|0||||947941|http://img.wennermedia.com/featured-promo-320/gettyimages-693790210-4bdf250e-1e3e-412f-be91-783ddddbcca2.jpg
947942|947942|285306|1|todo|0|5|1|media|0||||947942|http://img.wennermedia.com/featured-promo-320/comey-hearing-4e566ee5-7349-47c5-b181-cb7abde2e02c.jpg
947943|947943|285306|1|todo|0|5|1|media|0||||947943|http://img.wennermedia.com/featured-promo-320/gettyimages-693771966-16c8c921-aa31-4ac3-a2c1-f44bb8b1497d.jpg

The pipeline uses Google DNS (8.8.4.4 and 8.8.8.8), by the way.

JustAnotherArchivist commented 7 years ago

ArchiveBot job 5e1ikgmizt88z0z4m5n1w7d8r crashed today (I think) on the same pipeline with the identical stack trace. Here's the queue from around the crash (same command as the previous comment; thanks, Cadbury):

5252132|5252132|1116069|1|skipped|1|6|||0||||5252132|http://pandce.proboards.com/post/10995715/quote/510257
5252133|5252133|1116069|1|done|1|6|||0||200||5252133|http://pandce.proboards.com/post/10995723
5252134|5252134|1116069|1|done|1|6|||0||200||5252134|http://pandce.proboards.com/post/10995804
5252135|5252135|1116069|1|done|1|6|||0||200||5252135|http://pandce.proboards.com/post/10995670
5252136|5252136|1116069|1|skipped|1|6|||0||||5252136|http://pandce.proboards.com/post/10995652/quote/510257
5252137|5252137|1116069|1|in_progress|0|6|||0||||5252137|http://pandce.proboards.com/post/11003291
5252138|5252138|1116067|1|in_progress|0|6|||0||||5252138|http://www.amazon.com/The-Crossing-Washington-American-evolution/dp/0439691869
5252139|5252139|1116067|1|in_progress|0|6|||0||||5252139|http://pandce.proboards.com/post/11054800/quote/512528
5252140|5252140|1116067|1|in_progress|0|6|||0||||5252140|http://pandce.proboards.com/post/11057539/quote/512528
5252141|5252141|1116067|1|todo|0|6|||0||||5252141|http://pandce.proboards.com/post/11056257
5252142|5252142|1116067|1|todo|0|6|||0||||5252142|http://pandce.proboards.com/post/11055209/quote/512528
5252143|5252143|1116067|1|todo|0|6|||0||||5252143|http://www.amazon.com/My-Dearest-Friend-Letters-Abigail/dp/0674057058
5252144|5252144|1116067|1|todo|0|6|||0||||5252144|http://pandce.proboards.com/post/11054717/quote/512528
5252145|5252145|1116067|1|todo|0|6|||0||||5252145|http://ads.proboards.com/ad.pl?as=300x250&ap=BTF&f=3435524&s=0&fc=7&d=pandce.proboards.com&g=&a=&cb=0.167331154137379&uid=0&name=desktop_bot_1

Again, this doesn't look suspicious to me. Certainly not a malformed URL or a particular broken host.

JustAnotherArchivist commented 6 years ago

Just happened again this week on one of my pipelines (which have their own resolver, i.e. query the authorative DNS servers) on job c5b3srgf4agspk0xmmat3jwvg. Nothing suspicious in the database at all; it was just retrieving forums.strainhunters.com URLs during the crash.

JustAnotherArchivist commented 6 years ago

After we had several of these crashes on Cadbury's pipeline in the past few days, I just looked at the source code a bit to try and figure out what's going wrong. Well, I don't have the slightest clue. The crash occurs while dnspython is trying to determine whether the nameserver IP is a multicast address. dnspython reads a list of nameservers from /etc/resolv.conf when the dns.resolver.Resolver object is created (during the startup of wpull). The only other modifications to that list of nameservers is shuffling it around (if that is enabled using something like options rotate in resolv.conf) and removing broken nameservers.

Then I looked at the lower level, specifically dns.ipv4.inet_aton and its use in dns.inet.is_multicast, and now I don't understand anything anymore. inet_aton returns a bytes object of length four (e.g. b'\x7f\x00\x00\x01' for localhost). Taking the first index of that will always return an int (127, in this case). So by that logic, it should always raise a TypeError (also in the ipv6 code). I'm probably missing something obvious; in my defence, it's past 4:00 here and I should've gone to bed hours ago. That code does look awful though with the generic except Exception.

Anyway, I feel like we need to capture the ValueError somewhere deep in dnspython and then raise our own exception which includes the value of the nameserver variable (and probably also some other variables from that context) if we want to figure out what the hell is going on. Hooking into dns.query.udp directly via monkey-patching is probably the easiest way for this.

I also found an issue on the dnspython repository with the same traceback: https://github.com/rthalley/dnspython/issues/302 . Unfortunately, the reporter there had the same experience as we do – the issue isn't reproducible. (NB, another issue on that repository, https://github.com/rthalley/dnspython/issues/235, is unrelated. Note that the traceback includes receive_udp, i.e. it occurs on parsing the response, not the nameserver address.)

JustAnotherArchivist commented 6 years ago

I've looked into this again. Specifically, I hammered wpull.network.dns.Resolver with requests until it broke. I monkey-patched dns.query.udp to print the parameters when the ValueError is raised. I used domains from a job that crashed on an ArchiveBot pipeline recently and also ran the script on that same pipeline, though I don't think it matters. The script can be found here. I used Python 3.6.5.

It didn't take long for that script to crash with the same error. However, the output doesn't look suspicious at all:

...
Resolving en.wikipedia.org
Resolving www.homepower.com
Resolving www.legis.ga.gov
(<DNS message, ID 27443>, '8.8.8.8', 29.999916315078735, 53) {'source': None, 'source_port': 0}
b'k3\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00\x03www\x05legis\x02ga\x03gov\x00\x00\x01\x00\x01'
Traceback (most recent call last):
  File "/home/archivebot/.local/lib/python3.6/site-packages/dns/inet.py", line 104, in is_multicast
    first = ord(dns.ipv4.inet_aton(text)[0])
TypeError: ord() expected string of length 1, but int found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/archivebot/.local/lib/python3.6/site-packages/dns/inet.py", line 108, in is_multicast
    first = ord(dns.ipv6.inet_aton(text)[0])
  File "/home/archivebot/.local/lib/python3.6/site-packages/dns/ipv6.py", line 153, in inet_aton
    raise dns.exception.SyntaxError
dns.exception.SyntaxError: Text input is malformed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "debugdns.py", line 148, in <module>
    loop.run_until_complete(main(resolver, hosts))
  File "/usr/lib/python3.6/asyncio/base_events.py", line 468, in run_until_complete
    return future.result()
  File "debugdns.py", line 143, in main
    await resolver.resolve(host)
  File "/home/archivebot/.local/lib/python3.6/site-packages/wpull/network/dns.py", line 206, in resolve
    answer = yield from self._query_dns(host, family)
  File "/home/archivebot/.local/lib/python3.6/site-packages/wpull/network/dns.py", line 255, in _query_dns
    answer = yield from event_loop.run_in_executor(None, query)
  File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/archivebot/.local/lib/python3.6/site-packages/dns/resolver.py", line 962, in query
    source_port=source_port)
  File "debugdns.py", line 11, in udp
    return orig_dns_query_udp(message, *args, **kwargs)
  File "/home/archivebot/.local/lib/python3.6/site-packages/dns/query.py", line 243, in udp
    (dns.inet.is_multicast(where) and
  File "/home/archivebot/.local/lib/python3.6/site-packages/dns/inet.py", line 111, in is_multicast
    raise ValueError
ValueError

This at least confirms that the nameserver (where inside dns.query.udp) value is fine.

I then realised that dns.query.udp only actually calls is_multicast in some cases, namely when _addresses_equal(af, from_address, destination) is False (short-circuiting in this if condition). As the name suggests, that function compares whether from_address (i.e. the address from which the response packet came) is equal to destination (i.e. where the query was sent). So is_multicast is only used if the response packet source doesn't match the query packet destination.

I then had a look at is_multicast and dns.ipv4.inet_aton. My interpretation above was completely correct; is_multicast will always fail, at least on Python 3.

It seems that this issue has been fixed since on the master branch of dnspython. The current implementation of is_multicast (specifically, the maybe_ord function defined in dns._compat) calls ord only on Python 2. There is however no release of dnspython yet with this fix.

JustAnotherArchivist commented 6 years ago

Here's a script to test whether dnspython handles such packets correctly. Since it patches some internal methods of dns.query, it'll likely break at some point and not work with old versions, but it definitely supports both the most recent release (1.15.0 from 2016) and the current master branch (commit 4e2a2a59). 1.15.0 causes the ValueError, patched dns.inet.is_multicast and 4e2a2a59 pass (i.e. time out).

JustAnotherArchivist commented 5 years ago

This should be fixed with dnspython version 1.16.0, but I haven't verified that yet.