Open JustAnotherArchivist opened 7 years ago
What pipeline was it on?
@Asparagirl Pipeline b77f66d2bba045105106122324b1b164
Cadbury gave me the database of this job. Let me know if you want it (141 MiB compressed). I don't see anything suspicious in there:
> sqlite3 wpull.db 'SELECT queued_urls.*, url_strings.* FROM queued_urls JOIN url_strings ON queued_urls.url_string_id = url_strings.id;' | grep -C 5 '|in_progress|'
947930|947930|285306|1|done|1|5|1|media|0||200||947930|http://img.wennermedia.com/featured-promo-320/sessions-0b482422-9dbd-4cde-b464-33871d4c80c8.jpg
947931|947931|285306|1|done|1|5|1|media|0||200||947931|http://img.wennermedia.com/featured-promo-724/trumpcare-republicans-dont-want-you-to-see-a3cfb2d9-c520-4619-9a71-c7f93db7afa2.jpg
947932|947932|285306|1|done|1|5|1|media|0||200||947932|http://img.wennermedia.com/featured-promo-620/trumpcare-republicans-dont-want-you-to-see-a3cfb2d9-c520-4619-9a71-c7f93db7afa2.jpg
947933|947933|285306|1|done|1|5|1|media|0||200||947933|http://img.wennermedia.com/featured-promo-320/gettyimages-693753868-db35b383-097d-4a21-b2d3-3ce87767d94f.jpg
947934|947934|285306|1|done|1|5|1|media|0||200||947934|http://img.wennermedia.com/featured-promo-724/gettyimages-677867828-011000ae-d3d3-4282-a3c0-90e65948c103.jpg
947935|947935|285306|1|in_progress|0|5|1|media|0||||947935|http://img.wennermedia.com/featured-promo-620/gettyimages-694983590-08cae9bc-3238-4ead-a145-2b7db5aadc9d.jpg
947936|947936|285306|1|in_progress|0|5|1|media|0||||947936|http://img.wennermedia.com/featured-promo-320/rs-nsa-dae36da1-5b2a-412b-94c0-626a189aa691.jpg
947937|947937|285306|1|in_progress|0|5|1|media|0||||947937|http://img.wennermedia.com/featured-promo-724/gettyimages-693771966-16c8c921-aa31-4ac3-a2c1-f44bb8b1497d.jpg
947938|947938|285306|1|in_progress|0|5|1|media|0||||947938|http://img.wennermedia.com/homepage-horizontal-620/rachel-maddow-interview-rolling-stone-c863d19f-a37a-4028-88b7-6126445073fc.jpg
947939|947939|285306|1|todo|0|5|1|media|0||||947939|http://img.wennermedia.com/featured-promo-724/rtx3964z-4d14db17-d3bd-444e-a8d2-772bf39a9b2b.jpg
947940|947940|285306|1|todo|0|5|1|media|0||||947940|http://img.wennermedia.com/featured-promo-620/rtx3964z-4d14db17-d3bd-444e-a8d2-772bf39a9b2b.jpg
947941|947941|285306|1|todo|0|5|1|media|0||||947941|http://img.wennermedia.com/featured-promo-320/gettyimages-693790210-4bdf250e-1e3e-412f-be91-783ddddbcca2.jpg
947942|947942|285306|1|todo|0|5|1|media|0||||947942|http://img.wennermedia.com/featured-promo-320/comey-hearing-4e566ee5-7349-47c5-b181-cb7abde2e02c.jpg
947943|947943|285306|1|todo|0|5|1|media|0||||947943|http://img.wennermedia.com/featured-promo-320/gettyimages-693771966-16c8c921-aa31-4ac3-a2c1-f44bb8b1497d.jpg
The pipeline uses Google DNS (8.8.4.4 and 8.8.8.8), by the way.
ArchiveBot job 5e1ikgmizt88z0z4m5n1w7d8r crashed today (I think) on the same pipeline with the identical stack trace. Here's the queue from around the crash (same command as the previous comment; thanks, Cadbury):
5252132|5252132|1116069|1|skipped|1|6|||0||||5252132|http://pandce.proboards.com/post/10995715/quote/510257
5252133|5252133|1116069|1|done|1|6|||0||200||5252133|http://pandce.proboards.com/post/10995723
5252134|5252134|1116069|1|done|1|6|||0||200||5252134|http://pandce.proboards.com/post/10995804
5252135|5252135|1116069|1|done|1|6|||0||200||5252135|http://pandce.proboards.com/post/10995670
5252136|5252136|1116069|1|skipped|1|6|||0||||5252136|http://pandce.proboards.com/post/10995652/quote/510257
5252137|5252137|1116069|1|in_progress|0|6|||0||||5252137|http://pandce.proboards.com/post/11003291
5252138|5252138|1116067|1|in_progress|0|6|||0||||5252138|http://www.amazon.com/The-Crossing-Washington-American-evolution/dp/0439691869
5252139|5252139|1116067|1|in_progress|0|6|||0||||5252139|http://pandce.proboards.com/post/11054800/quote/512528
5252140|5252140|1116067|1|in_progress|0|6|||0||||5252140|http://pandce.proboards.com/post/11057539/quote/512528
5252141|5252141|1116067|1|todo|0|6|||0||||5252141|http://pandce.proboards.com/post/11056257
5252142|5252142|1116067|1|todo|0|6|||0||||5252142|http://pandce.proboards.com/post/11055209/quote/512528
5252143|5252143|1116067|1|todo|0|6|||0||||5252143|http://www.amazon.com/My-Dearest-Friend-Letters-Abigail/dp/0674057058
5252144|5252144|1116067|1|todo|0|6|||0||||5252144|http://pandce.proboards.com/post/11054717/quote/512528
5252145|5252145|1116067|1|todo|0|6|||0||||5252145|http://ads.proboards.com/ad.pl?as=300x250&ap=BTF&f=3435524&s=0&fc=7&d=pandce.proboards.com&g=&a=&cb=0.167331154137379&uid=0&name=desktop_bot_1
Again, this doesn't look suspicious to me. Certainly not a malformed URL or a particular broken host.
Just happened again this week on one of my pipelines (which have their own resolver, i.e. query the authorative DNS servers) on job c5b3srgf4agspk0xmmat3jwvg
. Nothing suspicious in the database at all; it was just retrieving forums.strainhunters.com
URLs during the crash.
After we had several of these crashes on Cadbury's pipeline in the past few days, I just looked at the source code a bit to try and figure out what's going wrong. Well, I don't have the slightest clue. The crash occurs while dnspython is trying to determine whether the nameserver IP is a multicast address. dnspython reads a list of nameservers from /etc/resolv.conf
when the dns.resolver.Resolver
object is created (during the startup of wpull). The only other modifications to that list of nameservers is shuffling it around (if that is enabled using something like options rotate
in resolv.conf
) and removing broken nameservers.
Then I looked at the lower level, specifically dns.ipv4.inet_aton
and its use in dns.inet.is_multicast
, and now I don't understand anything anymore. inet_aton
returns a bytes object of length four (e.g. b'\x7f\x00\x00\x01'
for localhost). Taking the first index of that will always return an int
(127, in this case). So by that logic, it should always raise a TypeError
(also in the ipv6
code). I'm probably missing something obvious; in my defence, it's past 4:00 here and I should've gone to bed hours ago. That code does look awful though with the generic except Exception
.
Anyway, I feel like we need to capture the ValueError
somewhere deep in dnspython and then raise our own exception which includes the value of the nameserver
variable (and probably also some other variables from that context) if we want to figure out what the hell is going on. Hooking into dns.query.udp
directly via monkey-patching is probably the easiest way for this.
I also found an issue on the dnspython repository with the same traceback: https://github.com/rthalley/dnspython/issues/302 . Unfortunately, the reporter there had the same experience as we do – the issue isn't reproducible. (NB, another issue on that repository, https://github.com/rthalley/dnspython/issues/235, is unrelated. Note that the traceback includes receive_udp
, i.e. it occurs on parsing the response, not the nameserver address.)
I've looked into this again. Specifically, I hammered wpull.network.dns.Resolver
with requests until it broke. I monkey-patched dns.query.udp
to print the parameters when the ValueError
is raised. I used domains from a job that crashed on an ArchiveBot pipeline recently and also ran the script on that same pipeline, though I don't think it matters. The script can be found here. I used Python 3.6.5.
It didn't take long for that script to crash with the same error. However, the output doesn't look suspicious at all:
...
Resolving en.wikipedia.org
Resolving www.homepower.com
Resolving www.legis.ga.gov
(<DNS message, ID 27443>, '8.8.8.8', 29.999916315078735, 53) {'source': None, 'source_port': 0}
b'k3\x01\x00\x00\x01\x00\x00\x00\x00\x00\x00\x03www\x05legis\x02ga\x03gov\x00\x00\x01\x00\x01'
Traceback (most recent call last):
File "/home/archivebot/.local/lib/python3.6/site-packages/dns/inet.py", line 104, in is_multicast
first = ord(dns.ipv4.inet_aton(text)[0])
TypeError: ord() expected string of length 1, but int found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/archivebot/.local/lib/python3.6/site-packages/dns/inet.py", line 108, in is_multicast
first = ord(dns.ipv6.inet_aton(text)[0])
File "/home/archivebot/.local/lib/python3.6/site-packages/dns/ipv6.py", line 153, in inet_aton
raise dns.exception.SyntaxError
dns.exception.SyntaxError: Text input is malformed.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "debugdns.py", line 148, in <module>
loop.run_until_complete(main(resolver, hosts))
File "/usr/lib/python3.6/asyncio/base_events.py", line 468, in run_until_complete
return future.result()
File "debugdns.py", line 143, in main
await resolver.resolve(host)
File "/home/archivebot/.local/lib/python3.6/site-packages/wpull/network/dns.py", line 206, in resolve
answer = yield from self._query_dns(host, family)
File "/home/archivebot/.local/lib/python3.6/site-packages/wpull/network/dns.py", line 255, in _query_dns
answer = yield from event_loop.run_in_executor(None, query)
File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/archivebot/.local/lib/python3.6/site-packages/dns/resolver.py", line 962, in query
source_port=source_port)
File "debugdns.py", line 11, in udp
return orig_dns_query_udp(message, *args, **kwargs)
File "/home/archivebot/.local/lib/python3.6/site-packages/dns/query.py", line 243, in udp
(dns.inet.is_multicast(where) and
File "/home/archivebot/.local/lib/python3.6/site-packages/dns/inet.py", line 111, in is_multicast
raise ValueError
ValueError
This at least confirms that the nameserver
(where
inside dns.query.udp
) value is fine.
I then realised that dns.query.udp
only actually calls is_multicast
in some cases, namely when _addresses_equal(af, from_address, destination)
is False
(short-circuiting in this if condition). As the name suggests, that function compares whether from_address
(i.e. the address from which the response packet came) is equal to destination
(i.e. where the query was sent). So is_multicast
is only used if the response packet source doesn't match the query packet destination.
I then had a look at is_multicast
and dns.ipv4.inet_aton
. My interpretation above was completely correct; is_multicast
will always fail, at least on Python 3.
It seems that this issue has been fixed since on the master branch of dnspython. The current implementation of is_multicast
(specifically, the maybe_ord
function defined in dns._compat
) calls ord
only on Python 2. There is however no release of dnspython yet with this fix.
Here's a script to test whether dnspython handles such packets correctly. Since it patches some internal methods of dns.query
, it'll likely break at some point and not work with old versions, but it definitely supports both the most recent release (1.15.0 from 2016) and the current master branch (commit 4e2a2a59). 1.15.0 causes the ValueError
, patched dns.inet.is_multicast
and 4e2a2a59 pass (i.e. time out).
This should be fixed with dnspython version 1.16.0, but I haven't verified that yet.
Earlier today, an ArchiveBot job of mine (ID bsrj06s1dk5fv0nepd9ke1clw) crashed in the DNS handling with the following stacktrace:
Unfortunately, I don't know the exact domain name or other circumstances which caused this crash as ArchiveBot's dashboard isn't exactly verbose. Perhaps the person running the corresponding pipeline could provide more details.