ArchiveTeam / warrior-code2

Boot scripts for the ArchiveTeam Warrior 2
The Unlicense

The Warrior should use its own DNS resolver to avoid "helpful" search pages, etc #4

Open hannahwhy opened 10 years ago

hannahwhy commented 10 years ago

In the wretch.cc grab, some Warriors are returning this sort of stuff in their wget.log WARC records:

2013-12-29 08:20:35 URL:http://www.website-unavailable.com/?wc=EWJrGhd5BxxfBBxwGAkKEw==&url=www%2Ewretch%2Ecc%2Fblog%2Fst20281 [5495/5495] -> "/data/data/projects/wretch-content-ff4284b/data/13882276102a527b03b74ba313-681/st20281/wget.tmp" [1]

The website-unavailable.com stuff is an OpenDNS "service" that redirects users to search pages on DNS lookup failure.

This is a source of inconsistency in the Warrior that we can (and should) eliminate. DNS lookup errors ought to be reported (and recorded!) the same way across all grabbers.
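One way a Warrior could notice this situation is to look up names that should not exist and see whether the resolver answers anyway. The following is a minimal sketch, not anything currently in the Warrior code: the function names are hypothetical, and it assumes that a random 32-character label under `.com` is unregistered, so any successful lookup means the upstream resolver is rewriting failures.

```python
import secrets
import socket


def random_probe_name(suffix="com"):
    """Build a hostname that is assumed (not guaranteed) to be unregistered."""
    return f"{secrets.token_hex(16)}.{suffix}"


def nxdomain_is_hijacked(resolve=socket.gethostbyname, probes=3):
    """Return True if any lookup of a nonexistent name yields an address.

    `resolve` is injectable so the check can be exercised without a network.
    """
    for _ in range(probes):
        try:
            resolve(random_probe_name())
        except OSError:
            # socket.gaierror is a subclass of OSError: the lookup failed,
            # which is the honest behaviour we want to see.
            continue
        return True
    return False
```

Calling `nxdomain_is_hijacked()` with the default resolver performs real lookups through whatever nameservers the system is configured with, so behind OpenDNS-style redirection it would be expected to return `True`.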

To eliminate this problem, I propose that the Warrior run its own DNS resolver and cache and that the Warrior VM be set to use it. I prefer djbdns or the Debian dbndns fork, but there are other good choices.

hannahwhy commented 10 years ago

One potential problem: some people may be running Warriors in environments that forbid outgoing DNS queries to anything outside of a predefined set of DNS servers. (I've never heard of this, but it's definitely possible.)

I think this could be addressed by using the Warrior's resolver as the primary nameserver and the DHCP-provided nameservers as secondary nameservers. It would also be nice to log when this condition is detected, so that we can get some idea of how common this sort of thing is.
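A rough sketch of that ordering and of the logging idea, under some assumptions: the local resolver listens on 127.0.0.1, the DHCP-provided servers are already known as a list of strings, and the two query callables are hypothetical stand-ins for "ask the local resolver" and "ask a DHCP-provided server". The cap of three entries reflects the glibc stub resolver's limit on `nameserver` lines.

```python
import logging

log = logging.getLogger("warrior.dns")  # hypothetical logger name


def build_resolv_conf(dhcp_servers, local="127.0.0.1", max_servers=3):
    """Render resolv.conf text with the local resolver listed first."""
    ordered = [local]
    for server in dhcp_servers:
        if server not in ordered:  # skip duplicates, keep DHCP order
            ordered.append(server)
    return "".join(f"nameserver {s}\n" for s in ordered[:max_servers])


def outbound_dns_restricted(query_local, query_dhcp, name="archive.org"):
    """Return True (and log) when only the DHCP-provided servers can resolve."""
    try:
        query_local(name)
        return False  # the local resolver works; nothing to report
    except OSError:
        pass
    try:
        query_dhcp(name)
    except OSError:
        return False  # both paths fail; more likely a general network problem
    log.warning("local resolver failed for %s; DHCP nameservers still work", name)
    return True
```

For example, `build_resolv_conf(["192.168.1.1"])` produces a file that tries 127.0.0.1 first and 192.168.1.1 second. Note that a stub resolver falls through to later entries when a server gives no usable response, not when it returns a genuine "no such name" answer, so a working local resolver's failures would still be reported as failures.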

hannahwhy commented 10 years ago

So I just saw this:

https://github.com/ArchiveTeam/warrior-code2/blob/master/warrior-install.sh#L41-L49

I was under the impression that dnsmasq provided DNS resolution service on its own. Is this true? (If it is, that makes this behavior a mystery to me.)

chfoo commented 10 years ago

I thought dnsmasq was just used for caching DNS requests. The virtual machine should either be passing the host's DNS settings or providing its own DNS server that forwards requests to the host's DNS settings. dnsmasq should be picking up these servers from the virtual machine.

Cross-reference: ArchiveTeam/seesaw-kit#28