DNSCrypt / dnscrypt-proxy

dnscrypt-proxy 2 - A flexible DNS proxy, with support for encrypted DNS protocols.
https://dnscrypt.info
ISC License

Errors on generate-domains-blacklist.py #449

Closed by pguizeline 6 years ago

pguizeline commented 6 years ago

Hi!

First, let me thank you for this awesome tool! It works great! I've just noticed a couple of little annoyances. I'm using Wally3k's ticked lists as my domains-blacklist.conf since it's the closest thing to a set-it-and-forget-it solution; that way my domains-whitelist.txt needs very little maintenance.

While running "python generate-domains-blacklist.py > blacklist.txt" I need to comment out two blocklists: 1) https://adaway.org/hosts.txt 2) https://www.malwaredomainlist.com/hostslist/hosts.txt

For the first one, I think I've found the issue: it's related to the user agent, according to this: https://github.com/pi-hole/pi-hole/pull/1366

For the second link, I think the problem lies in its instability: when it times out, the whole script halts and I'm left with a blank blacklist.txt file.

I'm sorry I can't help with any pull request to this since I'm code dumb. Either way, for now I'm using the proxy with DoH + blocklists and it's by far the fastest and easiest solution at this time!

Thanks for all your work!

jedisct1 commented 6 years ago

We could work around Adaway's rules by changing the user agent and pretending to be a browser.

From a different perspective, if they explicitly block robots, we should respect their choice instead of playing cat and mouse to bypass their rules.

Maybe one thing we can do is add an option to use a custom user agent, but by default, keep the current one that doesn't lie.
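As a rough illustration of that idea, here is a minimal Python sketch of a fetch helper that keeps an honest default user agent and only changes it when the caller explicitly asks. The names `DEFAULT_USER_AGENT`, `build_request` and `fetch` are hypothetical, and the default string is a placeholder; none of this is taken from generate-domains-blacklist.py.

```python
import urllib.request

# Placeholder value: the script's real user agent string is not shown in this thread.
DEFAULT_USER_AGENT = "blocklist-fetcher-example/0.1"


def build_request(url, user_agent=None):
    """Build a request that uses the honest default unless a custom agent is given."""
    headers = {"User-Agent": user_agent or DEFAULT_USER_AGENT}
    return urllib.request.Request(url, headers=headers)


def fetch(url, user_agent=None, timeout=30):
    """Download a list and return its body as text."""
    request = build_request(url, user_agent)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With this shape, overriding the user agent is an explicit choice made by whoever runs the script, and the default behavior stays unchanged.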

Unreliable sources are an issue. When something goes wrong, the script exits with a non-zero exit code. So you can store everything in a temporary file, and rename it only after the whole script completes successfully. This is what I do to generate the mybase.txt file. You never end up with empty files that way. The worst that can happen is that you keep using the current good version.
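For anyone who would rather drive this from Python than from a shell, the same write-then-rename pattern could look roughly like the sketch below. The helper name `regenerate` and the `.tmp` suffix are assumptions made for illustration; the only thing relied on is that the generator exits with a non-zero code on failure, as described above.

```python
import os
import subprocess
import sys


def regenerate(command, target):
    """Run command with stdout sent to a temporary file; replace target only on success."""
    tmp = target + ".tmp"  # assumed naming convention for the temporary file
    with open(tmp, "wb") as out:
        result = subprocess.run(command, stdout=out)
    if result.returncode != 0:
        # Generation failed: discard the partial output and keep the last good file.
        os.remove(tmp)
        return False
    # os.replace overwrites the target in one step when both paths are on the same filesystem.
    os.replace(tmp, target)
    return True


if __name__ == "__main__":
    ok = regenerate(
        [sys.executable, "generate-domains-blacklist.py", "-t", "30"],
        "blacklist.txt",
    )
    sys.exit(0 if ok else 1)
```

If a source times out, the previous blacklist.txt is left untouched and the helper returns False, so a cron job can report the failure without the proxy ever loading an empty list.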

jedisct1 commented 6 years ago

That's the exact script I use to generate mybase.txt:

cd /home/j/src/dnscrypt-proxy/utils/generate-domains-blacklists && \
 python generate-domains-blacklist.py -t 30 > \
 /var/www/www/download.dnscrypt/blacklists/domains/mybase.txt.tmp &&
 mv -f /var/www/www/download.dnscrypt/blacklists/domains/mybase.txt.tmp \
 /var/www/www/download.dnscrypt/blacklists/domains/mybase.txt

pguizeline commented 6 years ago

Hi Frank! Thanks for your reply!

You're completely right: if Adaway blocks robots, we shouldn't crawl their address. Even spoofing the user agent would be kind of a rude move.

I've tweaked your script a little bit and it's working exactly as expected! Now when malwaredomainlist fails I don't lose my original blacklist.txt.

Thanks again for all your help! I should close this ticket, right?