EFForg / badger-sett

Automated training for Privacy Badger. Badger Sett automates browsers to visit websites to produce fresh Privacy Badger tracker data.
https://www.eff.org/badger-pretraining
MIT License

Clean list of domains to visit by removing invalid entries (non-websites, etc.) #21

Closed (ghostwords closed this issue 4 years ago)

ghostwords commented 6 years ago

We could make a little script using https://funilrys.github.io/PyFunceble/ that will clean our list of domains to visit before we visit them. This should speed up the crawl and reduce our error rate, as most failures are caused by unreachable websites: https://github.com/EFForg/badger-sett/issues/18#issuecomment-407625356
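A minimal sketch of what such a cleaning script could look like. It does not use PyFunceble; as a stand-in, it only asks the system resolver whether the name has an address, which is a much weaker test than PyFunceble's status check. The names `resolves` and `clean_domains` are hypothetical and are not part of badger-sett.

```python
import socket
from typing import Callable, Iterable, List


def resolves(domain: str) -> bool:
    """Return True if the system resolver finds an address for the domain.

    Assumption: a plain DNS lookup is used here as a stand-in for a real
    availability check such as the one PyFunceble performs.
    """
    try:
        socket.getaddrinfo(domain, 443)
        return True
    except (socket.gaierror, UnicodeError):
        # gaierror: the name did not resolve; UnicodeError: invalid label.
        return False


def clean_domains(domains: Iterable[str],
                  is_reachable: Callable[[str], bool] = resolves) -> List[str]:
    """Normalize, deduplicate, and keep only domains that pass the check."""
    seen = set()
    kept = []
    for raw in domains:
        domain = raw.strip().lower()
        if not domain or domain in seen:
            continue
        seen.add(domain)
        if is_reachable(domain):
            kept.append(domain)
    return kept
```

The check is passed in as a parameter so that it can be swapped for a different validator, or for a stub when testing without network access.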

bcyphers commented 6 years ago

I tried out PyFunceble in #23. Only 45 of the top 2k sites are not "ACTIVE", so the tool is helpful but not perfect. It's unclear how our scan is generating so many DNS NOT FOUND errors that PyFunceble seems to ignore.

Another issue we have to deal with that might not be covered by https://github.com/EFForg/badger-sett/issues/18#issuecomment-407625356 is all the content-hosting subdomains and URL shorteners that are present in the Majestic list (e.g. goo.gl, t.co, wp.com, 1.bp.blogspot.com). It's really a shame that Alexa is out of date, because it seemed much more skewed towards sites people actually visit.
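One way to handle these entries, sketched under assumptions: keep a small suffix blocklist and drop any domain that equals or falls under one of its entries. The blocklist below is hypothetical and seeded only with the examples named in this comment; a real list would need to be curated.

```python
from typing import Iterable, List, Sequence

# Hypothetical blocklist built from the examples in the comment above.
NON_SITE_SUFFIXES = ("goo.gl", "t.co", "wp.com", "bp.blogspot.com")


def is_non_site(domain: str,
                suffixes: Sequence[str] = NON_SITE_SUFFIXES) -> bool:
    """Return True if the domain equals or is a subdomain of a listed suffix."""
    domain = domain.strip().lower().rstrip(".")
    return any(domain == s or domain.endswith("." + s) for s in suffixes)


def drop_non_sites(domains: Iterable[str],
                   suffixes: Sequence[str] = NON_SITE_SUFFIXES) -> List[str]:
    """Filter out URL shorteners and content-hosting subdomains."""
    return [d for d in domains if not is_non_site(d, suffixes)]
```

Matching on a leading dot avoids false positives such as `xt.co` being treated as `t.co`.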

funilrys commented 6 years ago

~Hi there,~ ~developer of PyFunceble in here!~

~Thanks for using PyFunceble!~

~Just wanted to say that I'm ready to take all remarks on PyFunceble. So if you find something that is not correct or you have a question, please let me know. If you prefer to contact me personally, please do! You can contact me by email or Keybase.~

~I'd be interested to see how you run your scan to find DNS NOT FOUND, as it might help me improve the tool and solve that problem. Is it a script or something else?~

~About subdomains, let me redirect you to the SPECIAL source documentation. Indeed, we already cover blogspot domains but I'll be happy to improve my tool for other content-hosting providers.~

Have a nice day/night.

Cheers, Nissar

jawz101 commented 5 years ago

fwiw - my personal experience is that I have to do at least one additional pass on my INACTIVE hosts before I consider them INACTIVE. I have a local pfSense box running Unbound as my DNS resolver with Quad9 set as the upstream resolver.

I don't know if it's because PyFunceble tries to process the list very fast, but I almost think the INACTIVE list should automatically be processed at least one more time before a host is finally marked as INACTIVE.
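The extra pass described above could be sketched as follows. This is not PyFunceble's behavior or API; `check_with_retries` and the `is_active` callback are hypothetical names, and the callback stands in for whatever check is actually used.

```python
import time
from typing import Callable, Iterable, List, Tuple


def check_with_retries(domains: Iterable[str],
                       is_active: Callable[[str], bool],
                       passes: int = 2,
                       delay: float = 0.0) -> Tuple[List[str], List[str]]:
    """Re-check failing domains before declaring them inactive.

    Returns (active, inactive). A domain is inactive only if it failed
    in every one of the `passes` rounds.
    """
    active: List[str] = []
    pending = list(domains)
    for attempt in range(passes):
        if not pending:
            break
        if attempt > 0 and delay > 0:
            # Optional pause so transient resolver failures can clear.
            time.sleep(delay)
        still_failing = []
        for domain in pending:
            if is_active(domain):
                active.append(domain)
            else:
                still_failing.append(domain)
        pending = still_failing
    return active, pending
```

Only the domains that failed the previous round are re-checked, so the cost of the second pass scales with the size of the INACTIVE list rather than the full list.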

funilrys commented 5 years ago

@jawz101 Interesting behavior. Do you have some data for me?

Also, did you try passing your DNS server IP directly to PyFunceble?

Cheers, Nissar

ghostwords commented 4 years ago

This should have been resolved by #23. Since then, though, we have switched to the Tranco list in #45, which seems to mostly obviate the need to validate domains.