Closed ghostwords closed 4 years ago
I tried out PyFunceble in #23. Only 45 of the top 2k sites are not "ACTIVE", so the tool is helpful but not perfect. It's unclear how our scan is generating so many DNS NOT FOUND errors that PyFunceble seems to ignore.
Another issue we have to deal with that might not be covered by https://github.com/EFForg/badger-sett/issues/18#issuecomment-407625356 is all the content-hosting subdomains and URL shorteners that are present in the Majestic list (e.g. goo.gl, t.co, wp.com, 1.bp.blogspot.com). It's really a shame that Alexa is out of date, because it seemed much more skewed towards sites people actually visit.
~Hi there,~ ~developer of PyFunceble here!~
~Thanks for using PyFunceble!~
~Just wanted to say that I'm ready to take all remarks on PyFunceble. So if you find something that is not correct, or you have a question, please let me know. If you prefer to contact me personally, please do! You can contact me by email or Keybase.~
~I'd be interested to see how you run your scan to find DNS NOT FOUND, as it might help me improve PyFunceble and solve that problem. Is it a script or something else?~
~About subdomains, let me redirect you to the SPECIAL source documentation. Indeed, we already cover blogspot domains, but I'll be happy to improve my tool for other content-hosting providers.~
Have a nice day/night.
Cheers, Nissar
fwiw - my personal experience is that I have to do at least an additional pass on my INACTIVE hosts before I consider it INACTIVE. I have a local pfSense box running Unbound as my DNS resolver with Quad9 set as the upstream resolver.
I don't know if it is that PyFunceble tries to process the list very fast but I almost think the INACTIVE list should be automatically processed at least 1 more time before making its final decision to mark it as INACTIVE.
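The extra pass described above could be sketched roughly as follows. This is only an illustration, not PyFunceble's behavior: `confirm_inactive`, `is_active`, `extra_passes`, and `delay` are hypothetical names, and `is_active` stands in for whatever check you actually run (PyFunceble, a DNS lookup, an HTTP request).

```python
import time
from typing import Callable, Iterable, List, Tuple


def confirm_inactive(
    domains: Iterable[str],
    is_active: Callable[[str], bool],
    extra_passes: int = 1,
    delay: float = 0.0,
) -> Tuple[List[str], List[str]]:
    """Return (active, inactive), rechecking inactive domains before giving up.

    `is_active` is a caller-supplied check (hypothetical here). A domain is
    only reported as inactive if it fails the first pass and every extra pass.
    """
    active: List[str] = []
    pending = list(domains)

    for attempt in range(1 + extra_passes):
        still_inactive: List[str] = []
        for domain in pending:
            if is_active(domain):
                active.append(domain)
            else:
                still_inactive.append(domain)
        pending = still_inactive

        if not pending:
            break
        # Optionally wait between passes so transient resolver failures can clear.
        if attempt < extra_passes and delay:
            time.sleep(delay)

    return active, pending
```

A domain that fails once because of a transient resolver hiccup gets picked up on the second pass, while a domain that fails every pass stays in the inactive list.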
@jawz101 Interesting behavior. Do you have some data for me?
Also, did you try passing your DNS server IP directly to PyFunceble?
Cheers, Nissar
This should have been resolved by #23. At this point though we switched to the Tranco list in #45, which seems to mostly obviate the need to validate domains.
We could make a little script using https://funilrys.github.io/PyFunceble/ that will clean our list of domains to visit before we visit them. This should speed up the crawl and reduce our error rate, as most failures are caused by unreachable websites: https://github.com/EFForg/badger-sett/issues/18#issuecomment-407625356
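A minimal sketch of such a cleaning script might look like the one below. Note the assumptions: it does not call PyFunceble at all. A plain DNS lookup through the standard library's `socket.getaddrinfo` is swapped in as a much cruder stand-in for the availability check, and the names `resolves`, `clean_domain_list`, and `is_reachable` are made up for illustration. The check is injectable, so a PyFunceble-based one could be dropped in instead.

```python
import socket
from typing import Callable, Iterable, List


def resolves(domain: str, port: int = 443) -> bool:
    """Crude stand-in check: does the domain resolve via the system resolver?"""
    try:
        socket.getaddrinfo(domain, port)
        return True
    except (socket.gaierror, UnicodeError):
        # gaierror: lookup failed; UnicodeError: name is not a valid hostname.
        return False


def clean_domain_list(
    domains: Iterable[str],
    is_reachable: Callable[[str], bool] = resolves,
) -> List[str]:
    """Normalize, de-duplicate, and keep only domains that pass the check."""
    seen = set()
    kept: List[str] = []
    for raw in domains:
        domain = raw.strip().lower()
        # Skip blank lines, comment lines, and repeats.
        if not domain or domain.startswith("#") or domain in seen:
            continue
        seen.add(domain)
        if is_reachable(domain):
            kept.append(domain)
    return kept
```

Running the crawl over the output of `clean_domain_list` rather than the raw list would skip names that do not resolve at all, although a name that resolves can still fail to load for other reasons.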