StevenBlack / hosts

🔒 Consolidating and extending hosts files from several well-curated sources. Optionally pick extensions for porn, social media, and other categories.

Funceble the Aggregate #412

Closed ScriptTiger closed 6 years ago

ScriptTiger commented 6 years ago

I don't know if @funilrys is just being cautious or shy or what, but I will raise this topic for him. Would it make sense to integrate funceble into this repo and cleanse the aggregate as part of the hosts generation every time? Is a PR needed/wanted? It seems like the resultant files would just get an added layer of curation, but I think @mitchellkrogza and @funilrys are best suited to answer this. Obviously it would be redundant for the entries taken from @mitchellkrogza, as he already uses it, but it may be useful as an added layer of curation, as I said, on top of the other sources that don't use it.

Of course the time it takes to generate the files would be impacted, so that would be entirely up to @StevenBlack. But I know @funilrys and @mitchellkrogza have been tweaking a bunch, so I'm not sure how much they can vary the generation times, or any other specs @StevenBlack might want to tune to keep things in line with the project's core purposes.

Could anyone who has experience with this provide estimates of its efficacy if run on this repo? Things like generation times, the diff percentage of cleansed vs. original, etc.?

mitchellkrogza commented 6 years ago

@ScriptTiger I have set up a central repo for controlling whitelisting, false positives, and dead domains, and I use this data to strip out relevant entries from my other repos. Instead of reinventing the wheel, why not just contribute to the central repo and use this one source for controlling removal of dead/inactive domains, false positives, and whitelisted domains?

https://github.com/mitchellkrogza/CENTRAL-REPO.Dead.Inactive.Whitelisted.Domains.For.Hosts.Projects

My Ultimate Hosts repo at https://github.com/mitchellkrogza/Ultimate.Hosts.Blacklist already uses this central repo to do list cleaning during builds, and the same goes for Badd Boyz Hosts: https://github.com/mitchellkrogza/Badd-Boyz-Hosts

Efforts might be best spent sending in PRs and issues on the central repo, as I have found this is truly the best way forward.

ScriptTiger commented 6 years ago

Would you be open to expanding your central whitelist to support the extensions here? If you added the inactive domains from all of the extensions here (or just from the most restrictive hosts file, which includes all of the extensions: https://raw.githubusercontent.com/StevenBlack/hosts/master/alternates/fakenews-gambling-porn-social/hosts), it would essentially do the same thing. If you're not using an extension, it won't affect you, because your list won't have that entry to whitelist. And if you are using an extension, then it will streamline your hosts file and strip out the dead entries. So rather than using funceble at all, people would be using your pre-generated funceble whitelist to speed things up.

Keeping the above in mind, this repo could then easily apply your whitelist as its default source whitelist and massively speed up cleansing on generation, rather than checking all of the domains itself. So it would be using your project and this one together, basically. I therefore think a PR to this project to apply your whitelist as the default source whitelist would still be a good idea, just so people don't have to run around to multiple repos and combine things themselves. When @StevenBlack generates his files, your whitelist would already be applied to all of them with minimal impact on generation times.

And I am sure this goes without saying, but if we were to trust any single person with the repo's default whitelist, it would definitely be you, @mitchellkrogza! And we will just trust you and @funilrys to tag team the logistics of how it gets generated.

funilrys commented 6 years ago

Good night to everybody reading this now, good morning to those who are sleeping now, Hello World to everybody else reading this thread, and thanks to @ScriptTiger for starting it.

To answer @ScriptTiger's question: I'm certainly shy, but I also prefer to be prudent about using Funceble against this repository, as it's composed of many sources. That's mainly why I started working on dead-host a few days after the first release (1.0.0) of Funceble. I started updating it with my own server and computers before I met @mitchellkrogza, who explained the basics of Travis CI to me (which, at the time, I didn't know about).

Today we are at 1.4.0, 2.0.0 is coming soon, and dead-host works 100% with Travis CI. A lot of work has been done in almost 8 months, and we can note that 73% of the sources listed in this repository are present in dead-host.

Why not 100%? Why does dead-host look abandoned?

Because I was busy solving some private problems, and at the same time I was programming what will become the next release of Funceble, along with other open source projects I'm involved in.

Please note: I started talking about Funceble in #259, but we ended with the conclusion that Funceble wasn't really good at that time, and I was really bad at explaining its purpose ...

So much for the context; let's answer the deeper questions.


Would it make sense to integrate funceble into this repo and cleanse the aggregate as part of the hosts generation every time?

It sure would make sense to integrate funceble into this repository, but as mentioned before, I think that the structure of this repository (pulling from sources) works against the way Funceble would be used. Indeed, I have read all the issues posted here since I started Funceble, and there are several discussions about copyright. @StevenBlack always ended with the conclusion that we shouldn't (and rightly so) break copyright notices, comments, or warnings about the usage of the original list (or source, if you prefer). So we have to find a usage that at the same time respects the different copyrights ... But a compromise can be found. I don't know how, but let's imagine that everything is possible.

You can note that of the 73% of sources present in dead-host, 73% (and I didn't say 100%) have updated their list at least once according to Funceble or dead-host results.


PR needed/wanted?

I think @StevenBlack will say that a PR is always welcome, but it should respect copyrights and also not break anything that already exists.


It seems like the resultant files would just get an added layer of curation

As I write this, 1.4.0 is great, but 2.0.0 will be greater, because I have reviewed almost everything and I'm still not done with the optimisation. 2.0.0 is also more precise and handles many cases that I wouldn't have thought about without Mitch's (@mitchellkrogza) tests. So yes, using 2.0.0 does add a layer of curation...


I'm not sure how much they can vary the generation times, or any other specs that @StevenBlack might be interested in tweaking to be in keeping with the project's core purposes.

At the time of that discussion, the only comment from @StevenBlack was in #259, when he said (about Funceble):

I think whois is the weak link in all this. Not much can be done about that. Maybe the flow should bypass whois if it times out or returns an error.

So that's mainly why the default timeout (which was introduced after our discussions) for almost every command has been changed from 30 seconds to 1 second (a sketch of that bypass idea is below).
But compared to @pi-hole or @mitchellkrogza's Ultimate.Hosts.Blacklist, the generation time should be less than a day using Travis CI; @mitchellkrogza can confirm that point, as he is an excellent Travis CI "breaker" and user.
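
On the "bypass whois if it times out" idea quoted above, here is a minimal sketch, assuming the system `whois` client is on the PATH; this only illustrates the timeout-and-skip flow, not Funceble's actual implementation:

```python
# Sketch only: approximate the "skip whois when it stalls" flow with the
# system whois client and the 1-second default mentioned above.
import subprocess

def whois_or_skip(domain, timeout=1.0):
    """Return whois output, or None when whois times out or is missing."""
    try:
        result = subprocess.run(
            ["whois", domain],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
        return result.stdout
    except (subprocess.TimeoutExpired, FileNotFoundError):
        # Don't block the whole run on a slow registry: fall through to
        # the next check (DNS lookup, HTTP status code, etc.).
        return None
```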


Could anyone that has experience with this provide estimates for the efficacy of this if run on this repo? Like generation times, diff percentage of cleansed to original, etc.?

If @StevenBlack wants, or if the audience mentioned here wants to know, I can add the unified hosts list to dead-host and test it, so we would know exactly how much time it takes.


Instead of re inventing the wheel why not just contribute to the central repo and use this one source for controlling removal of dead / inactive domains, false positives and whitelisted domains?

The only problem I can see is that, as I write this, we have domains such as facebook.com or even cnn.com listed there, so maybe, as suggested by @ScriptTiger, working with extensions after a restructuring may be a better way to work. (A new central repo only for this repository?)


If you read everything, thank you, and you deserve a :1st_place_medal:. For the others: well, you should read everything to understand, but I can still answer every question one by one :+1:

@ScriptTiger @mitchellkrogza @StevenBlack In conclusion: a lot of work, reflection, design, and for sure a lot of reading is waiting for us, but I hope we'll find a solution to the problem raised by @ScriptTiger and others who think the same way or want the "cleanest list ever".

ScriptTiger commented 6 years ago

I think an easy way to get around deleting entries, if you are worried about copyright, would be to automatically prepend "#" to the line and comment entries out as part of the script. That way all of the data stays intact, just not necessarily active.
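
A minimal sketch of that comment-out pass, assuming a plain-text whitelist of dead domains (one per line) and standard `0.0.0.0 example.com` hosts entries; the file names here are placeholders, not this repo's actual build inputs:

```python
# Hypothetical sketch: prepend "#" to any hosts entry whose domain appears
# in a pre-generated dead-domain whitelist. "whitelist.txt" and "hosts"
# are placeholder paths.

def comment_out_dead(hosts_path="hosts", whitelist_path="whitelist.txt"):
    with open(whitelist_path, encoding="utf-8") as f:
        dead = {line.strip().lower() for line in f
                if line.strip() and not line.startswith("#")}

    out = []
    with open(hosts_path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            # Hosts entries look like "0.0.0.0 example.com"; leave
            # comments and blank lines untouched.
            if (len(parts) >= 2 and not line.lstrip().startswith("#")
                    and parts[1].lower() in dead):
                out.append("# " + line)
            else:
                out.append(line)

    with open(hosts_path, "w", encoding="utf-8") as f:
        f.writelines(out)
```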

And we totally understand you're still in development, which makes integration with another project somewhat of a hassle. In that case, you and @mitchellkrogza can just handle the generation separately and make the resultant unified hosts whitelist available to this project, as stated in my second post. That way your development can go on as usual, we only use the end-product file and not the scripts, and no one in this repo has to get involved in the whitelist generation process. Later on, once you're more established, we can revisit the integration topic.

mitchellkrogza commented 6 years ago

@ScriptTiger as always a good discussion here.

I put a lot of thought and effort into a false-positive control and domain-whitelisting system and came up with the Central Repo idea, which @StevenBlack commended. I've now put hundreds of hours into testing lists to make sure that the centralized list is, let's say, 99.9% accurate.

I've had a lot of help and insight along the way from @funilrys, with his amazing funceble script; he never ceases to implement any crazy and insane new feature I ask for. He had never heard of Travis, as he mentioned, and I helped him get into it. One day I threw a crazy idea at him: implement a Travis autosave feature, something seemingly impossible. Sure enough, within 7-10 days he had it done, and I continually commend him for his excellent programming skills.

I've also had a lot of help and insight from @maravento, who included several of my projects in BlackWeb; through his sheer generosity and time he has helped me scale down the CENTRAL REPO into a much more sensible format and a much better controlled list.

So I'm not saying which approach is best here, but I can say that adding funceble into this repo would have it running tests that are already being run on dead-hosts and various other testing repos of mine. It might just be wasted time and effort, redoing what's already been done.

@StevenBlack already has an issue with this repo's size, which has grown, and I can guarantee that with regular and continuous funceble testing it will grow into the GBs before he knows it. My Ultimate Hosts repo grew to a whopping 3.8 GB, which I had to seriously clean using BFG Repo Cleaner, and I have to regularly clean and prune it due to the sheer magnitude of the file sizes involved.

ScriptTiger commented 6 years ago

So if we just used your central repo as the source for a whitelist, similar to how this repo sources hosts files for extensions, and didn't integrate with funceble at all, the whitelist itself would be manageable, yes? If we did this, as @funilrys stated:

Instead of re inventing the wheel why not just contribute to the central repo and use this one source for controlling removal of dead / inactive domains, false positives and whitelisted domains?

The only problem I can see there is at the time I write this, we have domains such as facebook.com or even cnn.com listed in there so maybe as suggested by @ScriptTiger working with extension after a restructuration may be a better way to work. (A new central repo only for this repository ?)

@mitchellkrogza, will this be easy for you to implement?

mitchellkrogza commented 6 years ago

@ScriptTiger I can implement anything we like or need.

The reason facebook, cnn, etc. are in the central repo is that they are whitelisted, innocent domains.

I don't foresee any reason they should be on any block lists, but I can foresee that some parents simply don't want their kids on facebook, twitter, and other social media sites.

I've seen a lot of innocent domains falsely listed across my input list of now 29 data sources on Ultimate Hosts, and @xxcriticxx has helped me a lot in fine-tuning whitelisting and false positives.

It's tricky getting whitelisting and false positives accurate while keeping a hosts file suitable to all people's needs.

mitchellkrogza commented 6 years ago

@ScriptTiger I could very easily add a _Steven_Black folder into the CENTRAL REPO, clone what is currently in the main repo, and then together we can fine-tune it for the @StevenBlack repo. It would have its own set of generator scripts, so any pull requests and changes would only modify the _Steven_Black version of the files. Your ideas? It would take me a mere 20 minutes to set this up.

ScriptTiger commented 6 years ago

@StevenBlack, does this sound good to you? Applying @mitchellkrogza's whitelist automatically to the hosts files generated in this repo? I think the "whitelist" in the root directory should be left as-is and empty for users to customize their own lists, but @mitchellkrogza's whitelist should be imported to the repo similar to how the various source hosts files are imported and pulled from to generate the hosts files. The only difference is that matching entries will be commented out or deleted (whatever we decide on for licensing purposes), instead of actually adding any new entries as the other sources do. This would be the final scrub layer of the process, and you can use the same general whitelist for all combinations/extensions of the unified hosts files, since dead domains are dead domains no matter what list they are or aren't on.

And just as a recap: the whitelist @mitchellkrogza will be generating is a list of all dead domains from the most restrictive hosts file here, since it has all possible domains. That hosts file is currently this one:
https://raw.githubusercontent.com/StevenBlack/hosts/master/alternates/fakenews-gambling-porn-social/hosts

For now we can just wait for @mitchellkrogza to get the whitelist set up, and then we can do some manual tests on it to check for efficacy and see the impact it has on your hosts files. And then if everything checks out for you, we can roll it out to be automatically imported and applied by the scripts.

As for my personal opinion on the whole commenting/deleting debate over what happens to dead-domain entries, I think deleting them should be fine. The sources are still given credit; we are just making functional edits. Or we can tie this into https://github.com/StevenBlack/hosts/issues/410 and give people the option of commenting out or deleting. Right now, matching entries from multiple sources are deleted, yes? So I think just continuing that here should be fine, as they are functional edits.

StevenBlack commented 6 years ago

@ScriptTiger @mitchellkrogza I'll think about it.

I'd be interested to know, approximately, how many domain rejections this would imply. Need to know this, roughly, to assess the risk of type II errors (false rejections).

ScriptTiger commented 6 years ago

I'm sure @mitchellkrogza can get back to you on that, but it does bring up an interesting concern. Maybe adding other statistics to the central repo would help decrease the possibility of false positives? We could keep a "last detected" date and a "consecutive missed detections" count on file, and decrease the probability of false positives in the whitelist by implementing thresholds: for example, only whitelist domains that have missed three consecutive detections or that haven't been detected in a month (see the sketch below). Knowing the frequency with which @mitchellkrogza plans to update the list would also play into this.
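
A hedged sketch of that threshold rule, assuming per-domain records shaped like the columns proposed further down this thread (domain, last_detected, missed_count); this is an illustration, not the central repo's real schema:

```python
# Illustration of the threshold idea: a domain becomes whitelist-eligible
# after 3 consecutive missed detections, or when it hasn't been detected
# for 30 days. Both thresholds are the example values from the discussion.
from datetime import datetime, timedelta

MISS_THRESHOLD = 3
STALE_AFTER = timedelta(days=30)

def should_whitelist(last_detected, missed_count, now=None):
    """last_detected uses the MM-DD-YYYY form shown later in this thread."""
    now = now or datetime.utcnow()
    last = datetime.strptime(last_detected, "%m-%d-%Y")
    return missed_count >= MISS_THRESHOLD or now - last >= STALE_AFTER

# Example: 3 consecutive misses -> eligible for the dead-domain whitelist.
print(should_whitelist("10-06-2017", 3))  # True
```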

mitchellkrogza commented 6 years ago

@ScriptTiger @StevenBlack I have spent the better part of this morning improving the stats on the front page, as per the latest screenshot below. I will now clone the main repo into a subfolder, which will be an exact clone except with its own build scripts, giving you all the ability to send PRs to the Steven.Black section of the repo; it will rebuild and add/remove whatever was sent in the PR without affecting the main repo.

@StevenBlack I think you will see from the stats below how many domains could be stripped, but I can assure you each list has already been tested up to 5 times to confirm accuracy, and @maravento, who runs the BlackWeb project, has approved its accuracy after his own extensive testing.

Basically, just because it says Total Dead Domains: 27,291 does not mean it will strip 27,291 domains from your hosts file when you generate it. It will only strip (or comment out) those that exist when you regenerate your hosts file.

My personal opinion: in your initial testing of this, comment out the entries with a simple #, and once you are happy after a while, just delete them on your re-gen instead of filling up the hosts files with entries that are just taking up space.

@ScriptTiger I do intend to improve the stats on the front page even more with a "Date Last Tested" and "Date Last Updated", as I regularly re-test those lists and update them accordingly if any domains appear to become active again.

[Screenshot: CENTRAL REPO front-page stats, 2017-10-06]

ScriptTiger commented 6 years ago

I wasn't really thinking about global statistics, but that is pretty awesome and I'm sure it will definitely help @StevenBlack with his comparisons. I was thinking more about per-domain statistics stored in the whitelist file itself, alongside each domain in separate columns.

Tab delimited:

domain	last_detected	missed_count
lb.usemaxserver.de	10-06-2017	3

CSV:

domain,last_detected,missed_count
lb.usemaxserver.de,10-06-2017,3

Since the whitelist itself is not a hosts file, it doesn't need to be in hosts file format, and we can enhance it all we want, even with JSON formats. The above is just an example, similar to how Tor tracks exit nodes, although obviously our purpose is to track dead nodes, not live ones.
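
For illustration, a tiny reader for records shaped like the examples above; the column names follow the proposal, and nothing here is a finalized format:

```python
# Reads the proposed tab- or comma-delimited whitelist records.
import csv

def read_whitelist(path, delimiter=","):
    """Yield (domain, last_detected, missed_count) tuples."""
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            yield row["domain"], row["last_detected"], int(row["missed_count"])

# Usage: read_whitelist("dead.csv") or read_whitelist("dead.tsv", "\t"),
# where both file names are hypothetical.
```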

mitchellkrogza commented 6 years ago

Thanks @ScriptTiger. What I tend to do instead is re-test each file; so, for example, next week I will send the list of all dead .doubleclick. domains off to my other test repos for re-testing, and if anything changes, the actual file dead-domains-doubleclick.txt will be updated, so the next time any repo pulls it during a build it will once again only strip out what's truly dead.

It's going to be a lot of work, but I am in the process of building these tests into the CENTRAL REPO itself so I can trigger tests and then produce the results and last-updated/tested information. However, I'm currently having major issues with Travis CI and Git and the way cloning messes up all the file dates and times.

I'm sure we can produce a .csv type file for each test that's run on each list.

ScriptTiger commented 6 years ago

We'll only need that if @StevenBlack decides we need further false-positive prevention; for now we'll just see what he thinks of the comparison against the raw whitelist without added statistics. I know you are easily excited and will probably start programming a statistics package in R, but I am just looking out for what few sleeping hours you seem to keep.

StevenBlack commented 6 years ago

@ScriptTiger @mitchellkrogza I'm far from sold on the idea that this repo needs this refinement.

Firstly I trust curators to keep their lists clean.

Second, so much can happen server-side that we can't test for. I've seen and used many products, both native and extensions, that route requests in many different ways that the client side can never account for.

Third, with the exploding number of TLDs, all independently administered and run, we can never say any given TLD is playing by the rules.

Fourth, given the multitude of potential edge endpoints available through cloud providers, it's easy to imagine that testing domains from the USA or Canada is going to potentially give you different routings, including no routing at all, compared to testing from Europe, Africa or Asia.

Fifth, the performance benefit of filtering domains is minimal. There's even less benefit from commenting out domains, since that increases hosts file size.

Finally, there's the risk of type-II errors.

I'm just not seeing the benefits of doing this. So some host records are dead. So what? They could become un-dead with the flick (or config) of a switch.

I trust the curators who put them there in the first place. I trust that they'll determine when it's appropriate to cull their lists.

mitchellkrogza commented 6 years ago

@StevenBlack :+1: no problem, and I fully understand. I think it's a matter of @funilrys convincing people to fix their lists, as there truly are a lot of lists out there with a lot of invalid and dead hostnames.

ScriptTiger commented 6 years ago

Oh well. Good discussion though! Maybe @funilrys and @mitchellkrogza might be able to use some of these ideas internally for their own purposes.

ScriptTiger commented 6 years ago

While I did close this topic, @mitchellkrogza, @funilrys, because I have to agree with @StevenBlack's overwhelming counter, I will make one final suggestion as to how funceble might be expanded. As for reaching any given domain from any given route, the easiest way to eliminate the related false positives would be to send multiple tests from multiple origins. Tor allows you to specify any country code you want for an exit node, and this could easily be scripted for a cyclical check (see the sketch below). You could also look into the SoftEther VPN project. Although integrating with SoftEther would take a bit more coding, it would eliminate the false positives that come with using Tor exit nodes, since Tor exit nodes are public knowledge and are commonly blocked by various entities. But using either of the above services, you could cyclically check a random sample of countries, strategic countries from specific regions, or strategic countries related to global network routings.
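
A sketch of that multi-origin probe, assuming a local Tor client with a SOCKS listener on 127.0.0.1:9050 and the exit country pinned per pass in torrc (e.g. `ExitNodes {us}` with `StrictNodes 1`); cycling countries between passes is left out for brevity, and nothing here is funceble code:

```python
# Probe a domain through Tor and treat it as dead only if it fails from
# several exit countries. Requires the requests[socks] extra (PySocks).
import requests

PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def reachable_via_tor(domain, timeout=10.0):
    """One probe from whatever exit country the running Tor is pinned to."""
    try:
        response = requests.get(
            f"http://{domain}", proxies=PROXIES, timeout=timeout
        )
        return response.status_code < 500
    except requests.RequestException:
        return False
```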

funilrys commented 6 years ago

Hello @ScriptTiger, hello everybody else, and I wish you all a happy and successful new year.

First, sorry for the ping; I wanted to give an update here. I'm thinking about integrating the ideas you mentioned, but as you also said, it will take some time, which I don't currently have.

Meanwhile, as a start, I included in PyFunceble (yes, it's Funceble in Python) a feature which saves all non-ACTIVE domains into a file called inactive-db.json. This way, when we retest the same file path the next day, all domains listed in inactive-db.json are retested. I think this opens the door to a possible continuous test and cleaning of hosts files until I get my fingers dirty with the other ideas we discussed here (a sketch of the idea is below).
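
An illustrative sketch of that retest flow, assuming a per-file mapping of inactive domains; this is not PyFunceble's actual code or file schema:

```python
# Persist non-ACTIVE domains per tested file, then retest just those on
# the next run.
import json
import os

DB_PATH = "inactive-db.json"  # hypothetical location

def _load():
    if os.path.exists(DB_PATH):
        with open(DB_PATH, encoding="utf-8") as f:
            return json.load(f)
    return {}

def save_inactive(tested_file, inactive_domains):
    """Merge today's non-ACTIVE domains into the per-file record."""
    db = _load()
    known = set(db.get(tested_file, []))
    db[tested_file] = sorted(known | set(inactive_domains))
    with open(DB_PATH, "w", encoding="utf-8") as f:
        json.dump(db, f, indent=2)

def domains_to_retest(tested_file):
    """Domains to re-check on the next run of the same file."""
    return _load().get(tested_file, [])
```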

Please note that I have not abandoned Funceble; I just wanted to improve my Python skills for my personal and academic experience, so for now all my attention is on PyFunceble, which is already as usable as Funceble.

Cheers, Nissar.

P.S.: Feel free to leave me all your advice about my programming style or anything else you think about PyFunceble, here or in a new issue. Thanks in advance, and have a nice day/night!