StevenBlack / hosts

🔒 Consolidating and extending hosts files from several well-curated sources. Optionally pick extensions for porn, social media, and other categories.
MIT License
26.89k stars 2.24k forks

TAKE NOTE: Starting to actively track hosts file sizes #2014

Open StevenBlack opened 2 years ago

StevenBlack commented 2 years ago

This is the beginning of something I've been playing with: tracking hosts file sizes.

For example:

Size history

I've added the following files to the repo:

More to come but this is what I'll be working on for the next while.
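As a rough illustration of what "hosts file size" could mean in practice, here is a minimal sketch that counts the unique blocked domains in a hosts file. It is not the script used in this repo; the function name `count_blocked_domains`, the set of sink IPs, and the set of ignored local names are assumptions made for the example.

```python
from typing import Iterable

# Assumption: blocklist entries point at one of these sink addresses.
BLOCK_IPS = {"0.0.0.0", "127.0.0.1"}
# Assumption: names commonly present in hosts files that are not blocklist entries.
LOCAL_NAMES = {"localhost", "localhost.localdomain", "local", "broadcasthost", "0.0.0.0"}


def count_blocked_domains(lines: Iterable[str]) -> int:
    """Count unique blocked domains in the lines of a hosts file (hypothetical helper)."""
    domains = set()
    for line in lines:
        # Drop trailing comments, then split into "<ip> <name> [<name> ...]".
        entry = line.split("#", 1)[0].split()
        if len(entry) < 2 or entry[0] not in BLOCK_IPS:
            continue
        for name in entry[1:]:
            name = name.lower()
            if name not in LOCAL_NAMES:
                domains.add(name)
    return len(domains)
```

Since a file object iterates over its lines, it can be passed directly, e.g. `count_blocked_domains(open("hosts", encoding="utf-8"))`. Recording that number per release would give one data point of a size history.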

StevenBlack commented 2 years ago

Contribution History of Source Host Lists

Here are the contributions, over time, to this repo.

Updated: July 30, 2022

Chart legend (source lists): adaway.org, Adguard-cname, Badd-Boyz-Hosts, hostsVN, KADhosts, mvps.org, someonewhocares.org, StevenBlack, URLHaus, yoyo.org, porn-brijrajparmar27, porn-clefspeare13, porn-sinfonietta-snuff, porn-sinfonietta, porn-tiuxo, shady-hosts, social-sinfonietta, social-tiuxo, tiuxo, UncheckyAds, add.2o7Net, add.Dead, add.Risk, add.Spam, fakenews, gambling

bigdargon commented 2 years ago

Wow! I like this 👍👍

stefanopini commented 1 year ago

Hi @StevenBlack, thanks a lot for continuing to work on the project and for keeping track of issues such as this one.

The growing hosts file size is a potentially significant problem for Windows users, due to the known issues of the Windows DNS client with large hosts files (e.g. #411, #710, #2138).

I thought we could mitigate the issue by trying to resolve all the domains with the default DNS servers and then excluding the ones that are not resolved (assuming they don't exist anymore). I made a small change to your repo (here) and tested the idea.

It took quite a long time (more than 20 minutes, running in parallel), but it removed ~150k domains out of ~245k (before deduplication). The final unique entries were just 33k, and the loading time under Windows was back to reasonable values.
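A minimal sketch of the idea, not the actual change in the fork: resolve every domain in parallel with the system resolver and keep only those that return an address. The names `resolves` and `prune_unresolved`, and the worker count, are assumptions for illustration.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def resolves(domain: str) -> bool:
    """Return True if the system resolver returns any address for the domain."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except (socket.gaierror, UnicodeError):
        # Note: temporary resolver failures also end up here, so a flaky
        # network would make live domains look dead.
        return False


def prune_unresolved(
    domains: Iterable[str],
    is_resolvable: Callable[[str], bool] = resolves,
    max_workers: int = 64,  # assumption: tune for your resolver and network
) -> List[str]:
    """Deduplicate domains (keeping first-seen order) and drop unresolved ones."""
    unique = list(dict.fromkeys(d.strip().lower() for d in domains if d.strip()))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(is_resolvable, unique))
    return [domain for domain, ok in zip(unique, results) if ok]
```

Passing `is_resolvable` as a parameter keeps the pruning logic testable without network access. The result is only a snapshot of DNS at the moment the check runs.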

What do you think about this approach: excluding domains that no longer exist from the final hosts file, either by checking their existence at runtime or by keeping track of them? Do you think it could be a reasonable mitigation for the growing hosts file size? If so, I could work on polishing the code and submitting a pull request.

StevenBlack commented 1 year ago

Thank you for this Stefano @stefanopini.

I think about this a lot. Hosts file size concerns me greatly.

I understand what you're saying, but here are additional factors to consider.

Firstly, where possible, I prefer to leave list curation to list curators. I trust our list curators.

Secondly, we need to consider sleepers. Malware is commonly a 2-stage process:

So think about this: how would you implement a sleeper? What's the easiest possible way? One easy way is your sleeper regularly pings a domain that "doesn't exist", as you say. At some point in the future, you could create a DNS record and suddenly all the sleepers are able to phone home.

You see, some domains exist, some domains don't exist anymore, and some domains don't exist yet. Some are simply turned off in a variety of plausible ways.

You wanna broadly make this decision based on a snapshot of the present state of DNS? I certainly don't.

I'm not moved to compromise coverage because of MS Windows' shitty engineering. The hosts files here are NOT for Windows users, and won't be constrained by MS Windows. I reckon Windows users have 99 other problems, god help them.

ALL THIS SAID, I'm aware that I need to tighten the collection. I hope to soon make data-driven decisions about which lists to include, and which ones to drop, from the various amalgamated lists.

stefanopini commented 1 year ago

Thanks Steven for taking the time to have a look and answer my comment!

I see your point and totally agree with you: we shouldn't do that by default, for the security reasons you mentioned.

I just want to clarify that my idea was to add an option, for people using this repository on Windows, to reduce the size of the hosts file on their machine, rather than applying it to the hosts file included in this repository. I'm happy to keep the option in my fork of the repo, possibly excluding malware-focused lists from the pruning, in case other people find it useful.
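One way the opt-in pruning with protected lists might look, as a sketch only: domains from sources marked as protected are always kept, and only the rest are subject to the existence check. The function `prune_by_source` is hypothetical, and which sources count as malware-focused is an assumption the user would have to decide.

```python
from typing import Callable, Dict, Iterable, Set


def prune_by_source(
    domains_by_source: Dict[str, Iterable[str]],
    is_resolvable: Callable[[str], bool],
    protected_sources: Set[str],
) -> Set[str]:
    """Keep every domain from protected sources; prune the others by resolvability."""
    kept: Set[str] = set()
    for source, domains in domains_by_source.items():
        for domain in domains:
            # Protected lists bypass the DNS check entirely, so a domain
            # that does not resolve yet is still blocked.
            if source in protected_sources or is_resolvable(domain):
                kept.add(domain)
    return kept
```

For example, calling it with `protected_sources={"URLHaus"}` (an illustrative choice) would leave that list untouched while still shrinking the others.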

I will continue following this issue for future updates on the included lists and the overall hosts file size, thanks.

dennisvandehoef commented 1 year ago

Hey,

I am quite surprised to find out that we track sizes of the following:

Though, we don't track the size of the extensions.

Is there a specific reason for this?

StevenBlack commented 1 year ago

That's a good question Dennis @dennisvandehoef.

It's because it's expensive. Updating graphs for all the extensions involves calculating and plotting each one, thus increasing the friction of each release by about 15x.

When I first announced this I got mostly crickets back. So that's another factor.

I was mainly interested in the size of the unified list, which accounts for most of the variability of the extension sizes. Because the size of the social and gambling and fakenews extensions don't vary much. And the porn list is potentially infinite, or at least could grow much more, so its absolute value at any given time is not that interesting.

So I settled for what interests me personally, and I haven't heard a peep about this in 10 months.

dennisvandehoef commented 1 year ago

> When I first announced this I got mostly crickets back. So that's another factor.

Oh wow, when I started exploring this repository, I really loved looking at the graphs to see how it grew over time, and how well maintained the list is.

> Because the size of the social and gambling and fakenews extensions don't vary much. And the porn list is potentially infinite

I think that for the porn list, it is only relevant to see that it keeps growing, even if we remove dead sources. New porn hosts are created every day. So a graph that represents this still shows that our hosts list tries to keep up.

As for gambling, I also use this German-oriented merged hosts list, https://github.com/RPiList/specials, and they seem to have sources that find new gambling hosts on a regular basis. Also, our gambling source, bigdargon, seems to be actively maintained. Though according to the file ("Only include gambling domains (Vietnamese language)"), it is limited to a certain language, and we might want to look at more sources, which at some point would also make the graphs more interesting.

> It's because it's expensive.

and

> I'm sure someone will suggest I use Python 😄 - a pull request for that would be nice.

I want to play a bit more with Python anyway. Are you still open to a pull request for this? I would then also try to make it less expensive and time-consuming.