Open StevenBlack opened 2 years ago
Here are the contributions, over time, to this repo.
Updated: July 30, 2022
Wow! I like this 👍👍
Hi @StevenBlack, thanks a lot for continuing to work on the project and keeping track of issues such as this one.
The growing hosts file size is a potentially significant problem for Windows users, due to the known issues of the Windows DNS client with large hosts files (e.g. #411, #710, #2138).
I thought we could mitigate the issue by trying to resolve all the domains with the default DNS servers and then excluding the ones that are not resolved (assuming they don't exist anymore). I made a small change to your repo (here) and tested the idea.
It took quite a long time (more than 20 minutes, running in parallel), but it removed ~150k domains out of ~245k (before duplicate removal). The final unique entries numbered just 33k, and the loading time under Windows was back to reasonable values.
What do you think about this approach: excluding domains that no longer exist from the final hosts file, either by checking their existence at run time or by keeping track of them? Do you think it could be a reasonable mitigation for the increasing hosts file size? If so, I could work on polishing the code and submitting a pull request.
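For readers who want to picture the pruning idea, here is a minimal Python sketch. It is not the code from the fork mentioned above. The names `resolves` and `prune_unresolved` and the worker count are illustrative assumptions, and a failed lookup only reflects the state of DNS at the moment of the check.

```python
import socket
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def resolves(domain: str) -> bool:
    """Return True if the system's default resolver finds an address for domain."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except (OSError, UnicodeError):
        # socket.gaierror is an OSError; UnicodeError covers malformed names.
        return False


def prune_unresolved(
    domains: Iterable[str],
    is_alive: Callable[[str], bool] = resolves,  # hypothetical hook, eases testing
    workers: int = 32,  # assumed value; tune for your network
) -> List[str]:
    """Drop duplicates, then keep only the domains that is_alive accepts."""
    unique = list(dict.fromkeys(domains))  # de-duplicate while keeping order
    with ThreadPoolExecutor(max_workers=workers) as pool:
        alive = list(pool.map(is_alive, unique))  # lookups run in parallel
    return [domain for domain, ok in zip(unique, alive) if ok]
```

Passing the lookup in as `is_alive` keeps the filtering logic separate from the network, so it can be exercised with a fake resolver before being pointed at real DNS.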
Thank you for this Stefano @stefanopini.
I think about this a lot. Hosts file size concerns me greatly.
I understand what you're saying. Here are additional factors to consider.
Firstly, where possible, I prefer to leave list curation to list curators. I trust our list curators.
Secondly, we need to consider sleepers. Malware is commonly a 2-stage process:
So think about this: how would you implement a sleeper? What's the easiest possible way? One easy way is your sleeper regularly pings a domain that "doesn't exist", as you say. At some point in the future, you could create a DNS record and suddenly all the sleepers are able to phone home.
You see, some domains exist, some domains don't exist anymore, and some domains don't exist yet. Some are simply turned off in a variety of plausible ways.
You wanna broadly make this decision based on a snapshot of the present state of DNS? I certainly don't.
I'm not moved to compromise coverage because of MS Windows' shitty engineering. The hosts files here are NOT for Windows users, and won't be constrained by MS Windows. I reckon Windows users have 99 other problems, god help them.
ALL THIS SAID, I'm aware that I need to tighten the collection. I hope to soon make data-driven decisions about which lists to include, and which ones to drop, from various amalgamated lists.
Thanks Steven for taking the time to have a look and answer my comment!
I see your point and totally agree with you: we shouldn't do that by default, for the security reasons you mentioned.
I just want to clarify that my idea was to add an option, for people using this repository on Windows, to reduce the size of the hosts file on their machine, rather than applying it to the hosts file included in this repository. I'm happy to keep the option in my fork of the repo, possibly excluding malware-focused lists from the pruning, in case other people find it useful.
I will continue following this issue for future updates on the included lists and the overall hosts file size, thanks.
Heey,
I am quite surprised to find out that we track sizes of the following:
Though, we don't track the size of the extensions.
Is there a specific reason for this?
That's a good question Dennis @dennisvandehoef.
It's because it's expensive. Updating graphs for all the extensions involves calculating and plotting each one, thus increasing the friction of each release by about 15x.
When I first announced this I got mostly crickets back. So that's another factor.
I was mainly interested in the size of the unified list, which accounts for most of the variability of the extension sizes. Because the size of the social and gambling and fakenews extensions don't vary much. And the porn list is potentially infinite, or grows potentially much more, so not that interesting in its absolute value at any given time.
So I settled for what interests me personally, and I haven't heard a peep about this in 10 months.
When I first announced this I got mostly crickets back. So that's another factor.
Oh wow, when I started exploring this repository, I really loved looking at the graphs to see how it grew over time, and how well maintained the list is.
Because the size of the social and gambling and fakenews extensions don't vary much. And the porn list is potentially infinite
I think that for the porn list, it is only relevant to see that it keeps growing, even if we remove dead sources. New porn hosts are created every day. So a graph that represents this still shows that our hosts list tries to keep up.
As for gambling, I also use this German-oriented merged hosts list https://github.com/RPiList/specials, and they seem to have sources that find new gambling hosts on a regular basis. Also, our gambling source bigdragon seems to be actively maintained. Though according to the file ("Only include gambling domains (Vietnamese language)") it is limited to a certain language, and we might want to look at more sources, which would then at some point also make the graphs more interesting.
It's because it's expensive.
and
I'm sure someone will suggest I use Python 😄 — a pull request for that would be nice.
I want to play a bit more with Python anyway. Are you still open to a pull request for this? I will then also try to make it less expensive/time-consuming.
This is the beginning of something I've been playing with: tracking hosts file sizes
For example:
I've added the following files to the repo:
hosts_file_size_history.png: the graph above.
stats.nb: a Wolfram Mathematica notebook. I'm sure someone will suggest I use Python 😄. A pull request for that would be nice.
stats.out: the data file produced by...
stats.sh: the bash file that scours git history. Note there's a dependency on jq.
More to come but this is what I'll be working on for the next while.
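Since a Python version was invited, here is a minimal sketch of how scanning git history for hosts file sizes might start. It is not a port of stats.sh, whose internals are not shown in this thread. The function names, the root-level `hosts` path, and the `0.0.0.0` sink address are assumptions for illustration.

```python
import subprocess
from typing import Iterator, Tuple


def count_entries(hosts_text: str, sink_ip: str = "0.0.0.0") -> int:
    """Count blocking lines of the form '<sink_ip> <domain>', ignoring comments."""
    count = 0
    for line in hosts_text.splitlines():
        fields = line.split("#", 1)[0].split()  # strip comments, then tokenize
        # Skip the self-referential '0.0.0.0 0.0.0.0' line and non-sink entries.
        if len(fields) >= 2 and fields[0] == sink_ip and fields[1] != sink_ip:
            count += 1
    return count


def size_history(repo: str = ".", path: str = "hosts") -> Iterator[Tuple[str, int]]:
    """Yield (commit date, entry count) for each commit touching path, oldest first."""
    log = subprocess.run(
        ["git", "-C", repo, "log", "--reverse", "--format=%H %cI", "--", path],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    ).stdout
    for row in log.splitlines():
        sha, date = row.split(" ", 1)
        blob = subprocess.run(
            ["git", "-C", repo, "show", f"{sha}:{path}"],
            capture_output=True, encoding="utf-8", errors="replace",
        )
        if blob.returncode != 0:
            continue  # the file does not exist at this commit (e.g. it was deleted)
        yield date, count_entries(blob.stdout)
```

Running `for date, n in size_history("."): print(date, n)` inside a clone would print one row per commit. Each commit costs one `git show` call, so covering every extension as well would multiply the work accordingly.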