StevenBlack / hosts

🔒 Consolidating and extending hosts files from several well-curated sources. Optionally pick extensions for porn, social media, and other categories.
MIT License

What do you think of a list like this? #1956

Closed jawz101 closed 2 years ago

jawz101 commented 2 years ago

There are several ad/tracking/marketing campaign companies that use businesscustomer.acmeadco.com-style entries, which are difficult to maintain and clutter up lists. While many ad blockers can wildcard these sorts of domains, a hosts file cannot.

Instead, this list takes the Cisco Umbrella Top 1 Million daily list and pulls out the most popular lookups for these domains.

https://github.com/jawz101/subdomain_blocklists
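A rough sketch of how such a list could be pulled from a ranked lookup file, assuming the file is a CSV of `rank,domain` rows. The parent domain `acmeadco.com` is the placeholder used above, and `extract_subdomains` is a hypothetical helper name, not code from the linked repository.

```python
import csv
import io

def extract_subdomains(ranked_csv, parents):
    """Return hosts-file lines for ranked domains that are subdomains of a parent.

    ranked_csv: text of a `rank,domain` CSV (assumed layout).
    parents: parent domains such as "acmeadco.com" (placeholder name).
    """
    suffixes = tuple("." + p.lower() for p in parents)
    lines = []
    for row in csv.reader(io.StringIO(ranked_csv)):
        if len(row) != 2:
            continue  # skip blank or malformed rows
        domain = row[1].strip().lower()
        # Matching on ".parent" catches customer.parent but not the bare
        # parent and not look-alikes such as "notacmeadco.com".
        if domain.endswith(suffixes):
            lines.append(f"0.0.0.0 {domain}")
    return lines

sample = """1,example.com
2,businesscustomer.acmeadco.com
3,notacmeadco.com
4,other.acmeadco.com
"""
print("\n".join(extract_subdomains(sample, ["acmeadco.com"])))
```

Because a hosts file has no wildcards, every matching subdomain has to be written out as its own line, which is why the result grows with the number of customers each ad company serves.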

StevenBlack commented 2 years ago

Hi @jawz101 a quick high-level look (see below).

This would add almost 20,000 domains to our base list, increasing its bulk by (20,673 - 872) / 108,831 ≈ 18.2%.
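For reference, the growth figure is just the candidate's domain count minus the overlap, divided by the base count. A minimal sketch using the counts from the report in this comment; `net_growth` is a hypothetical helper name.

```python
def net_growth(base_count, candidate_count, overlap):
    """Return (new domains, fractional growth) if the candidate list is merged in."""
    new_domains = candidate_count - overlap  # domains not already in the base list
    return new_domains, new_domains / base_count

# Counts taken from the ghosts comparison in this comment.
new_domains, growth = net_growth(base_count=108_831, candidate_count=20_673, overlap=872)
print(f"{new_domains:,} new domains, {growth:.1%} growth")
```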

This is a heavy cost considering our list tries to straddle the middle ground between too-small to be much good, and too-large for some applications like, incidentally, Microsoft Windows.

It's tempting though. How is the list curated, do you know?

$ ghosts -c https://raw.githubusercontent.com/jawz101/subdomain_blocklists/main/hosts.txt 
----------------------------------------
Base hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts
Domains: 108,831
Bytes: 3.4 MB
----------------------------------------
Compared hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/jawz101/subdomain_blocklists/main/hosts.txt
Domains: 20,673
Bytes: 775 kB
Intersection: 872 domains

jawz101 commented 2 years ago

It's something I just threw together based on familiar ad companies that use this sort of naming convention.

I based it on the DNS requests that Cisco reports its Umbrella/OpenDNS customers look up every day on Cisco's DNS product.

Cisco Umbrella DNS service https://umbrella.cisco.com/products/recursive-dns-services

public daily Top 1 Million list they provide http://s3-us-west-1.amazonaws.com/umbrella-static/index.html

Most of the source lists in the Unified blocklist are stale, so I use these reports to occasionally clean up the Adaway list. If ad companies go out of business or shut down servers, there's no reason for the list to keep blocking them.

It's just an experiment for myself, but I figured I'd mention it. It looks like the Unified list has grown by 40,000 over the past few months, so I understand wanting to keep it smaller.

Closing the issue since I really only wanted to chat.

StevenBlack commented 2 years ago

I want to keep this open a bit longer @jawz101 so it stays on my radar.

I'm presently writing a tool to assess how hosts sources contribute to the Unified list because I'm considering abandoning stale sources. But first I want to systematically know, what do we lose? What's the overlap covered by the other components, net of the removal candidate? I'd also love to know the list of specific domain gains and losses from release to release. And tracking the size of components over time...
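The questions above reduce to set operations on domains. A minimal sketch, assuming each source has already been parsed into a set of domain strings; the source names, domains, and function names here are made up for illustration and are not the tool being described.

```python
def unique_contribution(sources, candidate):
    """Split a candidate source into (lost, covered) if it were removed.

    lost: domains no other source provides.
    covered: domains at least one other source still provides.
    """
    others = set()
    for name, domains in sources.items():
        if name != candidate:
            others |= domains
    return sources[candidate] - others, sources[candidate] & others

def diff_releases(old, new):
    """Return (gained, dropped) domains between two releases."""
    return new - old, old - new

# Hypothetical sources, for illustration only.
sources = {
    "source-a": {"ads.example", "track.example", "old.example"},
    "source-b": {"ads.example", "metrics.example"},
    "source-c": {"track.example"},
}
lost, covered = unique_contribution(sources, "source-a")
print("lost if removed:", sorted(lost))
print("still covered:", sorted(covered))
```

Tracking component sizes over time would then be a matter of recording `len()` of each set per release.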

jawz101 commented 2 years ago

sidenote: I compared the source lists for the current Steven Black Unified Hosts file in the data folders to the most recent Cisco Umbrella (OpenDNS) Top 1 Million DNS lookups for today. This is how I evaluate the Adaway list on a routine basis.

In other words, 99.84% of the 50k entries on the KADhosts list were not looked up yesterday by the millions of devices that use the Cisco Umbrella DNS product.

This does not factor in entries appearing on multiple lists; it is just one way to view them. I personally think a list can be < 20,000 entries and be effective.

| List | Not in Top 1 Million | In Top 1 Million | # of entries | Percent in Top 1 Million |
| --- | --- | --- | --- | --- |
| adaway | 512 | 6,526 | 7,038 | 92.73% |
| Adguard-cname | 19,317 | 2,720 | 22,037 | 12.34% |
| mvps | 7,096 | 1,633 | 8,729 | 18.71% |
| yoyo | 2,413 | 1,263 | 3,676 | 34.36% |
| someonewhocares | 9,138 | 1,237 | 10,375 | 11.92% |
| tiuxo | 1,143 | 587 | 1,730 | 33.93% |
| hostsVN | 1,354 | 438 | 1,792 | 24.44% |
| StevenBlack | 1,700 | 424 | 2,124 | 19.96% |
| add | 3,291 | 256 | 3,547 | 7.22% |
| shady-hosts | 124 | 236 | 360 | 65.56% |
| KADhosts | 50,127 | 81 | 50,208 | 0.16% |
| Badd-Boyz-Hosts | 1,373 | 11 | 1,384 | 0.79% |
| URLHaus | 1,159 | 5 | 1,164 | 0.43% |
| minecraft-hosts | 4 | 2 | 6 | 33.33% |
| UncheckyAds | 9 | 0 | 9 | 0.00% |
| MetaMask | 1,071 | 0 | 1,071 | 0.00% |
| TOTAL | 99,831 | 15,419 | 115,250 | 13.38% |
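Each row of this comparison can be computed with a set intersection. A minimal sketch, assuming both the source list and the top-1M list have been loaded into sets of domain strings; `top_list_coverage` and the sample domains are hypothetical.

```python
def top_list_coverage(list_domains, top_domains):
    """Return (not in top list, in top list, total entries, percent in top list)."""
    in_top = len(list_domains & top_domains)
    total = len(list_domains)
    percent = 100 * in_top / total if total else 0.0
    return total - in_top, in_top, total, percent

# Made-up data for illustration only.
top_domains = {"ads.example.com", "cdn.example.net", "www.example.org"}
blocklist = {"ads.example.com", "dead.example.pl", "gone.example.xyz", "cdn.example.net"}

not_in, in_top, total, percent = top_list_coverage(blocklist, top_domains)
print(f"{not_in} {in_top} {total} {percent:.2f}%")
```

The percentage is the share of a list's entries that were seen in the top-1M lookups, so a low value means most entries were not queried often enough to rank, not necessarily that they are dead.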

StevenBlack commented 2 years ago

That's very interesting @jawz101.

Admittedly the top 1-million is a US-centric, Cisco-specific thing.

It would be interesting to see a TLD breakdown of the top 1-million and compare it to KADHosts, since that's the one you mention.

$ ghosts --tld -m kadhosts    

----------------------------------------
Base hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/PolishFiltersTeam/KADhosts/master/KADhosts.txt
Domains: 51,130
Bytes: 1.6 MB
TLD tally:  (231 unique TLD)
   com: 12,127
   pl: 7,326
   xyz: 5,822
   net: 5,042
   site: 4,227
   info: 2,915
   eu: 1,231
   space: 1,070
   app: 1,008
   online: 645
   shop: 517
   top: 418
   co: 416
   org: 382
   icu: 374
   website: 352
   me: 316
   biz: 306
   club: 296
   click: 274
   cyou: 246
   pw: 238
   bar: 235
   live: 228
   us: 226
   ru: 226
   rest: 208
   work: 192
   io: 162
   store: 158
   tk: 154
   ml: 148
   dev: 144
   se: 140
   pro: 122
   cc: 114
   fun: 110
   in: 108
   link: 104
   tech: 102
   buzz: 94
   ga: 94
   cf: 92
   win: 90
   ir: 86
   pics: 84
   cloud: 74
   br: 72
   gq: 66
   life: 66
   mom: 60
   host: 60
   de: 54
   at: 52
   casa: 52
   sbs: 50
   one: 46
   uk: 46
   nl: 42
   it: 42
   cn: 40
   gd: 40
   uno: 38
   beauty: 36
   sh: 36
   digital: 30
   ws: 28
   ng: 28
   today: 28
   fr: 28
   trade: 28
   mobi: 26
   world: 26
   gift: 26
   vn: 22
   tv: 22
   id: 20
   fyi: 20
   au: 20
   cam: 20
   su: 20
   lol: 18
   jp: 18
   blog: 18
   ua: 18
   quest: 18
   codes: 18
   loan: 16
   ca: 16
   cl: 16
   autos: 14
   dk: 14
   ltd: 14
   art: 14
   email: 14
   sv: 14
   tr: 12
   il: 12
   page: 12
   cz: 12
   es: 12
   vip: 12
   to: 10
   pk: 10
   auction: 10
   stream: 10
   monster: 10
   care: 10
   vu: 10
   works: 10
   my: 10
   mx: 10
   network: 8
   cfd: 8
   bond: 8
   hu: 8
   ro: 8
   guru: 8
   news: 8
   best: 8
   capital: 8
   pt: 6
   goog: 6
   gr: 6
   reviews: 6
   ph: 6
   software: 6
   lu: 6
   tw: 6
   ovh: 6
   cards: 6
   bid: 6
   ar: 6
   ai: 4
   ink: 4
   be: 4
   tn: 4
   kim: 4
   sk: 4
   gg: 4
   help: 4
   group: 4
   review: 4
   pe: 4
   tube: 4
   za: 4
   kr: 4
   press: 4
   design: 4
   support: 4
   ch: 4
   business: 4
   social: 4
   exchange: 4
   money: 4
   date: 4
   vc: 2
   asia: 2
   bz: 2
   trading: 2
   lv: 2
   team: 2
   exposed: 2
   mk: 2
   mr: 2
   im: 2
   name: 2
   ae: 2
   bg: 2
   rodeo: 2
   engineer: 2
   photography: 2
   solutions: 2
   so: 2
   by: 2
   surf: 2
   ms: 2
   center: 2
   cool: 2
   mn: 2
   miami: 2
   ao: 2
   wang: 2
   credit: 2
   rs: 2
   plus: 2
   rw: 2
   qa: 2
   delivery: 2
   th: 2
   fo: 2
   fans: 2
   bet: 2
   property: 2
   cm: 2
   school: 2
   ly: 2
   ie: 2
   sx: 2
   video: 2
   ci: 2
   international: 2
   mw: 2
   pm: 2
   ceo: 2
   np: 2
   vision: 2
   fund: 2
   academy: 2
   global: 2
   earth: 2
   la: 2
   ee: 2
   md: 2
   bj: 2
   nz: 2
   technology: 2
   fi: 2
   gifts: 2
   energy: 2
   kz: 2
   lt: 2
   si: 2
   ps: 2
   gt: 2
   is: 2
   coffee: 2
   re: 2
   wf: 2
   studio: 2
   inf: 1
----------------------------------------

jawz101 commented 2 years ago

I do not understand the significance of the TLD thing. How do you interpret it?

StevenBlack commented 2 years ago

@jawz101 the TLD breakdown gives us a sense of global coverage.

Let's look at Adaway. That's a much different mix of TLDs. KADHosts provides much more coverage of Europe and Eastern Europe.

I like the TLD view because it's a different way to slice things.
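For anyone who wants to reproduce this kind of breakdown without the tool, a TLD tally is a count of the last dot-separated label of each domain. This is a minimal sketch with made-up sample data, not the ghosts implementation; `tld_tally` is a hypothetical name.

```python
from collections import Counter

def tld_tally(hosts_text):
    """Count the last label of each domain in hosts-style text."""
    tally = Counter()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        # Hosts lines look like "0.0.0.0 domain"; plain lists have only "domain".
        domain = parts[1] if len(parts) > 1 else parts[0]
        tally[domain.rsplit(".", 1)[-1].lower()] += 1
    return tally

sample = """# sample data, for illustration only
0.0.0.0 ads.example.com
0.0.0.0 tracker.example.pl
0.0.0.0 cdn.example.com
"""
for tld, count in tld_tally(sample).most_common():
    print(f"{tld}: {count}")
```

Note that a single-label entry such as `localhost` is counted as its own "TLD" here, which is one way odd values can end up in a tally.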

It's hard to draw definitive conclusions about quality based on just TLDs.

I presume most independent malicious actors would not be among the top million, and may have a propensity for small-country or otherwise exotic TLDs. That's just a guess.

$ ghosts --tld -m https://raw.githubusercontent.com/AdAway/adaway.github.io/master/hosts.txt

----------------------------------------
Base hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/AdAway/adaway.github.io/master/hosts.txt
Domains: 7,038
Bytes: 263 kB
TLD tally:  (78 unique TLD)
   com: 5,228
   net: 875
   io: 221
   cn: 95
   tv: 67
   co: 66
   jp: 56
   vn: 51
   org: 39
   ru: 30
   uk: 23
   mobi: 19
   st: 18
   fi: 14
   la: 13
   cc: 13
   me: 13
   de: 12
   ai: 11
   kr: 9
   info: 9
   site: 8
   pl: 8
   xyz: 7
   in: 7
   asia: 7
   eu: 7
   gt: 7
   us: 6
   im: 6
   it: 5
   ca: 5
   biz: 4
   tr: 4
   network: 4
   br: 4
   my: 3
   world: 3
   to: 3
   zone: 3
   ir: 3
   link: 3
   ly: 3
   be: 3
   am: 2
   fr: 2
   hk: 2
   life: 2
   sg: 2
   ms: 2
   tech: 2
   ua: 2
   ad: 2
   works: 1
   store: 1
   lt: 1
   delivery: 1
   al: 1
   app: 1
   bid: 1
   ki: 1
   video: 1
   fm: 1
   gg: 1
   rocks: 1
   ph: 1
   nl: 1
   cloud: 1
   tw: 1
   su: 1
   no: 1
   systems: 1
   es: 1
   se: 1
   at: 1
   mx: 1
   watch: 1
   tk: 1
----------------------------------------

StevenBlack commented 2 years ago

@jawz101 here's a ghosts report on the top 1-million against our default amalgamated list. The overlap is 1.3% of the top 1-million (13,081 of 999,295 domains).

I would say, based on this, the top 1-million domains list is heavily biased towards clean actors.

$ ghosts -c /Users/steve/Downloads/top-1m.txt                                                                                                                                                                                                                                            

----------------------------------------
Base hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts
Domains: 109,880
Bytes: 3.4 MB
----------------------------------------
Compared hosts file summary:
----------------------------------------
Location: /Users/steve/Downloads/top-1m.txt
Domains: 999,295
Bytes: 24 MB
Intersection: 13,081 domains

StevenBlack commented 2 years ago

@jawz101 the full 1-million TLD breakdown is in this Gist: https://gist.github.com/StevenBlack/c08283f99a9c0d2042805e19076b971b

Here's the top few lines of the report. Yeah this appears very heavily biased to the USA.

Scroll to the bottom of that Gist. Some crazy and implausible TLDs in that list, shedding some doubt about its quality.

That kinda supports a basic premise: large lists are not curateable, so (in general) they aren't curated.

$ ghosts --tld -m /Users/steve/Downloads/top-1m.txt   

----------------------------------------
Base hosts file summary:
----------------------------------------
Location: /Users/steve/Downloads/top-1m.txt
Domains: 999,295
Bytes: 24 MB
TLD tally:  (1,181 unique TLD)
   com: 604,807
   net: 149,547
   org: 30,184
   io: 16,843
   uk: 14,306
   de: 10,086
   cn: 8,119
   ru: 7,364
   co: 6,156
   edu: 6,074
   gov: 5,924
   us: 5,917
   br: 4,374
   xyz: 4,061
   me: 3,937
   jp: 3,820
   tv: 3,809
   nl: 3,743
   fr: 3,627
   vn: 3,497
   ca: 3,374
   it: 3,273
   internal: 3,255
   cloud: 2,756
   pl: 2,528
   mx: 2,401
   eu: 2,278
   info: 2,099
...

jawz101 commented 2 years ago

The USA's top level domain is .us

.com is for commercial companies, regardless of country. Same with .net, .info, .biz, .io.

.org is generally used for non-profits, open source projects, & communities

StevenBlack commented 2 years ago

@jawz101 lol 😆

And .gov? .edu? One-hundred percent USA.

That .gov and .edu are the same order of magnitude as .cn tells me this is VERY heavily USA-biased.

jawz101 commented 2 years ago

https://www.statista.com/statistics/918403/number-of-universities-worldwide-by-country/

If you go to university in the U.S., a large chunk of students are international. Like a lot. And with the U.S. being the 3rd largest country, I scale America's union of states to Europe's union of countries.

StevenBlack commented 2 years ago

@jawz101 that's just not acceptable. I'm not gonna stand for that.

KADHosts is based in Poland. They're really strong on threats in that part of the world. HostsVN is based in Vietnam. They are really strong on threats based in that locality and surrounding area.

These are strengths, not weaknesses.

You can't gauge what we do here relative to a "Top 1-million" list from Cisco. That's nonsense, and I think the numbers clearly bear this out. I see zero evidence that comparisons to the "Top 1-million" list tell us anything.

Let's get real. Population of India: 1.38 billion (2020). Total number of .in domains in the "Top 1-million" list: 1,949, about the same as Canada, with 1/50th the population. The "Top 1-million" list is grossly US-centric and, arguably, it's bullshit.

jawz101 commented 2 years ago

I have no idea why you're making it into whatever this turned out to be so I'll bow out.

edit: I will say that it's silly you're acting like I have some American exceptionalism thing. A few years ago the Steven Black list was maybe 65,000 entries and now it's about twice as large. Back to the post, I'm just saying Cisco Umbrella (formerly and still OpenDNS) has peering partners such as Baidu, Alibaba, & British Telecom. Furthermore, I regard Bigdargon's Vietnamese list and AdGuard's lists (a Russian/Cyprus/multinational company) as very high quality as well. Just go back to my original post from earlier today. If you want to interpret that as something else then I respect that. I just think some lists could trim the old stuff that is otherwise dormant.

ler762 commented 2 years ago

On 6/16/22, Steven Black wrote:

@jawz101 the full 1-million TLD breakdown is in this Gist: https://gist.github.com/StevenBlack/c08283f99a9c0d2042805e19076b971b

Here's the top few lines of the report. Yeah this appears very heavily biased to the USA.

Scroll to the bottom of that Gist. Some crazy and implausible TLDs in that list, shedding some doubt about its quality.

Take another look at the list description: "The popularity list contains our most queried domains based on passive DNS usage across our Umbrella global network of more than 100 Billion requests per day with 65 million unique active users, in more than 165 countries."

People request name lookups on crazy and implausible names, so you get crazy and implausible names in the list. See, for example, https://icannwiki.org/.home and then look at how many names ending with ".home" are in the list.

jawz101 commented 2 years ago

The same can be said for this hosts file.

These entries on the current StevenBlack Unified list have invalid TLDs:

0.0.0.0 fe
0.0.0.0 ff
0.0.0.0 inf
0.0.0.0 pgl.example
0.0.0.0 www.inf
0.0.0.0 castoola.tv.lan
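Entries like these can be flagged mechanically by checking the last label against a set of known TLDs. A minimal sketch: `KNOWN_TLDS` below is a tiny illustrative subset, not the real registry, and a real check would load the full published TLD list instead. The function name is hypothetical.

```python
# Illustrative subset only; a real check would load the full TLD registry.
KNOWN_TLDS = {"com", "net", "org", "tv", "pl"}

def entries_with_unknown_tld(hosts_lines, known_tlds):
    """Return domains from "IP domain" lines whose last label is not a known TLD."""
    flagged = []
    for line in hosts_lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # not a simple "IP domain" entry
        domain = parts[1].lower()
        if domain.rsplit(".", 1)[-1] not in known_tlds:
            flagged.append(domain)
    return flagged

entries = [
    "0.0.0.0 fe",
    "0.0.0.0 ff",
    "0.0.0.0 inf",
    "0.0.0.0 pgl.example",
    "0.0.0.0 www.inf",
    "0.0.0.0 castoola.tv.lan",
    "0.0.0.0 ads.example.com",  # made-up valid entry for contrast
]
print(entries_with_unknown_tld(entries, KNOWN_TLDS))
```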

... but to me, it says a lot that it was more common for someone to look up a .home name, often enough for it to show up on a top 1 million list, than to request some of the entries on the StevenBlack list that do not show up on a top 1 million list. If that makes sense.