Closed jawz101 closed 2 years ago
Hi @jawz101, here's a quick high-level look (see below).
This would add almost 20,000 domains to our base list, increasing its bulk by (20,673 - 872) / 108,831 = 18.2%.
This is a heavy cost considering our list tries to straddle the middle ground between too-small to be much good, and too-large for some applications like, incidentally, Microsoft Windows.
It's tempting though. How is the list curated, do you know?
$ ghosts -c https://raw.githubusercontent.com/jawz101/subdomain_blocklists/main/hosts.txt
----------------------------------------
Base hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts
Domains: 108,831
Bytes: 3.4 MB
----------------------------------------
Compared hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/jawz101/subdomain_blocklists/main/hosts.txt
Domains: 20,673
Bytes: 775 kB
Intersection: 872 domains
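The comparison above boils down to set arithmetic on two hosts files. Here is a minimal Python sketch of that idea; it is not how `ghosts` itself is implemented, and the function names `parse_hosts` and `compare`, as well as the list of skipped local names, are assumptions made for illustration.

```python
# Names that commonly appear in hosts files but are not blocklist entries
# (an assumed, non-exhaustive set).
LOCAL_NAMES = {"localhost", "localhost.localdomain", "local", "broadcasthost", "0.0.0.0"}


def parse_hosts(text):
    """Return the set of blocked domains found in hosts-format text."""
    domains = set()
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        parts = line.split()
        # Blocking entries look like "0.0.0.0 example.com" or "127.0.0.1 example.com".
        if len(parts) >= 2 and parts[0] in ("0.0.0.0", "127.0.0.1"):
            for name in parts[1:]:
                name = name.lower()
                if name not in LOCAL_NAMES:
                    domains.add(name)
    return domains


def compare(base, candidate):
    """Summarise what merging `candidate` into `base` would add."""
    added = candidate - base
    return {
        "intersection": len(base & candidate),
        "net_new": len(added),
        "growth_pct": round(100 * len(added) / len(base), 1),
    }
```

The growth figure is the number of net-new domains divided by the size of the base list, which is the same calculation as (compared - intersection) / base.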
It's something I just threw together based on familiar ad companies that use this sort of subdomain naming convention.
I based it on the DNS requests that Cisco reports its Umbrella/OpenDNS customers look up every day on Cisco's DNS product:
Cisco Umbrella DNS service https://umbrella.cisco.com/products/recursive-dns-services
public daily Top 1 Million list they provide http://s3-us-west-1.amazonaws.com/umbrella-static/index.html
Most of the source lists in the Unified blocklist are stale so I use these reports to occasionally clean up the Adaway list. Like if ad companies go out of business or shut down servers, there's no reason for the list to block it.
It's just an experiment for myself but I figured I'd mention it. It looks like the Unified list has grown by 40,000 over the past few months so I understand wanting to keep it smaller.
Closing the issue since I really only wanted to chat
I want to keep this open a bit longer @jawz101 so it stays on my radar.
I'm presently writing a tool to assess how hosts sources contribute to the Unified list because I'm considering abandoning stale sources. But first I want to systematically know, what do we lose? What's the overlap covered by the other components, net of the removal candidate? I'd also love to know the list of specific domain gains and losses from release to release. And tracking the size of components over time...
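The questions above (what do we lose if a source is dropped, and what changed between releases) can be sketched with plain sets. This is only an illustration of the idea, not the tool being described; `unique_contribution` and `release_diff` are hypothetical names, and each source is assumed to be already parsed into a set of domains.

```python
from collections import Counter


def unique_contribution(sources):
    """Map each source name to the domains that no other source provides.

    `sources` maps a source name to its set of domains. The result is what
    the unified list would lose if that one source were removed.
    """
    # Count how many sources carry each domain.
    counts = Counter(d for domains in sources.values() for d in domains)
    return {
        name: {d for d in domains if counts[d] == 1}
        for name, domains in sources.items()
    }


def release_diff(old, new):
    """Return the domains gained and lost between two releases."""
    return {"gained": new - old, "lost": old - new}
```

A source whose unique contribution is empty is fully covered by the other components, so removing it would not shrink the unified list at all.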
Side note: I compared the source lists for the current Steven Black Unified Hosts file in the data folders to the most recent Cisco Umbrella (OpenDNS) Top 1 Million DNS lookups for today. This is how I evaluate the Adaway list on a routine basis.
In other words, 99.84% of the 50k entries on the KADhosts list were not looked up yesterday by the millions of devices that use the Cisco Umbrella DNS product.
This does not factor in entries appearing on multiple lists; it is just one way to view them. I personally think a list can be < 20,000 entries and be effective.
LIST | NOT IN TOP 1 MILLION | IN TOP 1 MILLION | # OF ENTRIES | PERCENT IN TOP 1 MILLION |
---|---|---|---|---|
adaway | 512 | 6,526 | 7,038 | 92.73% |
Adguard-cname | 19,317 | 2,720 | 22,037 | 12.34% |
mvps | 7,096 | 1,633 | 8,729 | 18.71% |
yoyo | 2,413 | 1,263 | 3,676 | 34.36% |
someonewhocares | 9,138 | 1,237 | 10,375 | 11.92% |
tiuxo | 1,143 | 587 | 1,730 | 33.93% |
hostsVN | 1,354 | 438 | 1,792 | 24.44% |
StevenBlack | 1,700 | 424 | 2,124 | 19.96% |
add | 3,291 | 256 | 3,547 | 7.22% |
shady-hosts | 124 | 236 | 360 | 65.56% |
KADhosts | 50,127 | 81 | 50,208 | 0.16% |
Badd-Boyz-Hosts | 1,373 | 11 | 1,384 | 0.79% |
URLHaus | 1,159 | 5 | 1,164 | 0.43% |
minecraft-hosts | 4 | 2 | 6 | 33.33% |
UncheckyAds | 9 | 0 | 9 | 0.00% |
MetaMask | 1,071 | 0 | 1,071 | 0.00% |
TOTAL | 99,831 | 15,419 | 115,250 | 13.38% |
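Each row of the table is the result of intersecting one source list with the Top 1 Million list. A rough Python sketch of that calculation follows. It assumes the Top 1 Million download is a CSV of `rank,domain` rows; `load_top_list` and `top_list_share` are hypothetical helper names, not part of any tool mentioned in this thread.

```python
def load_top_list(csv_text):
    """Parse 'rank,domain' rows (assumed format) into a set of domains."""
    domains = set()
    for line in csv_text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Keep whatever follows the first comma; tolerate rows without a rank.
        domain = line.split(",", 1)[-1].strip().lower()
        if domain:
            domains.add(domain)
    return domains


def top_list_share(list_domains, top_domains):
    """Return the four numeric columns of one table row."""
    total = len(list_domains)
    in_top = len(list_domains & top_domains)
    percent = round(100 * in_top / total, 2) if total else 0.0
    return {
        "not_in_top": total - in_top,
        "in_top": in_top,
        "entries": total,
        "percent_in_top": percent,
    }
```

The percentage is the share of a list's entries that appear in the Top 1 Million, so a low value means most of that list was not among the most-queried names that day.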
That's very interesting @jawz101.
Admittedly the top 1-million is a US-centric, Cisco-specific thing.
It would be interesting to see a .TLD breakdown of the top 1-million, and compare it to KADHosts, since that's the one you mention.
$ ghosts --tld -m kadhosts
----------------------------------------
Base hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/PolishFiltersTeam/KADhosts/master/KADhosts.txt
Domains: 51,130
Bytes: 1.6 MB
TLD tally: (231 unique TLD)
com: 12,127
pl: 7,326
xyz: 5,822
net: 5,042
site: 4,227
info: 2,915
eu: 1,231
space: 1,070
app: 1,008
online: 645
shop: 517
top: 418
co: 416
org: 382
icu: 374
website: 352
me: 316
biz: 306
club: 296
click: 274
cyou: 246
pw: 238
bar: 235
live: 228
us: 226
ru: 226
rest: 208
work: 192
io: 162
store: 158
tk: 154
ml: 148
dev: 144
se: 140
pro: 122
cc: 114
fun: 110
in: 108
link: 104
tech: 102
buzz: 94
ga: 94
cf: 92
win: 90
ir: 86
pics: 84
cloud: 74
br: 72
gq: 66
life: 66
mom: 60
host: 60
de: 54
at: 52
casa: 52
sbs: 50
one: 46
uk: 46
nl: 42
it: 42
cn: 40
gd: 40
uno: 38
beauty: 36
sh: 36
digital: 30
ws: 28
ng: 28
today: 28
fr: 28
trade: 28
mobi: 26
world: 26
gift: 26
vn: 22
tv: 22
id: 20
fyi: 20
au: 20
cam: 20
su: 20
lol: 18
jp: 18
blog: 18
ua: 18
quest: 18
codes: 18
loan: 16
ca: 16
cl: 16
autos: 14
dk: 14
ltd: 14
art: 14
email: 14
sv: 14
tr: 12
il: 12
page: 12
cz: 12
es: 12
vip: 12
to: 10
pk: 10
auction: 10
stream: 10
monster: 10
care: 10
vu: 10
works: 10
my: 10
mx: 10
network: 8
cfd: 8
bond: 8
hu: 8
ro: 8
guru: 8
news: 8
best: 8
capital: 8
pt: 6
goog: 6
gr: 6
reviews: 6
ph: 6
software: 6
lu: 6
tw: 6
ovh: 6
cards: 6
bid: 6
ar: 6
ai: 4
ink: 4
be: 4
tn: 4
kim: 4
sk: 4
gg: 4
help: 4
group: 4
review: 4
pe: 4
tube: 4
za: 4
kr: 4
press: 4
design: 4
support: 4
ch: 4
business: 4
social: 4
exchange: 4
money: 4
date: 4
vc: 2
asia: 2
bz: 2
trading: 2
lv: 2
team: 2
exposed: 2
mk: 2
mr: 2
im: 2
name: 2
ae: 2
bg: 2
rodeo: 2
engineer: 2
photography: 2
solutions: 2
so: 2
by: 2
surf: 2
ms: 2
center: 2
cool: 2
mn: 2
miami: 2
ao: 2
wang: 2
credit: 2
rs: 2
plus: 2
rw: 2
qa: 2
delivery: 2
th: 2
fo: 2
fans: 2
bet: 2
property: 2
cm: 2
school: 2
ly: 2
ie: 2
sx: 2
video: 2
ci: 2
international: 2
mw: 2
pm: 2
ceo: 2
np: 2
vision: 2
fund: 2
academy: 2
global: 2
earth: 2
la: 2
ee: 2
md: 2
bj: 2
nz: 2
technology: 2
fi: 2
gifts: 2
energy: 2
kz: 2
lt: 2
si: 2
ps: 2
gt: 2
is: 2
coffee: 2
re: 2
wf: 2
studio: 2
inf: 1
----------------------------------------
I do not understand the significance of the TLD thing. How do you interpret it?
@jawz101 the TLD breakdown gives us a sense of global coverage.
Let's look at Adaway. That's a much different mix of TLDs. KADHosts provides much more coverage of Europe and Eastern Europe.
I like the TLD view because it's a different way to slice things.
It's hard to draw definitive conclusions about quality based on just TLDs.
I presume most independent malicious actors would certainly not be among the top million, and perhaps may have a propensity for small-country or otherwise exotic TLDs. That's just a guess.
$ ghosts --tld -m https://raw.githubusercontent.com/AdAway/adaway.github.io/master/hosts.txt
----------------------------------------
Base hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/AdAway/adaway.github.io/master/hosts.txt
Domains: 7,038
Bytes: 263 kB
TLD tally: (78 unique TLD)
com: 5,228
net: 875
io: 221
cn: 95
tv: 67
co: 66
jp: 56
vn: 51
org: 39
ru: 30
uk: 23
mobi: 19
st: 18
fi: 14
la: 13
cc: 13
me: 13
de: 12
ai: 11
kr: 9
info: 9
site: 8
pl: 8
xyz: 7
in: 7
asia: 7
eu: 7
gt: 7
us: 6
im: 6
it: 5
ca: 5
biz: 4
tr: 4
network: 4
br: 4
my: 3
world: 3
to: 3
zone: 3
ir: 3
link: 3
ly: 3
be: 3
am: 2
fr: 2
hk: 2
life: 2
sg: 2
ms: 2
tech: 2
ua: 2
ad: 2
works: 1
store: 1
lt: 1
delivery: 1
al: 1
app: 1
bid: 1
ki: 1
video: 1
fm: 1
gg: 1
rocks: 1
ph: 1
nl: 1
cloud: 1
tw: 1
su: 1
no: 1
systems: 1
es: 1
se: 1
at: 1
mx: 1
watch: 1
tk: 1
----------------------------------------
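The TLD tallies above can be approximated by counting the last dot-separated label of each domain. The following is a small Python sketch of that idea, not the `ghosts` implementation; `tld_tally` is a hypothetical name, and the input is assumed to be an iterable of already-parsed domain names.

```python
from collections import Counter


def tld_tally(domains):
    """Count domains by their final label, most common first."""
    tally = Counter(
        d.rstrip(".").rsplit(".", 1)[-1].lower()  # text after the last dot
        for d in domains
        if d.strip(".")
    )
    return tally.most_common()
```

Note that a single-label entry such as `inf` is counted as its own "TLD", which is one way invalid names surface in a tally like this.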
@jawz101 here's a ghosts report on the top 1-million against our default amalgamated list. Only 13,081 of the 999,295 top 1-million domains appear in our list, a 1.3% overlap.
I would say, based on this, the top 1-million domains list is heavily biased towards clean actors.
$ ghosts -c /Users/steve/Downloads/top-1m.txt
----------------------------------------
Base hosts file summary:
----------------------------------------
Location: https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts
Domains: 109,880
Bytes: 3.4 MB
----------------------------------------
Compared hosts file summary:
----------------------------------------
Location: /Users/steve/Downloads/top-1m.txt
Domains: 999,295
Bytes: 24 MB
Intersection: 13,081 domains
@jawz101 the full 1-million TLD breakdown is in this Gist: https://gist.github.com/StevenBlack/c08283f99a9c0d2042805e19076b971b
Here's the top few lines of the report. Yeah this appears very heavily biased to the USA.
Scroll to the bottom of that Gist. Some crazy and implausible TLDs in that list, shedding some doubt about its quality.
That kinda supports a basic premise: large lists are not curateable, so (in general) they aren't curated.
$ ghosts --tld -m /Users/steve/Downloads/top-1m.txt
----------------------------------------
Base hosts file summary:
----------------------------------------
Location: /Users/steve/Downloads/top-1m.txt
Domains: 999,295
Bytes: 24 MB
TLD tally: (1,181 unique TLD)
com: 604,807
net: 149,547
org: 30,184
io: 16,843
uk: 14,306
de: 10,086
cn: 8,119
ru: 7,364
co: 6,156
edu: 6,074
gov: 5,924
us: 5,917
br: 4,374
xyz: 4,061
me: 3,937
jp: 3,820
tv: 3,809
nl: 3,743
fr: 3,627
vn: 3,497
ca: 3,374
it: 3,273
internal: 3,255
cloud: 2,756
pl: 2,528
mx: 2,401
eu: 2,278
info: 2,099
...
The USA's top level domain is .us
.com is for commercial companies, regardless of country. Same with .net, .info, .biz, .io.
.org is generally used for non-profits, open source projects, & communities
@jawz101 lol 😆
And .gov? .edu? One-hundred percent USA.
That .gov and .edu are the same order of magnitude as .cn tells me this is VERY heavily USA biased.
https://www.statista.com/statistics/918403/number-of-universities-worldwide-by-country/
If you go to university in the U.S., a large chunk of the students are international. Like a lot. And with the U.S. being the 3rd largest country, I scale America's union of states to Europe's union of countries.
@jawz101 that's just not acceptable. I'm not gonna stand for that.
KADHosts is based in Poland. They're really strong on threats in that part of the world. HostsVN is based in Vietnam. They are really strong on threats based in that locality and surrounding area.
These are strengths, not weaknesses.
You can't gauge what we do here relative to a "Top 1-million" list from Cisco. That's nonsense, and I think the numbers clearly bear this out. I see zero evidence that comparisons to the "Top 1-million" list tell us anything.
Let's get real. Population of India: 1.38 billion (2020). Total number of .in domains in the "Top 1-million" list: 1,949, about the same as Canada, with 1/50th the population. The "Top 1-million" list is grossly US-centric and, arguably, it's bullshit.
I have no idea why you're making it into whatever this turned out to be so I'll bow out.
edit: I will say that it's silly you're acting like I have some American exceptionalism thing. A few years ago the Steven Black list was maybe 65,000 entries and now it's about twice as large. Back to the post, I'm just saying Cisco Umbrella (formerly and still OpenDNS) has peering partners such as Baidu, Alibaba, & British Telecom. Furthermore, I regard Bigdargon's Vietnamese list and AdGuard's lists (a Russian/Cyprus/multinational company) as very high quality as well. Just go back to my original post from earlier today. If you want to interpret that as something else then I respect that. I just think some of the old stuff that is otherwise dormant could be trimmed.
On 6/16/22, Steven Black wrote:
@jawz101 the full 1-million TLD breakdown is in this Gist: https://gist.github.com/StevenBlack/c08283f99a9c0d2042805e19076b971b
Here's the top few lines of the report. Yeah this appears very heavily biased to the USA.
Scroll to the bottom of that Gist. Some crazy and implausible TLDs in that list, shedding some doubt about its quality.
Take another look at the list description: "The popularity list contains our most queried domains based on passive DNS usage across our Umbrella global network of more than 100 Billion requests per day with 65 million unique active users, in more than 165 countries."
People request name lookups on crazy and implausible names, so you get crazy and implausible names in the list. See, for example, https://icannwiki.org/.home and then look at how many names ending with ".home" are in the list.
The same can be said for this hosts file.
These entries on the current StevenBlack Unified list have invalid TLDs:
0.0.0.0 fe
0.0.0.0 ff
0.0.0.0 inf
0.0.0.0 pgl.example
0.0.0.0 www.inf
0.0.0.0 castoola.tv.lan
... but to me, it says a lot that it was more common for someone to try to look up a .home name, often enough for it to show up on a top 1 million list, than to request some of the names on the StevenBlack list that do not show up on a top 1 million list. If that makes sense.
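Entries like the ones listed above can be flagged mechanically by checking each name's final label against a set of known TLDs. This is a hedged Python sketch: `invalid_tld_entries` is a hypothetical name, and `valid_tlds` is assumed to be supplied by the caller, for instance loaded from IANA's published list of top-level domains. The tiny set in the usage example is a stand-in, not the real list.

```python
def invalid_tld_entries(domains, valid_tlds):
    """Return the entries that are single-label or end in an unknown TLD."""
    flagged = []
    for name in domains:
        labels = name.lower().rstrip(".").split(".")
        # A bare label such as "fe" has no TLD at all; "www.inf" has an unknown one.
        if len(labels) < 2 or labels[-1] not in valid_tlds:
            flagged.append(name)
    return flagged


# Usage with a deliberately tiny, assumed TLD set:
entries = ["fe", "ff", "inf", "pgl.example", "www.inf", "castoola.tv.lan", "example.com"]
flagged = invalid_tld_entries(entries, valid_tlds={"com", "net", "tv"})
```

With that stand-in set, every entry except `example.com` is flagged, since `castoola.tv.lan` ends in `lan` rather than `tv`.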
There are several ad/tracking/marketing campaign companies that use businesscustomer.acmeadco.com style entries, which are difficult to maintain and clutter up lists. While many ad blockers are capable of wildcarding these sorts of domains, a hosts file list cannot.
Instead, this list takes the Cisco Umbrella Top 1 Million daily list and pulls out the most popular lookups for these domains.
https://github.com/jawz101/subdomain_blocklists
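Since a hosts file cannot wildcard, the approach described above amounts to filtering the Top 1 Million names for strict subdomains of a chosen set of parent domains. Below is a minimal Python sketch of that filter. It is an assumption about the general technique, not the repository's actual script; `subdomains_of` is a hypothetical name, and `acmeadco.com` is the placeholder from the description.

```python
def subdomains_of(top_domains, parents):
    """Keep names that are strict subdomains of any parent domain.

    Order is preserved, so if `top_domains` is sorted by popularity the
    result is too.
    """
    # Prefix each parent with a dot so "notacmeadco.com" does not match "acmeadco.com".
    suffixes = tuple("." + p.lower().strip(".") for p in parents)
    return [d for d in top_domains if d.lower().endswith(suffixes)]


def to_hosts_lines(domains):
    """Render domains as hosts-file blocking entries."""
    return ["0.0.0.0 " + d.lower() for d in domains]
```

Matching on a leading dot keeps the parent domain itself out of the result, which fits the goal of listing only the per-customer subdomains.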