StevenBlack / hosts

🔒 Consolidating and extending hosts files from several well-curated sources. Optionally pick extensions for porn, social media, and other categories.
MIT License

Use nine hostnames per line instead of one #49

Closed lewisje closed 8 years ago

lewisje commented 8 years ago

If there are multiple hostnames on a line, the names after the first are treated as aliases for the first, which means the file takes less time to load; it also trims the file size by minimizing the number of occurrences of the redirect IP address.

Although even 24 hostnames per line works on Unix-like systems (though packing too many names onto a line brings its own problems), Windows ignores any hostnames after the first nine on a line, so nine per line is ideal: http://forum.hosts-file.net/viewtopic.php?p=16438&sid=3e0ec8605c66da5a6a4bdd1bb49b5fbb#p16438
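
To illustrate the layout difference (the hostnames below are made-up examples, not entries from the actual list):

```
# one hostname per line
0.0.0.0 ads.example.com
0.0.0.0 tracker.example.net
0.0.0.0 metrics.example.org

# nine hostnames per line: the redirect IP written once, followed by nine names
0.0.0.0 ads.example.com tracker.example.net metrics.example.org beacon.example.com pixel.example.net stats.example.org telemetry.example.com counter.example.net popup.example.org
```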

StevenBlack commented 8 years ago

Hi @lewisje

I've thought about this.

I occasionally find myself eyeballing various regions of the hosts file, for various reasons.

It seems much easier to scan a single column.

If we go to multiple hosts per line, I think I would keep it to 80-100 columns wide, or thereabouts, which would certainly impose a constraint of fewer than nine.

Know what interests me greatly? Metrics for the performance of host files as a function of orthogonal factors such as...

  • 0.0.0.0 vs 127.0.0.1
  • How file length (number of lines) affects load and parse performance.
  • The degree that multi-hosts per line helps, as seems reasonable to presume.

So far I've anecdotally seen few benefits, one way or another. The hosts file lookup appears to be sufficiently high in the latency stack that it's maybe not worth fretting about?

Either way, I'm curious to know.

lewisje commented 8 years ago

I think I should figure out how to measure this precisely, but I know that when I run ipconfig /displaydns on my Windows machine (to force-map the hostnames into the local DNS cache), it takes less time with multiple hostnames per line than with one, even if I suppress output (just printing the output often takes a lot of time with long-running commands).
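
For reference, a rough way to repeat that measurement from a Cygwin shell (a sketch only; timings vary by machine and cache state):

```bash
ipconfig /flushdns                      # clear the Windows DNS resolver cache first
time ipconfig /displaydns > /dev/null   # time the dump with the output suppressed
```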

I'm thinking this suggestion is more akin to delivering a minified JS file for wide-scale Web deployment while retaining a properly spaced-out JS file for development.

Gitoffthelawn commented 8 years ago

@StevenBlack wrote:

Know what interests me greatly? Metrics for the performance of host files as a function of orthogonal factors such as...

  • 0.0.0.0 vs 127.0.0.1
  • How file length (number of lines) affects load and parse performance.
  • The degree that multi-hosts per line helps, as seems reasonable to presume.

That will be extremely valuable information if anyone performs the testing. I'm amazed that detailed tests have not already been publicly documented. Cross-platform testing is essential, and will enhance the value of the data even further.

HansiHase commented 8 years ago

Hey guys, I ran some short tests. First of all, it's important to mention that I did NOT do anything statistically rigorous here. Just one try for every test case. No repetition - just a "let's see where this could possibly lead" thingy.


System

Router: Archer C7 v1
Router OS: OpenWrt BB
Router DNS: dnsmasq
The router holds the hosts file under test.

Client: Windows 7
Desktop software: Cygwin (Linux tools on Windows)

Connection: wired gigabit Ethernet


Test Case

  1. Router: flush DNS cache
  2. Windows: time nslookup $WEBSITE to get the response time (uncached)
  3. Windows: time nslookup $WEBSITE to get the response time (cached)

Remote DNS-Server is 85.214.20.141 (https://digitalcourage.de/support/zensurfreier-dns-server)
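
The client-side part of the procedure looks roughly like this (a sketch; $WEBSITE stands for the site being looked up, and the router cache was flushed beforehand):

```bash
WEBSITE=github.com
time nslookup "$WEBSITE" > /dev/null   # first lookup: uncached on the router
time nslookup "$WEBSITE" > /dev/null   # second lookup: served from dnsmasq's cache
```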


Results

I used a hosts file with 355,981 entries. This is a 0.0.0.0-only file - no ::1 entries.

S = single entry (one host per line) - size 11 MB
N = nine hosts per line - size 8.4 MB

Unblocked Sites

| Site | S uncached (s) | S cached (s) | N uncached (s) | N cached (s) |
| --- | --- | --- | --- | --- |
| github.com | 0.102 | 0.051 | 0.099 | 0.074 |
| openwrt.org | 0.095 | 0.054 | 0.105 | 0.055 |
| imgur.com | 0.094 | 0.054 | 0.083 | 0.054 |

Blocked Sites

| Site | S uncached (s) | S cached (s) | N uncached (s) | N cached (s) |
| --- | --- | --- | --- | --- |
| google-analytics.com | 0.059 | 0.059 | 0.060 | 0.057 |
| zzzha.com | 0.057 | 0.051 | 0.056 | 0.054 |

Note: For this case I added the ::1 entry for google-analytics and zzzha, so the AAAA request doesn't get forwarded.


Single-entry to nine-entries-per-line conversion - Bash script

I wrote a short script so you can try it yourself. It takes the input hosts file as its argument and writes the file hosts_nine.

#!/bin/bash
# Takes the input hosts file as its argument and writes hosts_nine (nine hosts per line).
echo "127.0.0.1 localhost" > hosts_nine
grep "^0" "$1" | sed "s/0\.0\.0\.0//g" | tr -d "\n" \
  | egrep -o '\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+' \
  | sed 's/^/0\.0\.0\.0 /g' >> hosts_nine

NOTE: There will be 0-8 entries missing from the generated file. With a base file of 300,000+ entries this is "okay" for testing purposes, I hope. This behaviour is a result of "let's not put too much time into this and live with the bias". The problem is the egrep expression: if the final group of hostnames in the file is not a full set of nine, it is discarded.
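
If someone wants the tail kept, an awk variant along these lines should avoid the drop (a sketch, not something I benchmarked):

```bash
#!/bin/bash
# Sketch: write nine hostnames per 0.0.0.0 line and flush any leftover group at the end.
echo "127.0.0.1 localhost" > hosts_nine
grep "^0\.0\.0\.0 " "$1" | awk '
  { names[++n] = $2 }
  n == 9 { line = "0.0.0.0"; for (i = 1; i <= n; i++) line = line " " names[i]; print line; n = 0 }
  END    { if (n) { line = "0.0.0.0"; for (i = 1; i <= n; i++) line = line " " names[i]; print line } }
' >> hosts_nine
```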

StevenBlack commented 8 years ago

Thank you @hd074, that's vastly interesting.

This seems to confirm what I've seen through informal observation: not much, if any, measurable benefit.

HansiHase commented 8 years ago

Next Thing: (127.0.0.1 + ::1) vs (0.0.0.0 + ::) and Filesize


Again: I did NOT do anything statistically rigorous here. Same setup as above.


Test Case 1: 127.0.0.1 vs 0.0.0.0

  1. Router: flush DNS cache
  2. Windows: time nslookup $WEBSITE to get the response time (pure DNS)
  3. Router: flush DNS cache
  4. Windows: time wget $WEBSITE to get the response time (request for the website or its content)

Since the last test showed that there's no real difference between cached and uncached entries for blocked hostnames, I did not test that separately this time.
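
Step 4 on the client looks roughly like this (a sketch; wget exits with an error for blocked names, which is fine since only the elapsed time matters here):

```bash
WEBSITE=google-analytics.com        # placeholder for the site being tested
time wget -q -O /dev/null "http://$WEBSITE/"
```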


Results

I used a hosts file with 712,131 entries.

L = localhost version (127.0.0.1 and ::1)
N = non-routable meta-addresses (0.0.0.0 and ::)

| Site | L dns (s) | L wget (s) | N dns (s) | N wget (s) |
| --- | --- | --- | --- | --- |
| google-analytics.com | 0.069 | 2.034 | 0.072 | 0.032 |
| zzzha.com | 0.073 | 2.033 | 0.074 | 0.029 |

Surprise, surprise: the DNS request itself does not differ - that's what we expected. But if we later use the returned addresses to request content and whatnot, the difference is pretty huge. We expected that too.
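
The same effect can be reproduced by pointing wget directly at the two blocking addresses (a sketch; the exact timings depend on the OS and on whether anything is listening on port 80 locally):

```bash
# compare a content request to each blocking address directly;
# expect a large, OS-dependent gap like the one in the table above
time wget -q -O /dev/null --timeout=5 http://0.0.0.0/
time wget -q -O /dev/null --timeout=5 http://127.0.0.1/
```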


Test Case 2: Filesize

I just compared the results from both tests (355,981 vs 712,131 entries)

NOTE: What I compared here is the following:

| File (total entries) | 0.0.0.0 entries | ::1 entries |
| --- | --- | --- |
| 355,981 | 355,979 | 2 |
| 712,131 | 356,066 | 356,065 |

The fact that the second file doesn't contain new "unique" entries (it's just all the 0.0.0.0 entries duplicated and moved to ::1) MAY have an impact on the results. The point is that I can't (and don't want to) look into dnsmasq.

Nonetheless, the results show the same behaviour as when I moved from a pure 0.0.0.0 hosts file with 25,000 entries to a pure 0.0.0.0 hosts file with 355,000+ entries some time ago.


Results

| Site | 355,981 entries (s) | 712,131 entries (s) |
| --- | --- | --- |
| google-analytics.com | 0.059 | 0.072 |
| zzzha.com | 0.057 | 0.074 |

The file size doubled, but the response time did not.

When I moved from a small file to an approximately ten times larger file some time ago, the response time increased from 0.032 to 0.050 (if I remember correctly). So the file size itself does not seem to have a very big impact on response time... if using dnsmasq.

StevenBlack commented 8 years ago

This is great!

Gitoffthelawn commented 8 years ago

@hd074 This is _fantastic_ data you are generating.

For completeness, is this 32-bit or 64-bit Win7? Is it Win7 or Win7 SP1? Also, which edition of Windows are you testing?

HansiHase commented 8 years ago

@StevenBlack thank you very much.

@Gitoffthelawn thanks to you, too. It's Windows 7 Professional 64-Bit, Service Pack 1.

Further relevant: ASUS P7P55D PRO motherboard, Intel Core i7 860 @ 2.8 GHz, no additional network adapter.

lewisje commented 8 years ago

I think that in your script, where you have /0.0.0.0/, you should escape the periods and have /0\.0\.0\.0/

HansiHase commented 8 years ago

@lewisje you're right. thank you. corrected it.

lewisje commented 8 years ago

I forgot another tiny thing: you could also match the start of the line and a space after 0.0.0.0, to be sure you don't strip out, say, subdomains like 0.0.0.0.example.net.
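
Concretely, that would make the substitution something like this (a sketch of the adjusted expression):

```bash
sed "s/^0\.0\.0\.0 //"   # anchored at line start, with a trailing space, so 0.0.0.0.example.net survives
```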

Gitoffthelawn commented 8 years ago

So is there a best methodology that can be adopted based on this dataset?

Gitoffthelawn commented 8 years ago

See also https://github.com/StevenBlack/hosts/issues/47 for more related discussion.

sierkb commented 8 years ago

Regarding OS X, see also the Open Radar bug "Long /etc/hosts entries lead to unbearably slow resolution" (rdar://24237290) and the response of an Apple engineer.

lewisje commented 8 years ago

I guess that means that nine hostnames per line is a best practice for both Windows and Mac.

HansiHase commented 8 years ago

It means that a 9-hosts-per-line file performs better than a >9-hosts-per-line file (on a Mac).

I don't really see the advantage of the nine-hosts-per-line method (vs a single entry per line). The only thing that comes to mind is lower memory usage, but I think nowadays memory isn't a thing to worry about (edit: regarding this project).

My concerns with this method are readability and maintainability. This is why I'm personally skeptical that it really is best practice.

lewisje commented 8 years ago

The way I understood it, Windows doesn't read hostnames after the ninth on a line, so the maximum for that platform is nine per line. I had remembered that OS X could read 24 per line (I never tested higher) but bogged down; I just wasn't aware that 10 was the tipping point (and 9 is still within the safe zone for a Mac).

memory isn't a thing to worry about

never true.

With that said, it definitely is easier to maintain a list of hostnames with one per line and then output a nine-per-line version for deployment.

HansiHase commented 8 years ago

@Gitoffthelawn

So is there a best methodology that can be adopted based on this dataset?

| What | Why/When | But... |
| --- | --- | --- |
| 0.0.0.0 | always (because of the timeout) | compatibility |
| large # of entries | no (big) influence | (possibly) system-dependent |
| 9 entries per line | smaller file size | readability/maintainability |
| 1 entry per line | readability/maintainability | file size |
| caching | yes? faster lookup of non-blocked sites | no influence on speed with blocked sites |

HansiHase commented 8 years ago

@lewisje Maybe I got you wrong. If we choose to use multiple entries per line then 9 hosts is the way to go. I agree with that.

I thought "9 entries is best practice" was referring to the whole "1 entry vs 9 entries vs X entries"-problem. In this case I did and do not agree.

matkoniecz commented 8 years ago

Given that the only benefit of this proposed readability decrease is a reduction in file size, it doesn't seem worth it. Even on mobile devices the change in file size is not significant.

StevenBlack commented 8 years ago

So closing this now.

RoelVdP commented 5 years ago

Are there any dnsmasq settings that would load the full hosts file into memory and thereby make everything quicker? Or is that the default?

dnmTX commented 5 years ago

@RoelVdP dnsmasq caches the hosts file(s) in memory by default, and it's by far the fastest DNS resolver. If there are any slowdowns on your end, you need to look for the problem elsewhere.

RoelVdP commented 5 years ago

@dnmTX thanks mate. Any way to check that it is effectively loaded into memory when the file is rather large? Also, any way to make the cache larger? Thank you, much appreciated.

dnmTX commented 5 years ago

Any way to check it is effectively loaded in memory when the file is rather large?

@RoelVdP there is not really an easy way to check this, as everything cached in memory lives in some hidden files, but I can assure you that this is the case. Dnsmasq is designed to work from memory, and that is why it's so fast. Along with the given hosts file(s) it caches every response as well, so to check how effective it is, simply do time nslookup domain.com and you'll see. Here's an example from my router (screenshot).
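
In other words (domain.com is just a placeholder for any name you want to test):

```bash
time nslookup domain.com   # first query: resolved and cached by dnsmasq
time nslookup domain.com   # repeat: should come back noticeably faster from the cache
```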

Also, any way to make any cach(ing) larger?

Now, you need to clarify how you are blocking those domains. There are two options: one is through the .config file, for example server=/domain.com/0.0.0.0 and so on, and the other is through hosts file(s), with an entry added in the .config file to point to them: addn-hosts=/dir/to/your/file/hosts.

The first option has some limitations on how many entries dnsmasq can cache and whatnot, so it's not really recommended, even though many repos here that offer hosts files present that option. The second option is the one to go with. The developer noted that dnsmasq was tested successfully with one million entries; for such a big file it just requires at least a 1 GHz CPU or faster.

So to answer your question: caching is plenty, unless you tell me that your hosts file(s) contain more than a million entries. And no, there is no way to expand it, as it is in the kernel.
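
As a sketch of the second option (the path is the same placeholder used above), the dnsmasq side is a single configuration line:

```
# in dnsmasq.conf (or an included conf file); restart dnsmasq afterwards,
# e.g. /etc/init.d/dnsmasq restart on OpenWrt
addn-hosts=/dir/to/your/file/hosts
```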

RoelVdP commented 5 years ago

@dnmTX Thank you very much for the detailed reply. Excellent idea with the nslookup. Tried that, and the results are about 0.5 seconds for first lookups. So, I am not using any special config in dnsmasq but rather a large /etc/hosts file (with 722k entries) which dnsmasq then uses 'indirectly' (see https://github.com/RoelVdP/MoralDNS). I wonder now if addn-hosts in .config can be pointed to the /etc/hosts file and whether this would cache it (perhaps it was not caching and the OS was the limiting factor). I am starting to understand why pages are loading slowly - if there are many lookups, then many * 0.5 seconds = a long delay. Thank you again. Let me know if you have any other thoughts.

dnmTX commented 5 years ago

I wonder now if addn-hosts in .config can be pointed to the /etc/hosts file and if this would cache it...

@RoelVdP I'm really not sure what you mean by that. As long as you point dnsmasq to the file, it will read it and cache it. The easiest way to check is from the system log (syslogd). If it's disabled on your end, enable it, restart dnsmasq (or your system) and check the logs. Here's another example for you (screenshot). I would not recommend overriding or appending to /etc/hosts, as in some instances after a restart that same hosts file will revert to its previous state and all those blocked domains will be gone. It's always better to add your list as an addition, stored where it can't be deleted by a restart or a sudden shutdown.

Let me know if you have any other thoughts.

Yeah, a bunch. I went briefly through your script, and you can make some improvements to lower the size (number of entries) and make it more responsive:

First: Get rid of this one: wget -Oc http://sysctl.org/cameleon/hosts. It has been abandoned by its maintainer since 2017; if you weed out the duplicates and all the dead domains you'll end up with probably 5,000+ out of what, 23,000+ (not worth it).

Second: Check for empty lines, leftover comments etc., especially in StevenBlack's lists. Use sed '/^#/d; s/ #.*//g; s/ #.*//g; /#/d; /^\s*$/d' a > tmp in that order.

Third: Duplicates. There are a lot. If you manage to get rid of them you'll probably shrink your file to half. sed will not cut it there; use awk, or even better gawk, for that task, as it is blazing fast. Compare each file to StevenBlack's before you merge it. This is your command: gawk 'NR==FNR{a[$0];next}!($0 in a)' stevenblack the-other-file > no-duplicates-file, then mv no-duplicates-file the-other-file <- this is optional. Do this for each one, then merge them all together. But first do the cleanup (comments and whatnot) and add the zeroes - IMPORTANT!!!

Still, you are loading too many lists; some of them are really not needed as they're based on others you are already using (especially EasyList and EasyPrivacy, in my opinion), so some delay is to be expected.
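
Putting those steps together for a single list (a sketch only; stevenblack and otherlist are placeholder file names, both assumed to already carry the 0.0.0.0 prefix as stressed above, and the sed patterns may need adjusting per list):

```bash
#!/bin/bash
# Clean one downloaded list and drop entries already present in the StevenBlack file.
sed '/^#/d; s/ #.*//g; /^\s*$/d' otherlist > otherlist.clean            # strip comments and blank lines
gawk 'NR==FNR{a[$0];next}!($0 in a)' stevenblack otherlist.clean > otherlist.unique
cat stevenblack otherlist.unique > merged-hosts                         # merge once every list is cleaned
```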

I am starting to understand why pages are loading slow - if there are many lookups then many * 0.5 seconds = long delay.

You do realize what 0.05s out of 1 (one) second is, right? You've got that completely wrong. It can't go any faster than that, bud. There are no lookups there: the file is cached in memory = memory is fast = there is one lookup, or let's say ten (when opening some page) = and there is a comparison against all the entries in the cached file, which equals 0.05s each, or 0.50s combined. How is that not fast?

# With thanks, MalwareDomains list
wget -Ob https://mirror1.malwaredomains.com/files/justdomains
grep -vE "^#|^$" b | sed "s|^|0.0.0.0 |" > tmp

I just looked at it and it's wrong. This list does not come with any comments or empty lines, and when I tried the command it was soooo slow. So for this one (only), just use sed 's/^/0.0.0.0 /g' b > tmp. Also, grep is not your friend here; sed can do all those tasks on its own (research the commands). For sed, double quotes are not needed (use single quotes instead), and neither are the straight brackets (use / instead). You'd better inspect each file again and reconfigure your commands.

Another TIP: Some lists come with a bunch of comments at the top and that's it, the rest is only domain entries. In this case (after confirmation, aka visual inspection) use: sed '1,8d' b > tmp (adjust the numbers to your needs). This deletes lines one through eight, and it's ten times faster than sed '/^#/d' b > tmp.

dnmTX commented 5 years ago

@RoelVdP this will be my last post here, as we really went OFF TOPIC on this one and I know... some... are not happy about it. So good luck, and I hope what I posted above helps to make your project better. 👍