duckduckgo / tracker-radar

Data set of top third party web domains with rich metadata about them

hosts file with all domains included #2

Closed beerisgood closed 4 years ago

beerisgood commented 4 years ago

Is it possible for you to create a hosts file that includes all domains? Then we could use it with e.g. Pi-hole.

Would be awesome!

LorenzoAncora commented 4 years ago

Is it possible for you to create a hosts file that includes all domains?

What would be the impact on system performance?

beerisgood commented 4 years ago

Is it possible for you to create a hosts file that includes all domains?

What would be the impact on system performance?

On Pi-hole? None. My old Raspberry Pi 2B blocks 1.8 million domains, and CPU is at 0.x% with memory at ~27%.

thefaj commented 4 years ago

My thoughts exactly—this would be awesome for pi-hole users!

LorenzoAncora commented 4 years ago

My old Raspberry Pi 2B blocks 1.8 million domains, and CPU is at 0.x% with memory at ~27%.

Have you measured kernel overhead? Have you done some stress tests to confirm what you say?

beerisgood commented 4 years ago

Have you measured kernel overhead?

Don't know what you mean

Have you done some stress tests to confirm what you say?

Sure. I have been using Pi-hole for many years now.

jdorweiler commented 4 years ago

I wouldn't include it here, but anyone is welcome to use this data to build their own hosts file. It might require some testing and consideration for what you want to include. My guess is taking all 5k+ domains and turning them into a hosts file would cause some breakage.
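For anyone who wants to try this, a minimal sketch of the "domains to hosts file" step could look like the following. The function name `to_hosts_lines` and the choice of `0.0.0.0` as the sink address are assumptions for illustration; nothing like this ships with the repo, and the breakage concern above still applies to whatever domains you feed in.

```python
def to_hosts_lines(domains, sink="0.0.0.0"):
    """Return hosts-file lines mapping each unique domain to the sink address.

    `sink` is an assumed default; 127.0.0.1 is another common choice.
    """
    # Normalize case and whitespace, drop blanks, and de-duplicate.
    unique = sorted({d.strip().lower() for d in domains if d.strip()})
    return [f"{sink} {d}" for d in unique]


if __name__ == "__main__":
    # Placeholder domains, not taken from the data set.
    for line in to_hosts_lines(["Example.com", "tracker.example", "example.com"]):
        print(line)
```

The output is one `address hostname` pair per line, which is the layout a hosts file uses.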

Michelenzoo commented 4 years ago

Hi fellow Pi-hole users, I also wanted a hostlist for Pi-hole, so I have just generated this list: https://gitlab.com/michelt/ddg-tracker-radar-hostfile/-/raw/master/hostlist.txt. I used the Domains folder to get all the domains, and I assumed that the domains in the Entities folder were also in the Domains folder. Am I correct, @jdorweiler?

jdorweiler commented 4 years ago

That works, but watch out for some of the domain categories. Including everything might cause a lot of breakage: https://github.com/duckduckgo/tracker-radar/blob/master/domains/cloudflare.com.json#L5998

Michelenzoo commented 4 years ago

Hm, you are right, my list is a little shortsighted. I think I will filter all the domains used for CDNs out of the list by default and make a separate list with all domains (including CDNs), for the folks who would rather whitelist than blacklist.

Michelenzoo commented 4 years ago

List without domains classified as CDN and Online Payment: https://gitlab.com/michelt/ddg-tracker-radar-hostfile/-/raw/master/hostlist.txt

List with all the domains: https://gitlab.com/michelt/ddg-tracker-radar-hostfile/-/raw/master/hostlist-full.txt

rjhancock commented 4 years ago

Going to have to experiment with this so ads/trackers can be added to the HOSTS generated at https://github.com/StevenBlack/hosts as well as a LittleSnitch subscription set.

turtle2472 commented 4 years ago

I would love to see the list done by DuckDuckGo directly for my pi-hole as well. I'm sure at this point I might have most of them. @Michelenzoo I looked through your list and would love to see it sorted alphabetically.

Michelenzoo commented 4 years ago

@turtle2472 Good one. They are now sorted.

sebrk commented 4 years ago

https://gitlab.com/michelt/ddg-tracker-radar-hostfile/-/raw/master/hostlist.txt seems to block stuff it shouldn't. Ironically it blocked duckduckgo.com for me and even logging into GitHub (github.com/login).

beerisgood commented 4 years ago

Going to have to experiment with this so ads/trackers can be added to the HOSTS generated at https://github.com/StevenBlack/hosts as well as a LittleSnitch subscription set.

What does StevenBlack's list have to do with your own?

Michelenzoo commented 4 years ago

https://gitlab.com/michelt/ddg-tracker-radar-hostfile/-/raw/master/hostlist.txt seems to block stuff it shouldn't. Ironically it blocked duckduckgo.com for me and even logging into GitHub (github.com/login).

I see, so does that imply that DuckDuckGo tracks us? :) BTW, it is on the list because my generator script is pretty dumb. It iterates over all the files in the domains folder and adds each domain to hostlist.txt if its categories field is either empty or does not contain the words 'cdn' or 'online payment'. As you can see here, the DuckDuckGo file has no categories. This results in the script just adding the domain to the list.
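The rule described above can be sketched roughly as follows. This is not the actual generator script, only a reconstruction from the description; it assumes each JSON file in the domains folder carries a `domain` string and a `categories` list, and the function names are made up for illustration.

```python
import json
from pathlib import Path

# Words that cause a domain to be skipped, per the description above.
EXCLUDED_WORDS = ("cdn", "online payment")


def should_include(categories):
    """True when categories is empty or no category mentions an excluded word."""
    lowered = [c.lower() for c in categories]
    return not any(word in cat for cat in lowered for word in EXCLUDED_WORDS)


def build_hostlist(domains_dir):
    """Collect the domain from every JSON file that passes should_include()."""
    hosts = []
    for path in Path(domains_dir).glob("*.json"):
        with open(path, encoding="utf-8") as fh:
            record = json.load(fh)
        domain = record.get("domain")  # assumed field name
        categories = record.get("categories") or []  # assumed field name
        if domain and should_include(categories):
            hosts.append(domain)
    return sorted(hosts)
```

Note that `should_include([])` is true, which is exactly why a file with no categories ends up on the list.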

sebrk commented 4 years ago

Yes, someone (including myself) should have a proper look at the data and create a structured filter.
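One possible shape for such a filter, as a sketch only: block a domain only when its categories overlap an explicit block set, and skip it when it also falls in a keep set, so that uncategorized entries are left out rather than included by default. The record layout (`domain`, `categories`) and the category names in the example are assumptions and would need checking against the real data.

```python
def select_blockable(records, block_categories, keep_categories=()):
    """Return domains whose categories hit the block set and miss the keep set."""
    block = {c.lower() for c in block_categories}
    keep = {c.lower() for c in keep_categories}
    selected = []
    for record in records:
        cats = {c.lower() for c in record.get("categories") or []}
        # Uncategorized records have an empty set and never match the block set.
        if cats & block and not cats & keep:
            selected.append(record["domain"])
    return sorted(selected)


if __name__ == "__main__":
    sample = [
        {"domain": "uncategorized.example", "categories": []},
        {"domain": "ads.example", "categories": ["Advertising"]},
        {"domain": "cdn-ads.example", "categories": ["Advertising", "CDN"]},
    ]
    print(select_blockable(sample, ["Advertising"], keep_categories=["CDN"]))
```

This inverts the default of the simpler approach: a domain has to earn its place on the list, at the cost of missing trackers that the data set has not categorized.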

beerisgood commented 4 years ago

Sadly, even the non-full list blocks invidio.us. invidio.us is a YouTube frontend with better privacy. I don't know why it gets blocked.

rjhancock commented 4 years ago

Going to have to experiment with this so ads/trackers can be added to the HOSTS generated at https://github.com/StevenBlack/hosts as well as a LittleSnitch subscription set.

What does StevenBlack's list have to do with your own?

Because I use his as a base for my own and would contribute this back into his for the greater good of others who use his.

rd-su commented 4 years ago

Also add support for uBlock Origin and other content blockers.

Using the list to block third-parties...

PsyEng commented 4 years ago

Has any of you looked into the files? It would be better to create a script that only takes the URLs that contain trackers or something similar. Maybe I will look deeper into it, but it could take a moment or two, because JSON isn't my best friend. I think I have an idea, though.

Bullshit detected: More problematic would be the regular expressions used in the files, because they don't conform to any ad or DNS blocker, if I see it correctly.

Edit: I looked around a little. I misunderstood some of you about the iterating and the files, and thought you only took the filenames.

It should be possible to do this with the break conditions and some manual editing after testing. But huge, automatically generated lists like this one will never be 100% perfect, and anyone can get false positives that can't be sorted out.

If anyone would like help, it would be a pleasure to pitch in and learn some new stuff.

jdorweiler commented 4 years ago

There's some good discussion on this in the pihole subreddit. https://old.reddit.com/r/pihole/comments/fdws51/duckduckgo_tracker_radar/fjkkzjq/

beerisgood commented 4 years ago

@jdorweiler why was this closed?

jdorweiler commented 4 years ago

It's better to handle this in the specific client repos that would use a hosts list.

TPS commented 4 years ago

@jdorweiler So, just to clarify, y'all don't want to publish any kind of app-ready final product (even just plain text or hosts list), but are providing this repo solely so that developers can format the data themselves & include into their own apps?

jdorweiler commented 4 years ago

@TPS that's right.

thefaj commented 4 years ago

Yikes. So instead of making something useful (a la Let’s Encrypt), this is just an academic project? A hosts file would be extremely pragmatic (and probably not much work for you people to put together for the community to get a whoooooole lot of goodwill).

thefaj commented 4 years ago

Following on that comment—what is the point of publishing this project?? Do you want to make some difference, or do you just want the techmeme referrals?

rjhancock commented 4 years ago

I think what they provided is more than enough as they didn't have to do it to begin with. They are providing this as a courtesy of what they found. It's up to others to figure out what to do with it.

Providing a hosts file for something that is unknown is not wise and can create issues for others.

thefaj commented 4 years ago

What an awful apologist comment.

PsyEng commented 4 years ago

@thefaj Your comment is completely disrespectful and an insult to everyone who worked on the project and did a good job. @jdorweiler provided a link to a Reddit thread where they discussed the problem of why it's not useful to put these domains in a simple hosts file. It's not possible in any way to provide a clean hosts list for Pi-hole or anything else, because you have to block specific parts of a site and not the site/domain itself. The data could be used to create a blocklist for uBlock Origin or similar adblockers, but you have to put in a heck of a lot of time and effort to test this, because you have to create custom scripts to crawl and get the correct data in the correct format, and test like an idiot so as not to break the internet. Google, Facebook, Microsoft and even DuckDuckGo itself would be blocked completely if you took the data as it is, the way you want.

Please be more respectful. You're in a community that provides free content; you don't have to pay a single cent, and everyone does this in their free time. Not everything that is shiny is gold; some things are only poo.

P.S.: This doesn't rule out criticism, because it is important for everyone. Sometimes you can't see the simplest things, so please give feedback and criticism, but don't be rude, and respect the decisions of the devs.

P.P.S.: Whoever finds typos may keep them, or bring them to Germany^^