complexorganizations / content-blocker

✔️ Content Blocker is a robust web filtering project aimed at enhancing online privacy and security.

Lists missing #24

Closed pallebone closed 3 years ago

pallebone commented 3 years ago

Hi,

I was testing these lists but today they are all 404.

E.g.: https://raw.githubusercontent.com/complexorganizations/content-blocker/main/configs/hosts

No longer exists?

Kind regards, Pete

ghost commented 3 years ago

New lists will be available in the next 24 hours. They are so large that Git wasn't able to support them.

@pallebone

If you want to donate your system resources let me know.

pallebone commented 3 years ago

What kind of resources?

ghost commented 3 years ago

Mostly CPU, some bandwidth (5 GB), and storage (1 GB).

pallebone commented 3 years ago

I can host, but CPU is limited on the cloud instance I use. It's probably better to just FTP up any files you want hosted and do the compute locally on your PC. It will be faster.

ghost commented 3 years ago

Hey, it's going to take a couple of hours for it to update, but it's fine. It's not the hosting that's the issue; the hosting is fine. It's just that generating the list takes too long and uses too many system resources.

This is one of the biggest lists on GitHub that I know of, and I ran out of system resources in a single day.

I have to use Git LFS just to host the files. The more scans there are, the more ad servers I find.

pallebone commented 3 years ago

Ok, if you change your mind and need a place to put files in the cloud, let me know and I can set up an FTP account for you.

ghost commented 3 years ago

It's okay, thanks tho.

I don't need storage; I need CPU resources to find more lists and then validate those lists.

pallebone commented 3 years ago

Ok. Sorry, I don't have that. Hope you can find a solution.

Saugjunkie commented 3 years ago

You can use my vserver for this for a few days.

ghost commented 3 years ago

> You can use my vserver for this for a few days.

It's okay, thank you. I have tried cloud servers and bare-metal servers, but the issue is that they won't be able to handle the load.

Saugjunkie commented 3 years ago

Hmm, okay. I still sent the server details to your protonmail.com address; just try it out.

ghost commented 3 years ago

> Hmm, okay. I still sent the server details to your protonmail.com address; just try it out.

It's okay, thank you; I don't need it.

The problem right now is that I have over 10 million domains on my lists, and it can take up to a week to validate them all one by one, or I can validate them super fast, but that uses a lot of system resources. All I need to do now is figure out a way to validate them super fast without using a lot of system resources, and we'll be fine.

Without validation, the list has over 20 million domains, and a small embedded system cannot support lists that big.
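One common way to trade speed against resource use is to cap how many lookups run at the same time. The following is a minimal Python sketch, not the project's actual code: `validate_domains`, `is_registered`, and `max_workers` are hypothetical names, and the real registration check is left to the caller.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def validate_domains(
    domains: Iterable[str],
    is_registered: Callable[[str], bool],  # hypothetical check supplied by the caller
    max_workers: int = 32,                 # assumed cap; tune to the machine
) -> List[str]:
    """Return the domains for which is_registered() is true, in input order."""
    domains = list(domains)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map keeps input order; at most max_workers checks run at once.
        results = list(pool.map(is_registered, domains))
    return [domain for domain, ok in zip(domains, results) if ok]
```

Lowering `max_workers` reduces load at the cost of a longer run. For lists with millions of entries it would probably be fed in batches, since `pool.map` queues every item up front.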

Saugjunkie commented 3 years ago

Why all at once in one big list? Couldn't you make several smaller ones and divide them up across several servers? Surely that only has to be done once, and some entries are duplicates.

In the first round, validate a file with 1,000,000 entries; then merge everything that remains, divide it again, and check the merged result.
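The split, validate, and merge idea described above could look roughly like this in Python. It is only a sketch under assumptions: `validate_chunk` is a hypothetical callable that returns the surviving domains of one chunk (it could run on a separate server), and the chunk size of 1,000,000 is taken from the comment.

```python
from typing import Callable, Iterable, Iterator, List, Set


def chunks(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def validate_in_rounds(
    domains: Iterable[str],
    validate_chunk: Callable[[List[str]], Iterable[str]],  # hypothetical worker
    chunk_size: int = 1_000_000,
) -> List[str]:
    """Deduplicate, validate chunk by chunk, and merge the survivors."""
    # Normalise and deduplicate first so no domain is checked twice.
    unique = sorted({d.strip().lower() for d in domains if d.strip()})
    merged: Set[str] = set()
    for chunk in chunks(unique, chunk_size):
        merged.update(validate_chunk(chunk))
    return sorted(merged)
```

Deduplicating before splitting means each chunk contains only work that has not been done elsewhere, which is where most of the savings would come from if the source lists overlap heavily.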

ghost commented 3 years ago

> Why all at once in one big list? Couldn't you make several smaller ones and divide them up across several servers? Surely that only has to be done once, and some entries are duplicates.
>
> In the first round, validate a file with 1,000,000 entries; then merge everything that remains, divide it again, and check the merged result.

This is one of the options I have thought of, but I know that if I work on it a bit more, I can make a single automated system that will do this every 24 hours.

Once I get this PR to fix the issue, it will be able to validate 20M domains in easily under 2 hours.

https://github.com/complexorganizations/content-blocker/pull/26

pallebone commented 3 years ago

Just FYI, some malicious domains come online only while a scam is running. E.g., scammers turn on the web server, then disable it again so it can be reused in the future. So validating that domains are online can have a negative consequence: it removes domains that are not always online but should still be blocked.

ghost commented 3 years ago

> Just FYI, some malicious domains come online only while a scam is running. E.g., scammers turn on the web server, then disable it again so it can be reused in the future. So validating that domains are online can have a negative consequence: it removes domains that are not always online but should still be blocked.

Yeah, I check whether the domain is registered or not; that's what I consider validation.

If there is even 0.01% proof that it's registered, then it's a valid domain. I'm using about 10 different validation methods right now and will add more in the future.
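The "any evidence of registration counts" rule can be sketched as a chain of independent checks where the first positive result wins. This is a hedged illustration, not the project's implementation: the function name and the `checks` list are hypothetical, and each check (DNS lookup, WHOIS, and so on) is assumed to be supplied by the caller.

```python
from typing import Callable, Iterable

# A check takes a domain and returns True if it found evidence of registration.
Check = Callable[[str], bool]


def is_registered(domain: str, checks: Iterable[Check]) -> bool:
    """Return True as soon as any one check reports the domain as registered."""
    for check in checks:
        try:
            if check(domain):
                return True
        except Exception:
            # A failing check (timeout, network error) counts as "no evidence"
            # from that method; the remaining checks still get a chance.
            continue
    return False
```

Ordering the cheapest checks first keeps the average cost per domain low, since most registered domains would be accepted before the expensive methods run.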

There are 100,000s of domains that are not even registered at the moment and are still on other people's lists. Blocking them takes up too many system resources when they are not even registered.

Some guy wrote a random word generator and then used its output as a suffix list, and that adds about 1 GB of extra invalid domain names. Shit like example.fdkljfhdlkjfd can't be a valid domain, but it's still in people's lists, like wtf.
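Entries such as example.fdkljfhdlkjfd can be rejected without any network traffic by checking the last label against a set of known top-level domains. A minimal Python sketch follows; `KNOWN_TLDS` is a tiny illustrative subset chosen for this example, and a real run would presumably load the full list published by IANA or the Public Suffix List.

```python
from typing import FrozenSet

# Illustrative subset only (an assumption for this example), not a complete list.
KNOWN_TLDS: FrozenSet[str] = frozenset({"com", "net", "org", "info", "tv"})


def has_known_tld(domain: str, tlds: FrozenSet[str] = KNOWN_TLDS) -> bool:
    """Return True if the domain has at least two labels and a known TLD."""
    labels = domain.strip().rstrip(".").lower().split(".")
    # all(labels) rejects empty labels such as in "example..com".
    return len(labels) >= 2 and all(labels) and labels[-1] in tlds
```

This only proves that the suffix exists, not that the domain is registered, so it works best as a cheap first filter before the slower checks.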

pallebone commented 3 years ago

Ok makes sense.

ghost commented 3 years ago

StevenBlack/hosts is one of the most popular lists I'm aware of, and it contains a handful of domains that don't make any sense.

https://raw.githubusercontent.com/StevenBlack/hosts/master/hosts

0.0.0.0 castoola.tv.lan

Even though it isn't a genuine domain, it is still on the list.

r1.sn-o097znlr.a1.googlevideo
r2.sn-o097znlr.a1.googlevideo
r3.sn-o097znlr.a1.googlevideo
r4.sn-o097znlr.a1.googlevideo
r5.sn-o097znlr.a1.googlevideo
r6.sn-o097znlr.a1.googlevideo
r7.sn-o097znlr.a1.googlevideo
r8.sn-o097znlr.a1.googlevideo
r9.sn-o097znlr.a1.googlevideo
r10.sn-o097znlr.a1.googlevideo
r11.sn-o097znlr.a1.googlevideo
r12.sn-o097znlr.a1.googlevideo
r13.sn-o097znlr.a1.googlevideo
r14.sn-o097znlr.a1.googlevideo
r15.sn-o097znlr.a1.googlevideo
r16.sn-o097znlr.a1.googlevideo
r17.sn-o097znlr.a1.googlevideo
r18.sn-o097znlr.a1.googlevideo
r19.sn-o097znlr.a1.googlevideo
r20.sn-o097znlr.a1.googlevideo

https://raw.githubusercontent.com/kboghdady/youTube_ads_4_pi-hole/master/youtubelist.txt

Even though they aren't genuine domains, they're nonetheless on the lists.

You could have your DNS server block these domains, but what's the point of wasting your own system resources on blocking them when they are not even legitimate domain names?

Last example:

0.0.0.0 7cloudtech-vps.info

from https://raw.githubusercontent.com/blocklistproject/Lists/master/fraud.txt

It hasn't even been registered yet, and it is still blocked: https://domains.google.com/registrar/search?searchTerm=7cloudtech-vps.info

What are people trying to accomplish by generating random domain names and then guessing whether or not an attack will use them in the future?
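To audit third-party lists like the ones quoted above, the domains first have to be pulled out of the hosts-file format. The sketch below is an assumption about how that could be done in Python; `domains_from_hosts` and `SINK_ADDRESSES` are hypothetical names, and only the first hostname on each line is read.

```python
from typing import Iterable, Iterator

# Addresses commonly used as the "block" target in hosts-style lists.
SINK_ADDRESSES = {"0.0.0.0", "127.0.0.1"}


def domains_from_hosts(lines: Iterable[str]) -> Iterator[str]:
    """Yield domains from hosts-style lines or plain one-domain-per-line lists."""
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        parts = line.split()
        if len(parts) >= 2 and parts[0] in SINK_ADDRESSES:
            yield parts[1].lower()            # "0.0.0.0 example.com"
        elif len(parts) == 1:
            yield parts[0].lower()            # bare domain on its own line
```

The extracted names can then be passed through whatever validation is in place, so unregistered or malformed entries are dropped before the list is published.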

pallebone commented 3 years ago

As no further discussion is warranted, I am closing this issue to tidy up.

ghost commented 3 years ago

@pallebone I am going to push the unvalidated data and then work on this. Once it's ready, I will push the validated data.

ghost commented 3 years ago

@pallebone @Saugjunkie

The temp lists are ready. These are the biggest lists on GitHub that I know of. The official lists contain over 5 million valid domains, but these are a temporary update.

Saugjunkie commented 3 years ago

Yeah, very nice. Can we test this now? It didn't work with the coredns branch from wireguard.

ghost commented 3 years ago

For now, import it into uBlock Origin; I will push a fix for CoreDNS.

Saugjunkie commented 3 years ago

> For now, import it into uBlock Origin; I will push a fix for CoreDNS.

Unfortunately, I didn't understand that, even with Google Translate.

ghost commented 3 years ago

https://github.com/gorhill/uBlock

ghost commented 3 years ago

My lads, I finally got it working on a DigitalOcean VPS with 64 GB of memory and 32 cores. I'll send out the entire list tonight, and an automated tool by tomorrow.

pallebone commented 3 years ago

good job. sounds expensive though :(

ghost commented 3 years ago

> good job. sounds expensive though :(

It costs around $1 per hour, and updating the list takes about an hour, so over the course of a month, it costs about $30, which is good.

I'll gladly pay $30 to ensure that all of the domains are legitimate and operating.

ghost commented 3 years ago

Guys, it works now and pushes an update every 24 hours.

Note: Every single domain is active and validated. I am 99% sure this is one of the biggest lists on GitHub. There is one bigger than this, but that's not validated data.

Saugjunkie commented 3 years ago

> Guys, it works now and pushes an update every 24 hours.

Good News

THX @Prajwal-Koirala