chadmayfield / my-pihole-blocklists

Create custom pi-hole blocklists
GNU General Public License v3.0
334 stars 72 forks

Non-domains in list #2

Closed: deathbybandaid closed this issue 5 years ago

deathbybandaid commented 7 years ago

Your list has 4 lines of extra content (not devastating):

I assume these are just categories

I'm not familiar enough with Perl to help, but my parsing script uses sed to remove lines that don't meet FQDN requirements.

A valid domain name has :

I'm assuming that this is a small addon to your script to parse them out.
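For anyone reading along, here is a rough sketch of that kind of filter, written in Python rather than sed or Perl. The rules it applies are an assumption based on common hostname conventions (labels of 1 to 63 letters, digits, or hyphens, no leading or trailing hyphen, at least two labels, 253 characters at most), not the exact requirements listed above. The names `is_fqdn` and `filter_domains` and the sample entries are made up for illustration.

```python
import re

# Assumed rules (not taken from this thread): each label is 1-63 characters of
# letters, digits, or hyphens and does not start or end with a hyphen; the
# name has at least two labels and is at most 253 characters long.
LABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?"
FQDN_RE = re.compile(rf"{LABEL}(?:\.{LABEL})+")


def is_fqdn(line: str) -> bool:
    """Return True if the line looks like a fully qualified domain name."""
    name = line.strip()
    return len(name) <= 253 and FQDN_RE.fullmatch(name) is not None


def filter_domains(lines):
    """Keep only the lines that pass is_fqdn, stripped of whitespace."""
    return [line.strip() for line in lines if is_fqdn(line)]


if __name__ == "__main__":
    # Hypothetical sample: two real-looking domains, a bare word, a blank line.
    sample = ["example.com\n", "adult\n", "\n", "sub.example.org\n"]
    for domain in filter_domains(sample):
        print(domain)
```

A single-label entry such as a bare TLD or a category name fails the "at least two labels" check, so it would be dropped along with blank lines.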

BTW, people really like your lists, and they are featured on https://wally3k.github.io/

Thanks again for such a great list!

chadmayfield commented 7 years ago

@deathbybandaid I haven't forgotten the request you made via Twitter! I've been so busy lately that I haven't gotten back to this project.

Regardless, yes, these four lines should not be in the list. The list that the script downloads (adult.tar.gz) uses those four lines as TLDs to do a general filter that blocks all domains within those TLDs. At first glance it appears that the regexp on line 71 is either not working correctly or not being called properly. The light list (made up of all porn sites in the Alexa top 1 million list) does not have the errant lines. The heavy list (with ~2 million domains) seems to be the problem.

I'll look at it and report back.

Thanks for the report of https://wally3k.github.io/, I'll have to head over there and see it. I had no idea people would use them!

deathbybandaid commented 7 years ago

I ended up figuring out my script on my own. I'm a Windows IT admin playing with Linux, so it's been a challenge.

Are you planning on running a weekly cron for this?

chadmayfield commented 7 years ago

My initial plan is to run the script via cron weekly, if not monthly. There isn't much movement in the list other than possibly reordering the top sites and adding a few to the end, which really wouldn't affect most of the sites on the porn list. I also want to be respectful of the upstream providers by not hitting their servers too much.
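One way to stay polite to upstream regardless of how often cron fires is to skip the download when the local copy is still fresh. This is only a sketch of that idea in Python, not what the script in this repo does; `needs_refresh`, the 7-day default, and the local file path are assumptions for illustration.

```python
import os
import time

SECONDS_PER_DAY = 86400


def needs_refresh(path, max_age_days=7.0, now=None):
    """Return True if the file is missing or older than max_age_days."""
    if not os.path.exists(path):
        return True
    if now is None:
        now = time.time()
    age_seconds = now - os.path.getmtime(path)
    return age_seconds > max_age_days * SECONDS_PER_DAY


if __name__ == "__main__":
    # Hypothetical local copy of the upstream archive.
    local_copy = "adult.tar.gz"
    if needs_refresh(local_copy):
        print("local copy is missing or stale; download it again")
    else:
        print("local copy is fresh; skipping the download")
```

With a check like this, running the job more often than weekly would still hit the upstream servers at most once per `max_age_days`.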

FYI, here are some of the other lists that I plan on looking at making creation scripts for (from today's blog post):

pornography/adult/mature content; bots/spiders; hijacked/known exploit sites/spammers; proxies; known pedophile sites; malware/spyware; microsoft; countries (cn/ru/su/etc...); illegal drug usage; gambling; dating; violence/hate/racism

chadmayfield commented 5 years ago

@deathbybandaid sorry I have not looked at this sooner. I am finally getting back to it! The offending non-domains should not be in the list, and if you generate the lists yourself they will not show up. Let me know if you see problems.