ScriptTiger / scripttiger.github.io

GitHub Pages
https://scripttiger.github.io
MIT License

hosts to alts/ script #1

Closed spirillen closed 5 years ago

spirillen commented 5 years ago

Hi @ScriptTiger

I can see you have made a script that converts hosts files into DNS files, which is the very same thing that would be happening in my "RPZ DNS firewall tools" repo.

Which should collect from External Sources and from the Matrix source lists.

The question is, would you like to share this script and/or modify it so that it works for this, and release it in the "RPZ DNS firewall tools" group? I find it clumsy to reinvent the wheel.

ScriptTiger commented 5 years ago

I certainly plan to release everything, for sure! Before I release it though, I just need to clean some things up. Over the evolution of my website I have been continually adding and modifying my internal scripts. The original script I wrote was clean and had comments and everything. However, over time as I added more to it, it just got messier and messier and isn't even all that efficient, I just got lazy to fix it because ultimately it works even though I know it could work better.

Another problem with my scripts is that they can cause serious problems with the latest Windows 10 builds. Windows 10 has been pushing a lot of "performance" enhancements which don't work well for my scripts due to the internal iteration processes. I actually have to boot into safe mode just to run my script every time I regenerate my website. I'm trying to minimize this as much as possible, as well, before I release it so it's not wreaking havoc and breaking people's systems unnecessarily.

I also, of course, have a day job and family, etc., and all of this is just a hobby of mine. I squeeze in time to work on it when I can, but can't really put any deadlines to it. Of course I really appreciate all the interest and it definitely further motivates me to try to speed up my development a bit more, but I hope everyone can understand if I work a bit slow when it comes to these things.

ScriptTiger commented 5 years ago

@spirillen, if your project is mostly just interested in the RPZ script, I can rip all of those pieces out and make a script just for that, which might be faster and more efficient if that's the piece of it you're after.

EDIT: I have just created a local repository for this and should have it out within the next week or so, as time permits.

EDIT: I have re-purposed an existing repository I already have for this purpose.

https://github.com/ScriptTiger/Hosts-Conversions

I will slowly churn out conversion scripts for all of the formats my website currently supports and I will mirror that repository on GitLab.

spirillen commented 5 years ago

Hi @ScriptTiger

> if your project is mostly just interested in the RPZ script, I can rip all of those pieces out and make a script just for that, which might be faster and more efficient if that's the piece of it you're after.

Would be extremely great :cool:

You're right that it's primarily the RPZ-formatted files I'm interested in for this project, as I see hosts files as rather outdated due to the way Windows, as well as several freeware or malicious programs disguised as antivirus (such as 360totalsecurity.com), keep resetting the hosts file. Another issue with using hosts files is their size, which impacts the system, as hosts files were never intended to be used for anything more than a few lines on a small local network.

Today you have several good alternatives to that by using DNS servers or DNS recursors.

I tend to recommend that Windows Home edition users set up an Unbound recursor, and that *nix users set up either dnsdist with the PowerDNS Recursor (both from powerdns.com) or, alternatively, BIND 9, as the RPZ support there is better at the time of writing.

From your link above to ScriptTiger/Hosts-Conversions, I noticed that it is coded to run on the Windows platform. May I suggest rewriting it in bash/perl/python and having it run as a CI/CD cron job?

Don't worry about the time scope as it is a hobby project for me too :)

I will be working on a script to import all external sources into a list formatted as domain.tld in the hosts sources folder and try to make it run on CI/CD basis. It would be in the style subfolder(source_name)/domain.list (UTF-8 encoded)
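As a rough illustration of the import step described above, here is a minimal Python sketch that reduces hosts-style text (or a plain domain list) to bare `domain.tld` entries. It is not code from either project: the function names, the set of skipped local names, and the sample input are all assumptions made for the example.

```python
import ipaddress

# Hypothetical set of local names that should never end up in a block list.
LOCAL_NAMES = {"localhost", "localhost.localdomain", "local", "broadcasthost"}


def is_ip(token):
    """Return True if token is an IPv4 or IPv6 literal."""
    try:
        ipaddress.ip_address(token)
        return True
    except ValueError:
        return False


def domains_from_hosts(text):
    """Return unique domains from hosts-style text, in first-seen order."""
    ordered = []
    seen = set()
    for raw in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        # Hosts lines look like "<ip> <name> [<name>...]";
        # a bare domain list has a single field per line.
        names = fields[1:] if len(fields) > 1 else fields
        for name in names:
            name = name.lower().rstrip(".")
            if name in LOCAL_NAMES or "." not in name or is_ip(name):
                continue
            if name not in seen:
                seen.add(name)
                ordered.append(name)
    return ordered


sample = """# comment
127.0.0.1 localhost
0.0.0.0 ads.example.com # tracker
0.0.0.0 0.0.0.0
tracker.example.net
0.0.0.0 ADS.example.com
"""
print(domains_from_hosts(sample))
```

The returned list could then be written, one domain per line and UTF-8 encoded, to a per-source `domain.list` file in the layout mentioned above.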

ScriptTiger commented 5 years ago

> From your link above to ScriptTiger/Hosts-Conversions, I noticed that it is coded to run on the Windows platform. May I suggest rewriting it in bash/perl/python and having it run as a CI/CD cron job?

We can certainly port it to any language, but I'll just fill that repo up with Windows scripts for all of the supported formats first so we can use them to prototype. More important than code is the notation that follows it, which can be used as pseudocode/guidance/instructions to more quickly develop for any language once there is an initial prototype.

spirillen commented 5 years ago

Great, I like your thinking :)

ScriptTiger commented 5 years ago

@spirillen, I just put out the RPZ converter. I still plan on putting out a converter for each format I support, but just thought I'd give you a special announcement since I know that's the one you're interested in.

> Which should collect from External Sources and from the Matrix source lists.

Right now the Hosts-Conversions repository is only for converting hosts files. From looking through your sources, it seems you have a mix of formats, including domain name lists as well as hosts files and possibly others. Since creating a script to specifically pull and convert from your sources would be highly specific to your project alone and not so much related to my website and/or its data generation, I'm going to close this issue here on GitHub and this can be discussed further on your GitLab project.

ScriptTiger commented 5 years ago

@spirillen, are you looking for a script to merge all of your sources together into a single RPZ file, similar to what Steven Black's Python script does for hosts files?

Have you talked to Steven Black about getting any of your lists merged with his current data sources? https://github.com/StevenBlack/hosts/tree/master/data

Steven's script currently has a large community and a lot of support behind it. If you get your lists merged with his, it would be a lot easier and you'd be helping people out by contributing your bad hosts to the collective.

spirillen commented 5 years ago

Hi @ScriptTiger, thanks for your thoughts :smile: And yes, I intend to offer my lists to @StevenBlack's lists, but currently his program requires that hosts file entries start with [127.0.0.1|0.0.0.0], which I'm working on :)

As for your second question: yes, one other idea is to convert the hosts files into RPZ zones, for a more modern and correct way of adding huge block lists :exclamation:

But after my coffee I will move on with the matrix's gitlab-ci to sort the lists, and then the hosts-sources :)

Unfortunately I'm a bit busy sailing for the next couple of weeks.

ScriptTiger commented 5 years ago

No worries, take your time. I think the main thing is just to get it all in the same format, whether it's just a list of domains, hosts file format or RPZ doesn't really matter. Conversions from one format to another are always easy, but mixing in different formats gets more complicated and confusing.
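To make the "one format in, any format out" point concrete, here is a small Python sketch that turns a plain domain list into hosts lines and into RPZ records. It is only an illustration, not the Hosts-Conversions code: the function names and the `0.0.0.0` sink address are assumptions, and the RPZ output covers only the policy records (`CNAME .` is the RPZ action for returning NXDOMAIN), without the SOA/NS header a complete zone file needs.

```python
def to_hosts(domains, sink_ip="0.0.0.0"):
    """Render each domain as a hosts-file line pointing at sink_ip."""
    return [f"{sink_ip} {domain}" for domain in domains]


def to_rpz(domains, wildcard=True):
    """Render each domain as RPZ NXDOMAIN records ("CNAME .").

    With wildcard=True, a second record also covers all subdomains.
    """
    lines = []
    for domain in domains:
        lines.append(f"{domain} CNAME .")
        if wildcard:
            lines.append(f"*.{domain} CNAME .")
    return lines


domains = ["ads.example.com", "tracker.example.net"]
print("\n".join(to_hosts(domains)))
print("\n".join(to_rpz(domains)))
```

Because both converters take the same plain list as input, supporting another output format only means adding another small function of the same shape.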

Enjoy your sailing!

spirillen commented 5 years ago

That's true, and that's why I make the domain-only list, to have some raw material for easy converting :)

And thanks :boat:

TPS commented 2 years ago

@ScriptTiger Just wanted to mention, that per https://github.com/StevenBlack/hosts/issues/1853#issuecomment-1069702868, you seem to have the best version of @StevenBlack's hosts files for Windows users, & to request that https://scripttiger.github.io/alts/ continue to stay updated for all of us. Thanks! 🙇🏾‍♂️

ScriptTiger commented 2 years ago

Thanks for your support, @TPS!

@StevenBlack does what he does very well, and I can't fault him for staying focused, be it eccentrically anti-Windows or otherwise lol. So, I've been providing supplemental data sets for a few years now.

I usually try and keep things up to date at least within a few days. I'm in the process of rewriting a lot of my back-end scripts to Golang at the moment, as well, to speed things up and make updating my website a bit less painful.

If you're interested in getting notifications when I update, you can always "watch" this repository for new commits or follow me on Twitter, https://twitter.com/ScriptTiger. My Twitter is mostly automated tweets with updates on Steven's repository as well as this one, although I do occasionally post other things that I find to be relevant or useful.

TPS commented 2 years ago

@ScriptTiger I just noticed that there are no descriptions on this page for adblock or dnsmasq versions on https://scripttiger.github.io/alts/. I use (also) the adblock version w/ AdGuard over several platforms & was looking to find out more re: the conversion process (e.g., whether any filter compaction was done due to the format being more efficient, &c). Would you be interested in adding that info?

ScriptTiger commented 2 years ago

I just updated the descriptions, but I'm not sure if it's as technical as you were expecting. Since these lists are security-related, it's best to keep them as simple and as static as possible, without doing any fancy REGEX multi-pass condensing or anything, which would actually just end up making things slower and increase the chances of failure/denial of service. That being said, since both the Adblock and dnsmasq formats allow for suffix matching, all child sub-domains have been removed from those lists if parent domains are already present on the list. This reduces redundancy and takes advantage of the flexibility of those formats, but also keeps things as simple and as static as possible at the same time to strike a balance between security, stability, and performance.
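The sub-domain removal described above can be sketched in a few lines of Python. This is only an illustration of the idea under simple assumptions (one domain per entry, exact label-boundary suffixes); it is not taken from Hosts-BL, and the function name is hypothetical.

```python
def drop_covered_subdomains(domains):
    """Drop every domain whose parent domain is also on the list.

    Only whole-label suffixes count as parents, so "notexample.com"
    is not treated as a child of "example.com".
    """
    listed = {d.lower().rstrip(".") for d in domains}
    kept = []
    for domain in sorted(listed):
        labels = domain.split(".")
        # Proper parents of a.b.example.com: b.example.com, example.com, com
        parents = (".".join(labels[i:]) for i in range(1, len(labels)))
        if not any(parent in listed for parent in parents):
            kept.append(domain)
    return kept


entries = [
    "ads.example.com",
    "example.com",
    "x.ads.example.com",
    "notexample.com",
    "tracker.net",
]
print(drop_covered_subdomains(entries))
```

This is a single pass with set lookups and no pattern matching, which fits the stated goal of keeping the lists simple and static.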

Paul Vixie, one of the fathers of modern DNS, actually graced us with a thread on Steven's repo in the past to discuss the best methods of handling things like this for his RPZ format. So, I basically continue to carry those best practices forward with other formats, regardless of how advanced their filtering schemes may be, in order to keep the focus on security and stability first.

I think one common assumption may be that having a smaller list which takes up a smaller footprint in memory will increase performance, so compressing and compacting as much as possible should be the way to go. But inherently the way REGEX works, being forced to break apart everything into sub-strings in order to parse expressions, can't compete with the speed of simple suffix matching, where the processor becomes the limiting factor and not the amount of memory available. And since Steven's lists are as lightweight and streamlined as they come from the get-go, memory should be the least of anyone's worries anyway.
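For readers who want to see the two lookup styles side by side, here is a minimal Python sketch. It only shows what each approach does per query and makes no performance measurement of its own; the function names and sample data are assumptions for illustration, not how any particular blocker is implemented.

```python
import re


def suffix_blocked(name, blocked):
    """Suffix matching: test each whole-label suffix of name against a set."""
    labels = name.lower().rstrip(".").split(".")
    return any(".".join(labels[i:]) in blocked for i in range(len(labels)))


def regex_blocked(name, patterns):
    """Regex matching: run every compiled pattern against name."""
    return any(pattern.search(name) for pattern in patterns)


blocked = {"example.com"}
patterns = [re.compile(r"(^|\.)example\.com$")]

for query in ("ads.example.com", "notexample.com"):
    print(query, suffix_blocked(query, blocked), regex_blocked(query, patterns))
```

With suffix matching, the number of set lookups depends only on how many labels the queried name has, whereas the regex version has to try patterns one after another until one matches.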

I hope this helps, but feel free to reach out with any more questions or comments if need be. And, as always, thanks for your continued interest!

TPS commented 2 years ago

Thanks very much, @ScriptTiger, you answered my question beautifully. 🙇🏿‍♂️

spirillen commented 2 years ago

@ScriptTiger wrote:

> That being said, since both the Adblock and dnsmasq formats allow for suffix matching

I would like to mention dnsdist in this regard, as it supports full regex for DNS manipulation, such as IP obfuscation and NXDOMAIN, with an extremely small memory footprint of 184 MB in my case. (Seems like I have to do some housekeeping there.)

Examples

addAction('matrix.lan.', SpoofAction('127.0.0.2'))
addAction('www.matrix.lan.', SpoofAction('127.0.0.2'))
addAction(RegexRule("[\\.]?google-analytics\\.com"), RCodeAction(DNSRCode.NXDOMAIN))
addAction('fls-na.amazon.com.', RCodeAction(DNSRCode.NXDOMAIN))
addAction('myphonenumbers-pa.googleapis.com$', RCodeAction(DNSRCode.NXDOMAIN))

TPS commented 2 years ago

This was my fault in somehow continually choosing the wrong format's link. Sorry for the thread spam.

> I use (also) the adblock version w/ AdGuard over several platforms

@ScriptTiger It turns out your AdBlock conversion has a rather serious but infrequent error, but w/ a very easy fix: Per the universal syntax, you need to use prefix/suffix on the domains to keep from unintended matches.

I.e., let's suppose the domain is ad.example.co & is written like that in your Adblock conversion. It's meant to select:

- ad.example.co
- adblockbypassjd.ad.example.co

Well & good. Problem is, it'll also match these, which are clearly not intended:

- load.example.co
- load.example.com
- map.road.example.co
- map.road.example.com
- https://searchengine.example.com/q=site:ad.example.co # Used to see how many sites this pernicious domain might have

The solution is simply to write the rule as ||ad.example.co^ & instantly the spurious matches cease. It's very scriptable. 🙇🏿‍♂️
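The difference between a bare pattern and an anchored `||domain^` rule can be illustrated with a simplified Python model. This is not a real Adblock engine: the bare pattern is modelled as a plain substring test on the URL, the anchored rule as "host equals the domain or is a sub-domain of it", and both function names are hypothetical.

```python
from urllib.parse import urlsplit


def bare_match(pattern, url):
    """Model of an unanchored rule: matches anywhere in the URL."""
    return pattern in url


def anchored_match(domain, url):
    """Simplified model of ||domain^: host is domain or one of its sub-domains."""
    host = (urlsplit(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


urls = [
    "https://ad.example.co/",
    "https://adblockbypassjd.ad.example.co/x",
    "https://load.example.co/",
    "https://map.road.example.com/",
    "https://searchengine.example.com/q=site:ad.example.co",
]
for url in urls:
    print(url, bare_match("ad.example.co", url), anchored_match("ad.example.co", url))
```

In this model the first two URLs match under both rules, while the last three match only the bare pattern, which is the kind of unintended match the anchors are meant to prevent.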

ScriptTiger commented 2 years ago

@TPS, thanks again for your continued feedback!

I have just verified my local and remote versions of the adblock files available on the website (https://scripttiger.github.io/alts/) as well as the open source code available for the Hosts-BL repo (https://github.com/ScriptTiger/Hosts-BL/blob/main/hosts-bl.go) which is what actually creates those files and I don't see any variation from the "solution" you're mentioning. Maybe I am just misunderstanding the problem? From what I see, all of the entries in the adblock files already have the prefix of || and suffix of ^. Is it possible you accidentally downloaded either the FQDN or RFQDN files by mistake? The FQDN and RFQDN formats are the only formats which contain only raw domain names without any additional context or characters.

TPS commented 2 years ago

> Is it possible you accidentally downloaded either the FQDN or RFQDN files by mistake?

Umm, yes! 😳 I checked the lists so many times before typing all of the above & still managed to fat-finger it every time, apparently. I apologize. 🙇🏿‍♂️