AdguardTeam / HostlistCompiler

A simple tool that compiles hosts blocklists from multiple sources
GNU General Public License v3.0

Exclusion Question #67

Open KnightmareVIIVIIXC opened 4 months ago

KnightmareVIIVIIXC commented 4 months ago

I'm trying to figure out whether there is an easy way, or something I'm overlooking, to keep only the subdomain entries and drop the main domain entry. I'm using doubleclick.net as an example. In this JSON file, I have an exclusion file whose only entry is ||doubleclick.net^, because I thought that would prevent hostlistcompiler from compressing down to the top domain:

{
    "name": "Test",
    "sources": [{
            "source": "https://someonewhocares.org/hosts/zero/hosts",
            "transformations": ["Validate", "RemoveModifiers"]
        }],
    "transformations": ["Compress", "RemoveComments", "Deduplicate", "RemoveEmptyLines", "TrimLines"],
    "exclusions_sources": ["exclude.txt"]
}

However, that isn't what happens: it still compresses to ||doubleclick.net^ and removes the subdomain entries. I'm also very tired, so like I said, I could be overlooking something very obvious.

hagezi commented 4 months ago

@KnightmareVIIVIIXC

The exclusions seem to be applied to the source format, in your case hosts. Therefore, you must include 0.0.0.0 doubleclick.net in the exclude.txt:

hostlist-compiler -v -c test.json -o output.txt | grep 'doubleclick'
› 0.0.0.0 doubleclick.net excluded by 0.0.0.0 doubleclick.net
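
To illustrate why the format of the exclusion entry matters, here is a minimal Python sketch. It is not HostlistCompiler's code: the function name `apply_exclusions`, the sample lines, and the exact-match rule are all assumptions made for illustration. The only point is that an exclusion is compared against the raw source line, so an adblock-style entry never matches a hosts-format line.

```python
def apply_exclusions(lines, exclusions):
    """Drop every line that exactly matches an exclusion entry.

    Exact matching is an assumption for this sketch; the real tool's
    matching rules may differ.
    """
    excluded = {e.strip() for e in exclusions if e.strip()}
    return [line for line in lines if line.strip() not in excluded]


# Made-up hosts-format lines, standing in for the downloaded source.
hosts_lines = [
    "0.0.0.0 doubleclick.net",
    "0.0.0.0 ad.doubleclick.net",
]

# An adblock-style entry never equals a hosts-format line, so nothing is removed.
print(apply_exclusions(hosts_lines, ["||doubleclick.net^"]))

# A hosts-format entry matches the raw source line and removes it.
print(apply_exclusions(hosts_lines, ["0.0.0.0 doubleclick.net"]))
```

With the second call, only the base-domain line is dropped and the subdomain line survives.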

KnightmareVIIVIIXC commented 2 months ago

I just thought of something that would make my train of thought work: a convert transformation. This transformation would only do the first step of the compression process but not the second. This way, I could convert the individual lists into the adblock format with all of the subdomains still present:


||doubleclick.net^
||sub1.doubleclick.net^
||sub2.doubleclick.net^
||sub3.doubleclick.net^
||sub4.doubleclick.net^

Then, during the global transformations at the end of the JSON file, where my global exclude.txt is applied, it would hopefully read that file, see that I don't want ||doubleclick.net^, and keep only the subdomain entries, so the final result would be:


||sub1.doubleclick.net^
||sub2.doubleclick.net^
||sub3.doubleclick.net^
||sub4.doubleclick.net^
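
The convert-then-exclude idea described above could be sketched roughly like this in Python. Everything here is hypothetical: `hosts_to_adblock` and `convert_then_exclude` are illustrative names, the proposed Convert transformation is only a suggestion in this thread, and the sketch takes only the first hostname on each hosts line.

```python
def hosts_to_adblock(line):
    """Turn '0.0.0.0 example.org' into '||example.org^'.

    Returns None for comments, blank lines, and anything without a hostname.
    Simplification: only the first hostname on the line is used.
    """
    parts = line.split("#", 1)[0].split()
    if len(parts) < 2:
        return None
    return f"||{parts[1]}^"


def convert_then_exclude(lines, exclusions):
    """Convert hosts lines to adblock rules, then drop exactly-matching exclusions."""
    rules = [r for r in map(hosts_to_adblock, lines) if r is not None]
    blocked = set(exclusions)
    return [r for r in rules if r not in blocked]


hosts_lines = [
    "0.0.0.0 doubleclick.net",
    "0.0.0.0 sub1.doubleclick.net",
    "0.0.0.0 sub2.doubleclick.net",
    "# a comment line",
]

print(convert_then_exclude(hosts_lines, ["||doubleclick.net^"]))
# prints ['||sub1.doubleclick.net^', '||sub2.doubleclick.net^']
```

Because the exclusion is compared against rules that are already in adblock format, the ||doubleclick.net^ entry now matches the base rule and only the subdomain rules remain.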

I've been using the exclusions_sources option incorrectly, thinking that hostlistcompiler would know that while I don't want the base domain blocked, I do want to block the subdomains.

{
    "name": "Test",
    "sources": [{
            "source": "https://someonewhocares.org/hosts/zero/hosts",
            "transformations": ["Convert", "Validate", "RemoveModifiers"]
        }],
    "transformations": ["Compress", "RemoveComments", "Deduplicate", "RemoveEmptyLines", "TrimLines"],
    "exclusions_sources": ["exclude.txt"]
}

And if the converted list looks like this:


||doubleclick.net^
||sub.doubleclick.net^
||1.sub.doubleclick.net^
||2.sub.doubleclick.net^
||3.sub.doubleclick.net^
||4.sub.doubleclick.net^

The final result would just be ||sub.doubleclick.net^
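
The expected outcome of that second example can be checked with a small Python sketch. It assumes two things that are not confirmed by the tool itself: that the exclusion is applied to the adblock-format rules before compression, and that compression simply drops any rule whose parent domain is also in the list. The helper names `domain_of` and `compress` are hypothetical.

```python
def domain_of(rule):
    """Extract 'example.org' from '||example.org^'."""
    return rule[2:-1]


def compress(rules):
    """Keep only rules that have no parent domain elsewhere in the list."""
    domains = {domain_of(r) for r in rules}
    kept = []
    for rule in rules:
        labels = domain_of(rule).split(".")
        # Every proper suffix of the domain is a potential parent.
        parents = {".".join(labels[i:]) for i in range(1, len(labels))}
        if not (parents & domains):
            kept.append(rule)
    return kept


rules = [
    "||doubleclick.net^",
    "||sub.doubleclick.net^",
    "||1.sub.doubleclick.net^",
    "||2.sub.doubleclick.net^",
    "||3.sub.doubleclick.net^",
    "||4.sub.doubleclick.net^",
]

# Assumed order: exclude the base domain first, then compress what is left.
remaining = [r for r in rules if r != "||doubleclick.net^"]
print(compress(remaining))
# prints ['||sub.doubleclick.net^']
```

Running `compress(rules)` without the exclusion step collapses everything to ||doubleclick.net^, which is the behavior described at the start of this issue.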