Filter minimization and delta update: A bandwidth saving experiment

jspenguin2017 commented 6 years ago

Most users don't need to see the content of filter lists, so there is little reason to keep comments in the distributed file. However, this can make development difficult, so Nano will support a pre-processor directive that points a compressed filter to the development version.

!@pragma nano-srcmap=<address> When the filter viewer sees this statement, it will try to load the source map of the filter, if successful, the source map is cached and displayed instead.

https://github.com/gorhill/uBlock/issues/3406

Discussions about delta update are below

elypter commented 6 years ago

thats cool. commands help when looking at filters. at the same time i assume this saves precious ram. about the hosting issue, i dont think this should be a huge problem. hosting is getting cheaper and cheaper and these filter lists arent big files either. maybe these kind of problems will all disappear once filterlists.org progresses. features like combining and preprocessing lists are already planned and a central place for hosting those filters can make hosting cheaper as well because you do not need to rent a vps for each individual file but just one that you can upgrade if needed. a usable vps can cost as little as 10$/year

jspenguin2017 commented 6 years ago

This will not save RAM, but will save hard drive / SSD space, but that 1MB saved here will not help when your 128GB SSD is running low.

i dont think this should be a huge problem. hosting is getting cheaper and cheaper

That's exactly what I think, although it's the opposite of what gorhill thinks, he still have 5 days update cycle for uBlock filters and he isn't happy that Nano is puling it every 3 days instead. Well, it's his repository so I'll respect his decision, if he asks me to pull uBlock filters elsewhere I'll host them myself.

My VPS costs $5/month, with free 40GB/s instream, which makes it great at scraping websites and scanning for stuff across the Internet. Outstream isn't free though, so not a super good CDN, but if I minimize + gzip filters it'll not be a problem.

elypter commented 6 years ago

this problem could also be mitigated by adding a mirror directive so a filterlist can contain a list of mirrors that can be used to update if the primary one is down or to allow multiple primary mirrors for load balancing.

jspenguin2017 commented 6 years ago

assets.json does list multiple mirrors, although I'm not sure if that's just for preventing adding duplicate filters or it's actually used when the first entry is down. (I'll need to read more of the code)

elypter commented 6 years ago

there is not a very big need for this right now. if it will become neccesary one day i think it should be an option in the filterlist file itself so listmaintainers can deliver new mirrors on every update

jspenguin2017 commented 6 years ago

I'll start with minimizing stuff, because that's a cheap way to reduce bandwidth usage. TBH if I can get differential update set up, I can have filters update twice a day and it'll still consume less bandwidth.

elypter commented 6 years ago

before you try differential updates i would try to search the web for a ready made solution because it is probably quite some work to do it properly

jspenguin2017 commented 6 years ago

There are differential update engines out there for sure. I'm not crazy enough to go make one myself, don't worry about that... I was able to trim down 31% of file size just from removing comments though.

jspenguin2017 commented 6 years ago

It'll be !@pragma and only allowed for whitelisted filter lists, as I'm worried people can misuse it to hide filter rules or to track the user.

sebast889 commented 6 years ago

About saving bandwidth for filter lists is there a way to host extremely compressed zip/gz of filter lists and decompress on client/user side then use them? If there is a way to do this it could save a lot of bandwidth since filter lists are just text (at the cost of a bit more cpu from the user)

jspenguin2017 commented 6 years ago

You can zip it in any way browsers accept, although custom compressing logic is possible, it's unlikely to offer more than what browsers natively supports.

You can use this mapping to make your filters shorter, although it will break compatibility with uBO.

sebast889 commented 6 years ago

it's unlikely to offer more than what browsers natively supports

I'm not sure I understand this fully...what do browsers natively support?

I was thinking you could zip it server side using a compression algorithm like PPmd that works really well for compressing textfiles. User will download this .zip instead of the usual .txt filter list and Nano will decompress it on the computer and then use the decompressed .txt filter list. Is this possible?

jspenguin2017 commented 6 years ago

I think by text they mean more like English essay or ebook, for filter lists, I'm not sure how much better it will be compared to gzip. If we want to save bandwidth, differential update is more effective than better compression methods.

elypter commented 6 years ago

i wouldn't want to have too uncommon compression formats because if its too complicated for filterlist maintainers then they are not going to use it.

jspenguin2017 commented 6 years ago

I'm looking for a differential update (also called delta update) library, but unfortunately we might have to make our own for JavaScript (or customize based on existing tools). Many native solutions, but all of them aim at binary patching. I think we might want to look at something like this: https://github.com/kpdecker/jsdiff

It'll not be easy to enable for filter list maintainers, as you kind of need to process the difference on server side.

A cheap and easy solution is obviously removing comments, how are we going to process the source map is another question though... I think the filter viewer should be a reliable way for users to know what is the exact content of filter lists they installed, so I probably won't auto-load source map. I'll come up with a good plan.

sebast889 commented 6 years ago

I think the only way this will be possible is if someone (like you) mirror all the filter lists (example grab them every 1 hour) and do the processing on the lists and then use these as source for 3rd party filters. I don't think many if any maintainers will change their habits of serving just .txt filter lists

jspenguin2017 commented 6 years ago

The problem is a different compression algorithm will also require server side configuration, which is as hard to set up while being way less effective than delta update.

sebast889 commented 6 years ago

I was talking about differential updates too in my post (which I also believe is a better solution than good compression methods if it is possible to implement). Unless you think differential updates are possible to do without maintainers needing to do anything on their side?

jspenguin2017 commented 6 years ago

Even if I support a different compression algorithm, filter maintainers still have to set up the tools to process their filter so the new algorithm can be used. Since they need to set up tools anyway, they can very well set one up that does differential updates. Differential updates need to be processed, ideally by a server, but manual processing would also work.

sebast889 commented 6 years ago

But what I'm saying is I don't think filter maintainers will be willing to set up these server tools and process the differences. So it would need someone to mirror the filter lists and process them and use these processed lists as the filter list sources/URLs in Nano.

jspenguin2017 commented 6 years ago

@ameshkov Can we get your opinion on this? Do you think differential / delta updates for filter lists will save enough bandwidth to be worth the hassle?

jspenguin2017 commented 6 years ago

@sebast889 It's not exactly easy to do, I'll definitely experiment with it.

I'm checking the code of https://github.com/kpdecker/jsdiff I think it'll not be that hard to set up, although customization of the code might be required, as we want the patch to be applied to a string, not a file.

jspenguin2017 commented 6 years ago

If I'm implementing it in Nano, I'll host Nano filters, uBlock filters, malware domain list, and Peter Lowe’s list. When implemented, Nano will update those filters every day, the server will synchronize with source every 8 hours.

A high level (planned) implementation will be:

The server reads filter lists from their source 3 times a day, if something changed, store the difference and mark a version
Nano will send a POST request to the server asking for changes based on a marked version that it has (either in cache or in the extension package, whichever is more recent)
- If the version is already gone (due to being too old), the server will instruct Nano to read the filter from the source instead (this would only happen if the filter wasn't updated in the past week or something, which wouldn't cause bandwidth problem)
- If there is no change, the server will instruct Nano to mark the filter as up to date and move on
- Otherwise, the server will return all applicable patches and Nano will patch and re-parse its existing filters
Nano has fresh filter lists in its extension package, so chances are reaching out to the main host will never be needed

elypter commented 6 years ago

adtidy doesn't exist. i never heared of this site. it now redirects to adguard and the internet archive doesnt go back far enough

ameshkov commented 6 years ago

Hi @jspenguin2017

@ameshkov Can we get your opinion on this? Do you think differential / delta updates for filter lists will save enough bandwidth to be worth the hassle?

The problems with the current situation (let's take Easylist as an example).

Ad blockers download the whole file which size is ~3MB (or 800KB gzipped).
This means quite heavy bandwidth usage (for example, we have ~150TB/monthly).
This does not allow filters maintainers to set smaller "Expires" values as there is a direct proportion.
What we do in AdGuard (maintaining a filters index and incrementing versions) is not enough to solve that issue. Easylist alone is updated multiple times every day.

So yeah, delta updates definitely do worth the hassle, if implemented properly. The proper implementation should better be file-based. Even mere versions check (point 2 in your plan) if it involves any server-side logic will cause quite a heavy load -> the server will cost a lot. With a file-based implementation, this all can be hosted on an affordable server (or even on GH pages).

ameshkov commented 6 years ago

adtidy doesn't exist. i never heared of this site. it now redirects to adguard and the internet archive doesnt go back far enough

We re-host all the filters on filters.adtidy.org domain in order to ease their servers load, control the filters, implement filters versioning, and support hints.

jspenguin2017 commented 6 years ago

The proper implementation should better be file-based.

Right, good point. that'll be easier to set up for filter maintainers. So a couple Node.js scripts? Since we are going with file based implementation, we have to use 404 for "no more update". I'm not sure if that's less processing power than a hash table lookup though. Although that's probably the only way if we want it to work with GitHub's static server.

I'll be experimenting with it. We also need to get gorhill on the boat, he has a big declined sticker on the issue in his repo...

ameshkov commented 6 years ago

So use 404 for "no more update"?

I'd say it'd be easier to host an "index" file with a list of filters versions.

Regarding the diffs map, it could be something like this (just a draft, it's night time here and I can't think clearly).

filter.txt

! Version 1.0.123
! Diffsmap filter_diffs.map

filter_diffs.map

{
    "currentVersion": "1.0.123",
    "diffs": [
        { "1.0.123": "diffs/1_0_123.diff" },
        { "1.0.122": "diffs/1_0_122.diff" },
        { "1.0.121": "diffs/1_0_121.diff" },
        { "1.0.120": "diffs/1_0_120.diff" }
    ]
}

If the history is not deep enough to build an incremental update, you should fall back to the old approach. If the diffs map is not available or the structure is invalid, you should "forget" about it and re-download the filter.

upd: timestamp and checksum should be somewhere there too

jspenguin2017 commented 6 years ago

Yea, looks good. Although the overhead of an index file can be quite significant for smaller filters, I'm thinking of ways to reduce that...

ameshkov commented 6 years ago

Not that much compared to the savings. It eliminates additional requests, and every additional request is +0.5-1KB of data (headers) + up to 4-8KB on the SSL handshake.

jspenguin2017 commented 6 years ago

I'm making a NPM module, the default output structure is like so (draft):

filter.txt
filter-diff/
    meta.json
    checkpoint.txt
    1.patch
    2.patch
    3.patch

meta.json

{
    checkpoint: 35,
    latest: 37
}

filter.txt will be for adblockers that don't know how to do differential update. checkpoint.txt is a checkpoint file which is a full (but can be slightly outdated) filter. Applying too many diffs may not be efficient. When searching for update, the adblocker will read meta.json which will give instruction on what it should do. checkpoint is the version of the checkpoint, newest is the latest version, in the above example, the adblocker need to apply 1.patch then 2.patch on top of checkpoint.txt. 3.patch is left-over file that is not overwritten (don't generate more git history than required).

If there is a network error, wait 5 minutes then try again, for up to 5 times (if the server is overloaded, fallback too fast will just make things worse). The server returning something that looks like an HTML document counts as network error. if there are too many network errors or there is a parsing error, the adblocker will bail out and read filter.txt instead.

jspenguin2017 commented 6 years ago

One problem is when changing checkpoint, there will be a pretty big spike in traffic, might want to keep the last two checkpoints so adblockers that are almost up to date won't need to re-download a big file just because the checkpoint changed. If filter delivering servers are scaled down thank to differential update, a big spike can cause problems, although for GitHub (Pages), it shouldn't be a problem.

I guess optimizing for GitHub is different than optimizing for special servers that don't use git.

jspenguin2017 commented 6 years ago

I'm going to sleep now, I'll finish implementing the prototype tomorrow.

jspenguin2017 commented 6 years ago

It's up on NPM, I didn't really test it, should start although not guaranteed. https://www.npmjs.com/package/nano-sync https://github.com/NanoAdblocker/NanoSync

jspenguin2017 commented 6 years ago

I'm not sure if that's just for preventing adding duplicate filters or it's actually used when the first entry is down

After checking the code, Nano (and uBO) will indeed use the following URL in the event that the primary server is down, however, they will not reach to secondary server unless the primary server is down.

KonoromiHimaries commented 6 years ago

Nano (and uBO) will indeed use the following URL in the event that the primary server is down, however, they will not reach to secondary server unless the primary server is down.

mayby decentralization system is the best idea of this

jspenguin2017 commented 6 years ago

@KonoromiHimaries

mayby decentralization system is the best idea of this

I assume by decentralize you mean WebRTC, the problem with WebRTC is that there needs to be a central coordination server, and most routers block incoming traffic. It will also be easy for people to track who is using Nano. So that's not happening.

ameshkov commented 6 years ago

Just speculating, but: http://www.bittorrent.org/beps/bep_0005.html

jspenguin2017 commented 6 years ago

The protocol is based on Kademila and is implemented over UDP.

Unless we have low level control over the network socket, I'm not sure how are you going to use UDP... Let's not be too crazy here... Let's get differential update working first.

ameshkov commented 6 years ago

Sure, it just came to my mind when I've read about decentralization. What can be more decentralized than the bittorrent protocol + DHT:)

Back to the topic, it looks very nice. Just a couple of questions:

What's the point of having checkpoint.txt? Is it to compare with the new filter.txt contents?
Do you plan to store all the patches? What if a part of the patches go missing?
It'd be nice if the filter.txt somehow pointed out that it supports differential updates (and the current checkpoint number also). This will allow ad blockers to switch to the new scheme seamlessly.
How to force the full filter update from the server side?

The server returning something that looks like an HTML document counts as network error.

Checksums?

jspenguin2017 commented 6 years ago

checkpoint.txt is a checkpoint which may be older than filter.txt, patches are applied on top of checkpoint.txt and not filter.txt. Filter maintainers should edit filter.txt and use the tool to generate patches / checkpoints when appropriate.
No, currently 10 are stored, if something went wrong, filter.txt is the ultimate fallback. Once there are 10 patches, the next one will be a new checkpoint, where all existing patches are outdated and will be overwritten later. I designed it this way to generate least noise in Git history, I have another system in mind for file servers that doesn't use Git as creating a new checkpoint will generate a spike in traffic.
That will generate more noise in the patch file, and the maintainer will have to change the version instead of letting the tools to handle all that automatically. Given filter.txt to download, adblockers should try to read filter-diff/meta.json first, kind of like how Userscripts update works. If it is the first filter download and there is a network error, immediately fallback to filter.txt.
Create a new checkpoint.

The checksum library was just deleted from uBO... A checksum for expected filter after the patch is definitely something on my mind.

jspenguin2017 commented 6 years ago

If filter-diff/meta.json exists, the adblocker will download checkpoint.txt then patches to build the up-to-date (or very close to up-to-date) filter, as there is no guarantee that the last patch will build to filter.txt. Filter maintainers are encouraged to generate only 1~2 patches per day even if there are more changes.

If there is a parse error (filter-diff/meta.json has syntax error or it doesn't match the format), fallback to filter.txt.

Just like pre-processors, I'm looking at something small and incremental. I think something that fits the majority of volunteer filter maintainers is a good point to start. I think we should have a protocol_min_version in meta.json in case we decide to change things in a backward incompatible way.

jspenguin2017 commented 6 years ago

I wrote the filter minimization code for my server, although I still need to test and review it to make sure it won't crash my server. From my previous tests, the minimization can shrink normal filter lists by about 30%, still need to add the logic for hosts file, the localhost IP address isn't needed, I guess I can shrink it by a good 30% as well.

ameshkov commented 6 years ago

That will generate more noise in the patch file, and the maintainer will have to change the version instead of letting the tools to handle all that automatically. Given filter.txt to download, adblockers should try to read filter-diff/meta.json first, kind of like how Userscripts update works. If it is the first filter download and there is a network error, immediately fallback to filter.txt.

What I mean is a single meta data instruction added to the filter.txt, smth like Diffsmeta filters-diff/meta.json. It will also allow to change the default directory for the patches.

Otherwise, ad blockers will do unnecessary 404 requests.

I think we should have a protocol_min_version in meta.json in case we decide to change things in a backward incompatible way.

++

jspenguin2017 commented 6 years ago

@ameshkov The problem is if the instruction is inside the filter list, then the adblocker needs to download the filter list first, which is a full list that may not match any of the versions generated by the tools, so it will need to download the checkpoint file again, that is way more than a 404. Adblockers can store a flag for the 404 response so they won't keep looking for the meta file. I'm thinking something like 1 week "cooldown" between two requests.

ameshkov commented 6 years ago

@jspenguin2017

It needs to do it one time only, this is acceptable;
Most likely, it will download checkpoint anyway because the pre-compiled filters get too old rather soon;

jspenguin2017 commented 6 years ago

Yea... Sure.
filter.txt (the "pre-compiled" filter) is the file that filter maintainers edit, which is the most up-to-date version, I think minor fixes doesn't have to be built right away, so in my design, the patch file can be slightly out-of-date.

ameshkov commented 6 years ago

By "pre-compiled" I meant the filter.txt that's bundled with an ad blocker. In a few days that bundled filter gets too old and when user installs extension, it will anyway need to download the whole filter.

jspenguin2017 commented 6 years ago

My distribution pretty much always come with filters updated in the past week. But yea, chances are the checkpoint file still needs to be downloaded again. The filters can be marked as supporting differential update internally so filter.txt won't be downloaded for nothing. The tools will still generate output to filter-diff/ by default, but can be overridden if needed. I still need to document how overriding works... It internally supports that, just not documented. For the header, should it be ! Meta:, ! Diff:, or ! Diffmap:?

ameshkov commented 6 years ago

@jspenguin2017 tbh I'd prefer Diffmap, but it's not rly important

NanoAdblocker / NanoCore

Filter minimization and delta update: A bandwidth saving experiment #84