jspenguin2017 opened this issue 6 years ago
That's cool. Comments help when looking at filters, and at the same time I assume removing them saves precious RAM. About the hosting issue, I don't think this should be a huge problem; hosting is getting cheaper and cheaper, and these filter lists aren't big files either. Maybe these kinds of problems will all disappear once filterlists.org progresses. Features like combining and preprocessing lists are already planned, and a central place for hosting those filters can make hosting cheaper as well, because you do not need to rent a VPS for each individual file but just one that you can upgrade if needed. A usable VPS can cost as little as $10/year.
This will not save RAM; it will save hard drive / SSD space, but the 1 MB saved here will not help when your 128 GB SSD is running low.
> I don't think this should be a huge problem; hosting is getting cheaper and cheaper
That's exactly what I think, although it's the opposite of what gorhill thinks: he still has a 5-day update cycle for uBlock filters and he isn't happy that Nano is pulling them every 3 days instead. Well, it's his repository, so I'll respect his decision; if he asks me to pull uBlock filters elsewhere, I'll host them myself.
My VPS costs $5/month, with free 40GB/s instream, which makes it great at scraping websites and scanning for stuff across the Internet. Outstream isn't free though, so it's not a super good CDN, but if I minimize + gzip the filters it won't be a problem.
This problem could also be mitigated by adding a mirror directive, so a filter list can contain a list of mirrors that are used for updates if the primary one is down, or to allow multiple primary mirrors for load balancing.
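For illustration only, a hypothetical list header with such a directive might look like this (the `! Mirror:` field name is invented here and is not part of any existing syntax):

```
! Title: Example Filter List
! Mirror: https://mirror-a.example.com/list.txt
! Mirror: https://mirror-b.example.com/list.txt
||ads.example.com^
```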
`assets.json` does list multiple mirrors, although I'm not sure whether that's just for preventing duplicate filters from being added or whether it's actually used when the first entry is down. (I'll need to read more of the code.)
There is not a very big need for this right now. If it becomes necessary one day, I think it should be an option in the filter list file itself, so list maintainers can deliver new mirrors with every update.
I'll start with minimizing stuff, because that's a cheap way to reduce bandwidth usage. TBH if I can get differential update set up, I can have filters update twice a day and it'll still consume less bandwidth.
Before you try differential updates, I would search the web for a ready-made solution, because it is probably quite some work to do it properly.
There are differential update engines out there for sure. I'm not crazy enough to go make one myself, don't worry about that... I was able to trim file size by 31% just from removing comments, though.
It'll be `!@pragma` and only allowed for whitelisted filter lists, as I'm worried people could misuse it to hide filter rules or to track users.
About saving bandwidth for filter lists: is there a way to host an extremely compressed zip/gz of the filter lists and decompress it on the client/user side, then use them? If there is a way to do this, it could save a lot of bandwidth, since filter lists are just text (at the cost of a bit more CPU on the user's side).
You can compress it in any way browsers accept; custom compression logic is possible, but it's unlikely to offer much more than what browsers natively support.
You can use this mapping to make your filters shorter, although it will break compatibility with uBO.
> it's unlikely to offer much more than what browsers natively support
I'm not sure I understand this fully... what do browsers natively support?

I was thinking you could compress it server-side using an algorithm like PPMd, which works really well for compressing text files. The user would download this .zip instead of the usual .txt filter list, and Nano would decompress it on their computer and then use the decompressed .txt filter list. Is this possible?
I think by "text" they mean something more like an English essay or ebook; for filter lists, I'm not sure how much better it would be compared to gzip. If we want to save bandwidth, differential updates are more effective than better compression methods.
I wouldn't want too-uncommon compression formats, because if it's too complicated for filter list maintainers, they are not going to use it.
I'm looking for a differential update (also called delta update) library, but unfortunately we might have to make our own for JavaScript (or customize existing tools). There are many native solutions, but all of them aim at binary patching. I think we might want to look at something like this: https://github.com/kpdecker/jsdiff
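For reference, a minimal sketch of producing and applying a text patch with the `diff` NPM package (jsdiff); the file name and the two list strings are made up for illustration:

```js
// Minimal sketch using the "diff" NPM package (jsdiff).
const Diff = require("diff");

const oldList = "||ads.example.com^\n||tracker.example.net^\n";
const newList = "||ads.example.com^\n||tracker.example.org^\n";

// Create a unified-diff style patch between two versions of a filter list.
const patch = Diff.createPatch("filter.txt", oldList, newList);

// Apply the patch to the old string; returns the patched string,
// or false if the patch does not apply cleanly.
const patched = Diff.applyPatch(oldList, patch);
console.log(patched === newList); // true
```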
It won't be easy to enable for filter list maintainers, as you kind of need to process the differences on the server side.
A cheap and easy solution is obviously removing comments; how we are going to handle the source map is another question though... I think the filter viewer should be a reliable way for users to know the exact content of the filter lists they installed, so I probably won't auto-load source maps. I'll come up with a good plan.
I think the only way this will be possible is if someone (like you) mirrors all the filter lists (for example, grabs them every hour), does the processing on the lists, and then uses those as the sources for 3rd-party filters. I don't think many, if any, maintainers will change their habit of serving just .txt filter lists.
The problem is that a different compression algorithm will also require server-side configuration, which is just as hard to set up while being far less effective than delta updates.
I was talking about differential updates too in my post (which I also believe is a better solution than good compression methods if it is possible to implement). Unless you think differential updates are possible to do without maintainers needing to do anything on their side?
Even if I support a different compression algorithm, filter maintainers still have to set up the tools to process their filter so the new algorithm can be used. Since they need to set up tools anyway, they can very well set one up that does differential updates. Differential updates need to be processed, ideally by a server, but manual processing would also work.
But what I'm saying is I don't think filter maintainers will be willing to set up these server tools and process the differences. So it would need someone to mirror the filter lists and process them and use these processed lists as the filter list sources/URLs in Nano.
@ameshkov Can we get your opinion on this? Do you think differential / delta updates for filter lists will save enough bandwidth to be worth the hassle?
@sebast889 It's not exactly easy to do, I'll definitely experiment with it.
I'm checking the code of https://github.com/kpdecker/jsdiff. I don't think it'll be that hard to set up, although some customization of the code might be required, as we want the patch to be applied to a string, not a file.
If I implement it in Nano, I'll host Nano filters, uBlock filters, the malware domain list, and Peter Lowe's list. When implemented, Nano will update those filters every day, and the server will synchronize with the sources every 8 hours.
A high-level (planned) implementation will be:
adtidy doesn't exist; I never heard of this site. It now redirects to AdGuard, and the Internet Archive doesn't go back far enough.
Hi @jspenguin2017
> @ameshkov Can we get your opinion on this? Do you think differential / delta updates for filter lists will save enough bandwidth to be worth the hassle?
The problems with the current situation (let's take EasyList as an example):
So yeah, delta updates are definitely worth the hassle, if implemented properly. The proper implementation should really be file-based. Even a mere version check (point 2 in your plan), if it involves any server-side logic, will cause quite a heavy load, so the server will cost a lot. With a file-based implementation, this can all be hosted on an affordable server (or even on GH Pages).
> adtidy doesn't exist; I never heard of this site. It now redirects to AdGuard, and the Internet Archive doesn't go back far enough.
We re-host all the filters on the filters.adtidy.org domain in order to ease the load on their servers, control the filters, implement filter versioning, and support hints.
> The proper implementation should really be file-based.
Right, good point. That'll be easier to set up for filter maintainers. So a couple of Node.js scripts? Since we are going with a file-based implementation, we have to use 404 for "no more updates". I'm not sure that takes less processing power than a hash table lookup, though. Although that's probably the only way if we want it to work with GitHub's static server.
I'll be experimenting with it. We also need to get gorhill on board; he has a big "declined" label on the issue in his repo...
> So use 404 for "no more updates"?
I'd say it'd be easier to host an "index" file with a list of filter versions.
Regarding the diffs map, it could be something like this (just a draft, it's night time here and I can't think clearly).
`filter.txt`:

```
! Version 1.0.123
! Diffsmap filter_diffs.map
```

`filter_diffs.map`:

```json
{
  "currentVersion": "1.0.123",
  "diffs": [
    { "1.0.123": "diffs/1_0_123.diff" },
    { "1.0.122": "diffs/1_0_122.diff" },
    { "1.0.121": "diffs/1_0_121.diff" },
    { "1.0.120": "diffs/1_0_120.diff" }
  ]
}
```
If the history is not deep enough to build an incremental update, you should fall back to the old approach. If the diffs map is not available or the structure is invalid, you should "forget" about it and re-download the filter.
Upd: a timestamp and checksum should be in there somewhere too.
Yea, looks good. Although the overhead of an index file can be quite significant for smaller filters, I'm thinking of ways to reduce that...
Not that much compared to the savings. It eliminates additional requests, and every additional request is +0.5-1KB of data (headers) + up to 4-8KB on the SSL handshake.
I'm making an NPM module; the default output structure is like so (draft):
```
filter.txt
filter-diff/
    meta.json
    checkpoint.txt
    1.patch
    2.patch
    3.patch
```

`meta.json`:

```
{
    checkpoint: 35,
    latest: 37
}
```
`filter.txt` will be for ad blockers that don't know how to do differential updates. `checkpoint.txt` is a checkpoint file, which is a full (but possibly slightly outdated) filter; applying too many diffs may not be efficient.

When checking for updates, the ad blocker will read `meta.json`, which gives instructions on what it should do. `checkpoint` is the version of the checkpoint and `latest` is the latest version; in the above example, the ad blocker needs to apply `1.patch` then `2.patch` on top of `checkpoint.txt`. `3.patch` is a left-over file that has not yet been overwritten (don't generate more Git history than required).

If there is a network error, wait 5 minutes then try again, for up to 5 times (if the server is overloaded, falling back too fast will just make things worse). The server returning something that looks like an HTML document counts as a network error. If there are too many network errors or there is a parsing error, the ad blocker will bail out and read `filter.txt` instead.
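A rough sketch of that client-side flow, assuming the layout above and jsdiff for patching; the helper names are hypothetical and the 5-minute retry logic is omitted:

```js
// Rough sketch of the client-side update flow described above (hypothetical helpers).
const Diff = require("diff");

async function fetchText(url) {
    const res = await fetch(url);
    const text = await res.text();
    // Treat anything that looks like an HTML document as a network error.
    if (!res.ok || /^\s*</.test(text)) {
        throw new Error("Network error for " + url);
    }
    return text;
}

async function updateFilter(baseUrl) {
    try {
        const meta = JSON.parse(await fetchText(baseUrl + "filter-diff/meta.json"));
        let filter = await fetchText(baseUrl + "filter-diff/checkpoint.txt");
        // Apply every patch newer than the checkpoint, in order.
        for (let version = meta.checkpoint + 1; version <= meta.latest; version++) {
            const patchNumber = version - meta.checkpoint; // 1.patch, 2.patch, ...
            const patch = await fetchText(baseUrl + "filter-diff/" + patchNumber + ".patch");
            const next = Diff.applyPatch(filter, patch);
            if (next === false) throw new Error("Patch did not apply");
            filter = next;
        }
        return filter;
    } catch (err) {
        // Too many errors or a parse error: bail out and read the full list instead.
        return await fetchText(baseUrl + "filter.txt");
    }
}
```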
One problem is that when the checkpoint changes, there will be a pretty big spike in traffic. We might want to keep the last two checkpoints so ad blockers that are almost up to date won't need to re-download a big file just because the checkpoint changed. If filter-delivering servers are scaled down thanks to differential updates, a big spike can cause problems, although for GitHub (Pages) it shouldn't be a problem.

I guess optimizing for GitHub is different from optimizing for special servers that don't use Git.
I'm going to sleep now, I'll finish implementing the prototype tomorrow.
It's up on NPM. I didn't really test it; it should start, although that's not guaranteed. https://www.npmjs.com/package/nano-sync https://github.com/NanoAdblocker/NanoSync
> I'm not sure whether that's just for preventing duplicate filters from being added or whether it's actually used when the first entry is down
After checking the code: Nano (and uBO) will indeed use the next URL in the list in the event that the primary server is down; however, they will not reach the secondary server unless the primary server is down.
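For reference, a hedged illustration of the kind of entry `assets.json` uses; the entry name and URLs are made up, and the field names are from memory of uBO's format, so treat them as an approximation:

```json
"example-list": {
    "content": "filters",
    "group": "ads",
    "title": "Example List",
    "contentURL": [
        "https://primary.example.com/list.txt",
        "https://mirror.example.com/list.txt"
    ]
}
```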
> Nano (and uBO) will indeed use the next URL in the list in the event that the primary server is down; however, they will not reach the secondary server unless the primary server is down.
Maybe a decentralized system is the best idea for this.
@KonoromiHimaries
> Maybe a decentralized system is the best idea for this.
I assume by decentralization you mean WebRTC. The problem with WebRTC is that there needs to be a central coordination server, and most routers block incoming traffic. It would also be easy for people to track who is using Nano. So that's not happening.
Just speculating, but: http://www.bittorrent.org/beps/bep_0005.html
> The protocol is based on Kademlia and is implemented over UDP.
Unless we have low-level control over the network socket, I'm not sure how you are going to use UDP... Let's not be too crazy here... Let's get differential updates working first.
Sure, it just came to my mind when I read about decentralization. What could be more decentralized than the BitTorrent protocol + DHT? :)
Back to the topic, it looks very nice. Just a couple of questions:
1. What is the purpose of `checkpoint.txt`? Is it to compare with the new `filter.txt` contents?
2. Maybe `filter.txt` should somehow point out that it supports differential updates (and the current checkpoint number also). This would allow ad blockers to switch to the new scheme seamlessly.
3. > The server returning something that looks like an HTML document counts as a network error.

   Checksums?
`checkpoint.txt` is a checkpoint which may be older than `filter.txt`; patches are applied on top of `checkpoint.txt` and not `filter.txt`. Filter maintainers should edit `filter.txt` and use the tool to generate patches / checkpoints when appropriate. `filter.txt` is the ultimate fallback. Once there are 10 patches, the next one will be a new checkpoint, at which point all existing patches are outdated and will be overwritten later. I designed it this way to generate the least noise in Git history; I have another system in mind for file servers that don't use Git, as creating a new checkpoint will generate a spike in traffic.

Given `filter.txt` to download, ad blockers should try to read `filter-diff/meta.json` first, kind of like how userscript updates work. If it is the first filter download and there is a network error, immediately fall back to `filter.txt`. The checksum library was just deleted from uBO... A checksum for the expected filter after the patch is definitely something on my mind.

If `filter-diff/meta.json` exists, the ad blocker will download `checkpoint.txt` then the patches to build the up-to-date (or very close to up-to-date) filter, as there is no guarantee that the last patch will build to `filter.txt`. Filter maintainers are encouraged to generate only 1~2 patches per day even if there are more changes.

If there is a parse error (`filter-diff/meta.json` has a syntax error or doesn't match the format), fall back to `filter.txt`.
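A hypothetical sketch of the maintainer-side rotation rule described above (not the actual NanoSync code): build one more patch against the previously published version, and start a fresh checkpoint once 10 patches have piled up.

```js
// Hypothetical maintainer-side sketch; not the actual NanoSync implementation.
const fs = require("fs");
const path = require("path");
const Diff = require("diff");

const MAX_PATCHES = 10; // once 10 patches exist, the next build becomes a new checkpoint

// Reconstruct the previously published version: checkpoint plus existing patches.
function rebuildPrevious(diffDir, meta) {
    let text = fs.readFileSync(path.join(diffDir, "checkpoint.txt"), "utf8");
    for (let i = 1; i <= meta.latest - meta.checkpoint; i++) {
        const patch = fs.readFileSync(path.join(diffDir, i + ".patch"), "utf8");
        text = Diff.applyPatch(text, patch);
    }
    return text;
}

function publish(dir) {
    const filter = fs.readFileSync(path.join(dir, "filter.txt"), "utf8");
    const diffDir = path.join(dir, "filter-diff");
    const meta = JSON.parse(fs.readFileSync(path.join(diffDir, "meta.json"), "utf8"));

    if (meta.latest - meta.checkpoint >= MAX_PATCHES) {
        // Rotate: the new build becomes a fresh checkpoint; stale patches are
        // left in place and overwritten by later builds (less Git noise).
        meta.checkpoint = meta.latest + 1;
        meta.latest = meta.checkpoint;
        fs.writeFileSync(path.join(diffDir, "checkpoint.txt"), filter);
    } else {
        // Normal case: emit one more patch on top of the previous build.
        const previous = rebuildPrevious(diffDir, meta);
        meta.latest += 1;
        const patch = Diff.createPatch("filter.txt", previous, filter);
        fs.writeFileSync(path.join(diffDir, (meta.latest - meta.checkpoint) + ".patch"), patch);
    }
    fs.writeFileSync(path.join(diffDir, "meta.json"), JSON.stringify(meta, null, 2));
}
```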
Just like with pre-processors, I'm looking at something small and incremental. I think something that fits the majority of volunteer filter maintainers is a good place to start.
I think we should have a `protocol_min_version` in `meta.json` in case we decide to change things in a backward-incompatible way.
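For instance, the draft `meta.json` from earlier could carry that field like this (purely illustrative):

```json
{
    "protocol_min_version": 1,
    "checkpoint": 35,
    "latest": 37
}
```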
I wrote the filter minimization code for my server, although I still need to test and review it to make sure it won't crash the server. From my previous tests, minimization can shrink normal filter lists by about 30%. I still need to add the logic for hosts files; the localhost IP address isn't needed there, so I guess I can shrink those by a good 30% as well.
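A minimal sketch of what that minimization could look like (my own illustration, not the actual server code): drop blank and comment lines, and strip the redundant localhost IP prefix from hosts-style lists.

```js
// Illustrative sketch only; not the actual Nano server code.

// Adblock-syntax lists: comments start with "!".
function minimizeFilterList(text) {
    return text
        .split("\n")
        .map((line) => line.trim())
        .filter((line) => line.length > 0 && !line.startsWith("!"))
        .join("\n");
}

// Hosts files: comments start with "#", and the localhost IP prefix
// carries no information for an ad blocker.
function minimizeHostsFile(text) {
    return text
        .split("\n")
        .map((line) => line.trim())
        .filter((line) => line.length > 0 && !line.startsWith("#"))
        .map((line) => line.replace(/^(?:127\.0\.0\.1|0\.0\.0\.0)\s+/, ""))
        .join("\n");
}
```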
That will generate more noise in the patch file, and the maintainer will have to change the version instead of letting the tools handle all that automatically.

> Given `filter.txt` to download, ad blockers should try to read `filter-diff/meta.json` first, kind of like how userscript updates work. If it is the first filter download and there is a network error, immediately fall back to `filter.txt`.
What I mean is a single metadata instruction added to `filter.txt`, something like `Diffsmeta filters-diff/meta.json`. It would also allow changing the default directory for the patches.
Otherwise, ad blockers will do unnecessary 404 requests.
> I think we should have a `protocol_min_version` in `meta.json` in case we decide to change things in a backward-incompatible way.
++
@ameshkov The problem is that if the instruction is inside the filter list, then the ad blocker needs to download the filter list first, which is a full list that may not match any of the versions generated by the tools, so it will need to download the checkpoint file again; that is way more than a 404. Ad blockers can store a flag for the 404 response so they won't keep looking for the meta file. I'm thinking of something like a 1-week "cooldown" between two requests.
@jspenguin2017

> `filter.txt` (the "pre-compiled" filter) is the file that filter maintainers edit, which is the most up-to-date version. I think minor fixes don't have to be built right away, so in my design, the patch file can be slightly out-of-date.

By "pre-compiled" I meant the `filter.txt` that's bundled with an ad blocker. In a few days that bundled filter gets too old, and when a user installs the extension, it will need to download the whole filter anyway.
My distribution pretty much always comes with filters updated within the past week. But yeah, chances are the checkpoint file still needs to be downloaded again. The filters can be marked internally as supporting differential updates, so `filter.txt` won't be downloaded for nothing.

The tools will still generate output to `filter-diff/` by default, but that can be overridden if needed. I still need to document how overriding works... It's supported internally, just not documented.
For the header, should it be `! Meta:`, `! Diff:`, or `! Diffmap:`?
@jspenguin2017 tbh I'd prefer `Diffmap`, but it's not really important.
Most users don't need to see the content of filter lists, so there is little reason to keep comments in the distributed file. However, this can make development difficult, so Nano will support a pre-processor directive that points a compressed filter to its development version:

`!@pragma nano-srcmap=<address>`

When the filter viewer sees this statement, it will try to load the source map of the filter; if successful, the source map is cached and displayed instead.

https://github.com/gorhill/uBlock/issues/3406

Discussions about delta updates are below.