AdguardTeam / FiltersCompiler

A tool that compiles & validates filters
GNU Lesser General Public License v3.0
52 stars 12 forks source link

Add differential updates capabilities #192

Closed ameshkov closed 9 months ago

ameshkov commented 1 year ago

We would like to add support for differential updates to the filter lists, but for that the compiler should be extended pretty substantially.

Having differential updates will allow ad blockers download filter lists updates much more frequently.

Here's the idea:

  1. When it is building a filter list, the platform directory and its contents may already exist. Let's take a simple case, when we're running there's already this directory structure:

    platforms/android/filters/1.txt
    platforms/android/filters/1_optimized.txt
  2. If this is the case, here's what it does:

    1. Retrieves the current filter list version from the filter file.
    2. Calculates the diff between the old state of 1.txt and the new state of 1.txt. We should use standard diff format for that, not unified, no context: https://en.wikipedia.org/wiki/Diff
    3. The diff should be stored in separate files:
      platforms/android/filters/1.diff.json
      platforms/android/filters/1_optimized.diff.json

    The diff.json file format:

    {
        "diffs": [
            {
                "old": "1.0.0.0",
                "new": "1.0.0.1",
                "patch": "Here goes the diff"
            },
            {
                "old": "1.0.0.1",
                "new": "1.0.0.2",
                "patch": "Here goes the diff"
            }
        ]
    }

    Note, that the diffs array is sorted (older versions first).

  3. If the diff is larger than X bytes, we remove all diffs altogether. The diff.json will be empty: {} or { diffs: [] } or just an empty file.

  4. Ideally, I would like to make this mechanism available to all filter lists, not just the ones made by AdGuard. In order to do that we should allow indicating diff file URL in the filter list's metadata.

    We need two new metadata fields for that:

    1. Diff-URL: https://xxxxx.com/filter.diff.json -- the address of the file with filter list diffs.
    2. Diff-Expires: 1 hour -- expiration time of the filter list when differential updates are available. Note, that Expires continues to work as it was working before, i.e. once in a while AdGuard will do the so-called "full sync". However, we can and should increase it when differential updates are available.
  5. Also, I'd like us to provide a separate tool that builds diffs instead of having this all inside the filters compiler (which is very AdGuard-specific).

    Here's how it should work:

    ./filters-diff-builder
        --new-list-path="./filter.txt"
        --prev-list-path="./filter.old.txt"
        --diff-expires="1 hour"
        --diff-url="https://xxxxx.com/filter.diff.json"
        --diff-path="./filter.diff.json"
        --max-diff-size=200000
        --max-diff-history=30
    • new-list-path - path to the new version of the filter list. The list will be modified in result, Diff-URL and Diff-Expires will be added (or replaced if they're already there).
    • prev-list-path - path to the previous version of the filter list.
    • diff-expires - value of Diff-Expires to be used.
    • diff-url - URL where the diff.json file will be published.
    • max-diff-size - maximum size of a diff file in bytes. If the resulting diff.json is larger than the specified value, it becomes empty effectively disabling differential updates.
    • max-diff-history - maximum number of diffs stored in the diff.json file.

Changes to how filter lists updates are checked

  1. First of all, differential updates are only used when all three Version, Diff-URL and Diff-Expires are specified in the filter list metadata.
  2. It requires the ad blocker to store the original filter list text.
  3. Every Diff-Expires period ad blocker should download the diff.json and then apply the diffs.

How the diffs are applied

First of all, any error while applying a diff should disable differential updates. In this case the ad blocker should back off and wait until it's time for the full sync.

The algorithm should be the following:

  1. Find the current version in the diffs file. If the version is missing, count it as an error and stop differential updates.
  2. Incrementally apply every patch in the diff.json file starting from that version. While applying the patches, check that the "old version -> new version" chain is continuous and have no gaps.
  3. Verify the resulting filter list:

    1. Check that Version: value is the same as the last one in the diff.json.
    2. If there's a Checksum: field in the filter list, check that it is a correct checksum for the list contents.

When to do full sync

  1. If the user manually triggers filters update check, do the full sync.
  2. Generally, follow the Expires: field.

Remarks and Q&A

  1. Why standard diff and not something more "compact"? This is actually a very good question. Diff is not ideal since it stores the "removed" strings as well. On the other hand, diff has great advantages: everyone knows what it is, it is human-readable and there're plenty of tools that understand it. I suggest implementing an MVP that uses diff and see how large the resulting diff files will be and then decide.
  2. What are the recommended values for max-diff-history? I think that it depends on how often the list is updated. For instance, AdGuard Base filter is updated every hour and keeping last 30 diffs should be good enough so that people that use their computer on a daily basis could continue to receive updates every hour.
  3. What kind of traffic economy we're talking about? Actually, not that much if any. I did some preliminary testing with FiltersRegistry and it seems that the size of a 2-day diff for Base filter is usually under 100kb. The whole list size is 5MB. Currently, we download it once in 4 days so this is about 40MB per month. If we switch to differential updates with Diff-Expires: 1 hour + increase the full sync period to 10 days then in the worst case scenario we'll be downloading about 90MB per month. However, this is very unlikely since computers don't usually work 24*7 so I'd assume we can safely divide it by 2. Anyways, I suggest first implementing the filters-diff-builder and see what real numbers we will be getting and then decide on how to proceed.
ameshkov commented 1 year ago

@gorhill are you on board with this? If you are, any remarks? Would you like to change/improve the spec?

piquark6046 commented 1 year ago

I think that it can distribute a filters list very quickly with low traffic if the size of complied filters list is huge.

ameshkov commented 1 year ago

@piquark6046 yeah, that's exactly the motivation for this change.

piquark6046 commented 1 year ago

By the way, what is a cryptographic hash function Checksum used for the Checksum metadata?

ameshkov commented 1 year ago

@piquark6046 it's something very old:

    /**
     * Calculates checksum
     * See:
     * https://adblockplus.org/en/filters#special-comments
     * https://hg.adblockplus.org/adblockplus/file/tip/addChecksum.py
     *
     * @param header Header lines
     * @param rules  Rules lines
     */
    const calculateChecksum = function (header, rules) {
        let content = header.concat(rules).join('\n');
        content = normalizeData(content);
        const checksum = crypto.createHash('md5').update(content).digest('base64');

        return `! Checksum: ${stripEnd(checksum.trim(), '=')}`;
    };
gwarser commented 1 year ago

There will be no issue with request count?

piquark6046 commented 1 year ago

I don't have expertise in information security enough to claim. But, I suggest that the differential updates capabilities has to use SHA-256 or higher digests because MD5 hash function has its hash collision^1.

ameshkov commented 1 year ago

@gwarser good question, it depends on the CDN that's used. Generally, bandwidth is more important than the number of requests unless the CDN is too greedy. On the other hand, if we want to achieve faster filter updates, I don't see how we can avoid a huge number of requests.

ameshkov commented 1 year ago

@piquark6046 the current checksum is legacy from the old ABP times, any change to the algorithm now will break backwards compatibility so it'll require a new field. Also, tbh, MD5 is an good&old standard way of calculating checksums and it's good enough to prevent list corruption.

PS: it's not mandatory to the spec, just mentioned because some people may be using it.

DandelionSprout commented 1 year ago

The way I understand this, without a whole lot of experience with filters-diff-builder, it sounds like delta updates, which I do have experience with. I approve of this concept as it has been explained to me, but the OP makes it sound like the delta update process takes place within FiltersCompiler after it has already imported the lists from their sources.

If my quarter-way wild guess on that was to be correct, the process could be carried over to ABP if not for their quasi-'feature freeze', but adapting it in uBO would require some advanced work.

gorhill commented 1 year ago

are you on board with this?

Something that requires thoughts, so I will have to think about it while I will keep following the discussion.

ameshkov commented 1 year ago

but the OP makes it sound like the delta update process takes place within FiltersCompiler

It's actually unrelated, just a repo to open the issue, we'll create a new one if everything is okay with the spec.

filters-diff-builder is a new tool that will build diffs. The input for this tool is two filter lists versions: the previous and the new one.

gwarser commented 1 year ago

Did you know if ABP implemented this in some form? https://issues.adblockplus.org/ticket/6833/ ?

ameshkov commented 1 year ago

I am not aware of ABP implementation, I wonder is there any person from eyeo we can tag on GH and ask?

gorhill commented 1 year ago

See We are planning to implement a diffing mechanism that will work for any type of filter list.

ameshkov commented 1 year ago

Seems to be implemented for MV3 list versions.

Here's a list with diff updates support: https://easylist-downloads.adblockplus.org/v3/full/easylist.txt Diff: https://easylist-downloads.adblockplus.org/v3/diff/easylist/740f51eefafe121afbffdb2be4ef2f97b26cb6c3.json

I am not fully sure about that, but it seems to be a very MV3-specific implementation that does not handle our case. Static ruleset is the same until the extension is updated so the extension simply downloads this 740f51eefafe121afbffdb2be4ef2f97b26cb6c3.json every time and it returns a diff from the static ruleset to the latest state of the list.

gorhill commented 1 year ago

Ok here is how I see your proposal from the point of view of what is currently there in uAssetsCDN:

Retrieves the current filter list version from the filter file.

This means having to add a ! Version: filed in each list. Not an issue, update commits are now tagged and this would be the version for all lists -- the new ! Version: field would be updated only for lists which content has changed.

Calculates the diff between the old state of 1.txt and the new state of 1.txt. We should use standard diff format for that, not unified, no context: https://en.wikipedia.org/wiki/Diff

Ok.

I tried git diff --unified=0 2023.10.18.97 2023.10.18.408 > /tmp/diff.patch and I got a patch of 49K in size.

For git diff --unified=0 2023.10.18.408 2023.10.18.777 > /tmp/diff.patch and I got a patch of 8.4K in size.

For git diff --unified=0 2023.10.18.408 2023.10.17.1128 > /tmp/diff.patch, I get 54K

The diff should be stored in separate files:

Since a client would not know which files changed, this means to try and fetch a diff file for every single asset.

From the point of view of uAssetsCDN, I am not sure about this given that all changes are committed in batch every few hours, currently every six hours, so it might be more beneficial to have (version => file => diff) in a single file to fetch. The size of the file would grow faster though.

An advantage of batch diffing is that it's beneficial in case of commits affecting many files, which is something occurring by design with uAssetsCDN.


The size of the file would grow faster though.

Probably best to have the "patch" be a file path to a patch rather than the patch itself with batch update -- so maybe it's best for uBO to have its own custom solution.

ameshkov commented 1 year ago

so it might be more beneficial to have (version => file => diff) in a single file to fetch

The spec can be extended to cover the possibility of a unified diff as well.

The unified diff.json could look like this:

{
    "easylist": {
        diffs: [....],
    },
    "ublock-unbreak": {
        diffs: [....],
    }
}

This would require adding a new meta field FilterId: to the list and allow multiple filter lists have the same Diff-URL.

An advantage of batch diffing is that it's beneficial in case of commits affecting many files, which is something occurring by design with uAssetsCDN.

With several files there's a little bit higher risk of a race condition, i.e. when the files on the CDN are changed when fetching diffs are in progress. On the other hand, this risk is still there when you're doing the full sync and fetch full lists one by one. The only chance to completely avoid it is to unite everything into a single file.

Probably best to have the "patch" be a file path to a patch rather than the patch itself with batch update -- so maybe it's best for uBO to have its own custom solution.

The point of the proposal is to allow all list maintainers to provide diff-updatable subscriptions. If we all implement custom solutions we only solve this problem for ourselves, but leave other lists without this capability. I am all for incorporating changes to the original proposal if this helps achieve a universal solution.

Regarding having paths instead of diffs, this indeed makes sense and allows to save lots of bandwidth. Initially, I have not added it to the proposal to keep it as simple as possible. On the other hand, taking another look at this all, it does not look too complex to me.

gorhill commented 1 year ago

I've given more thoughts on this, again from how things are currently done in uBO.

First an observation which I guess might apply to AdGuard too: it's is not possible to apply diff patch to list content which is expanded client-side, i.e. the !#include directive. In uBO, there are expanded client-side, then the final result is cached locally. This can't be diff-patched with a diff computed on the server, the server just doesn't know what is the end result client-side.

So this means that diff-patching can only be done for lists which content is not further modified client-side. This is do-able in uBO for select key lists where diff-patching would be more important than avoiding to expand an !#include directive.

Now regarding batch-commit currently being used in uAssetsCDN, I don't see the need for some central file describing patches etc. The way I see it, a simple field in the header of the file like Commit version: would be enough for a diff-patcher to know what to do next: just get the diff-patch from the server according to the content of the Commit version: field.

For example, if the commit-version field is 2023.10.20.407, then fetch patches/2023.10.20.407.patch from the server. If the patch does not exist (404 or empty file, whichever is best to spare the server), nothing to patch. If it exists, find the changes in the patch which applies to the list. Repeat for other assets (while keeping a copy of the patch for other lists which might need it).

Once done, repeat the process -- as a result of applying patches/2023.10.20.407.patch, the Commit version: field will have been updated, which mean we can repeat the above with the new commit version, until no more diff-patching required for any candidate lists.

This way the only requirement is to have a Commit version: field for lists which are candidate for diff-patching, no need for maintaining a file to be the authority to instruct which patches need to apply to which files.

To make things even fancier with this approach is to keep a short list of last commit versions applied. You may have noticed the commit version currently in use at uAssetsCDN are time-based, and given this the client code could infer the pace of scheduling diff-based update according to the average time difference between the latest commits, so essentially this allows to control server-side how often clients run diff-based updates.

Also, pruning a growing list of patches in patches/ would be simply a matter of deleting all .patch older than a specific time, or delete .patch except for the most recent n ones, or a combination of both, or any other heuristic. In the end, if the client requests a patch which does not exist, it's a noop, and the regular current update cycle will eventually force an update of a list.

So unless there is a flaw in my thinking I am not seeing, this is the approach I want for uBO, and I consider it to be specific, because I would know exactly where to fit this in uBO's code with minimal changes and I feel having to use an external framework would just make things more difficult on my side, i.e. take longer to reach the goal and require modifying core code path in uBO, which I rather avoid.

ameshkov commented 1 year ago

@gorhill

First an observation which I guess might apply to AdGuard too: it's is not possible to apply diff patch to list content which is expanded client-side, i.e. the !#include directive

Yeah, it does apply indeed. The way I see it a generic solution requires expanding #include after the patch is applied, i.e. requires storing the original non-expanded list text. It's doable, but not an ideal thing to do.

Regarding the recursive approach with the Commit version: field, it does make sense to me. It should be rather easy to maintain even for an individual filter list author. The only thing that I don't understand is how do you want to make it work with batch patches that you mentioned earlier? Aren't there edge cases that would need to be handled client-side?

I'll try to amend the spec to make it more like this recursive approach that you're going to use for uAssetsCDN so that if at some point you decide to support it, it was easier to do.

Just one more thing that I wanted to propose. Since these metadata fields may influence how different ad blockers process filter lists, I suggest using platform prefix for fields that are not universally supported. Something like uBO Commit version, ADG Diff-URL.

gorhill commented 1 year ago

how do you want to make it work with batch patches that you mentioned earlier?

Concretely,

Let's use ! Diff-patch version: instead of ! Commit version:.

The Diff-patch version value is kept in meta data describing an asset, let's name it diffpatchVersion for the sake of conversation.

For example in uBO there is already metadata associated with a cached asset: last time it was read from cache, time written to cache, "age" of the resource, etc. So diffpatchVersion is just a new value to keep track of for any given cached asset. Though I suppose if someone doesn't want to keep track of metadata structure, it simply becomes a matter of loading the whole cached content of the asset, then extract the Diff-patch version field whenever needed.

Then a diff-update cycle becomes a matter of iterating through the metadata of all cached assets. If an asset does not have diffpatchVersion value then skip over the asset.

If one does have such value, launch a diff-update operation, i.e. fetch the diff-patch as patches/${diffpatchVersion}.patch. If the fetch fails then skip over the asset.

Then parse and apply the patch -- that will require the most work to first write code to extract the specific subpatch which apply to the asset being patched, then code to apply the patch itself -- this is where a library makes sense. Keep fetched diff-patches around in case they are needed again for other assets during the current cycle in order to avoid fetching from server again.

For the case of one diff-patch per asset, then the same code above would work, except of course that ! Diff-patch version: would also contain the name of the asset, ! Diff-patch version: 2023.10.20.407.filters_filters.min.txt, so if you want to spec this, it should be that a client needs to be prepared that a diff-patch downloaded from a server may contain subpatches for many assets (again, code suitable for a library).

As you said originally, using standard unified=0 diff makes all this easy to spec and implement server-side, it just comes down to library code which could be used by AdGuard and uBO to apply a patch given a (diff-patch, resource name, text), where diff-patch would be standard unified=0 diff like:

Example diff ```diff diff --git a/filters/filters.min.txt b/filters/filters.min.txt index c79dca64..86639f9d 100644 --- a/filters/filters.min.txt +++ b/filters/filters.min.txt @@ -4 +4 @@ -! Last modified: Tue, 17 Oct 2023 19:43:41 +0000 +! Last modified: Wed, 18 Oct 2023 05:46:20 +0000 @@ -27,2 +26,0 @@ youtube.com##ytd-video-masthead-ad-advertiser-info-renderer,ytm-promoted-sparkle -youtube.com#@#+js(json-prune-fetch-response, [].playerResponse.adPlacements [].playerResponse.playerAds [].playerResponse.adSlots playerResponse.adPlacements playerResponse.playerAds playerResponse.adSlots adPlacements playerAds adSlots, , propsToMatch, url:player?key= method:/post/i) -youtube.com#@#+js(json-prune-fetch-response, [].playerResponse.adPlacements [].playerResponse.playerAds [].playerResponse.adSlots playerResponse.adPlacements playerResponse.playerAds playerResponse.adSlots adPlacements playerAds adSlots, , propsToMatch, url:player?key= method:/post/i bodyUsed:true) @@ -64,6 +61,0 @@ youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?true.*?\}\}\]\, -youtube.com#@#+js(trusted-replace-fetch-response, /\"adPlacements.*?\"\}\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true) -youtube.com#@#+js(trusted-replace-fetch-response, /\"adSlots.*?\}\]\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true) -youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true) -youtube.com#@#+js(trusted-replace-fetch-response, /\"adPlacements.*?\"\}\}\}\]\,/, , url:player?key= method:/post/i) -youtube.com#@#+js(trusted-replace-fetch-response, /\"adSlots.*?\}\]\}\}\]\,/, , url:player?key= method:/post/i) -youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?\}\}\]\,/, , url:player?key= method:/post/i) @@ -15844 +15836 @@ watashiwasugoidesu.com##.watas-bottom-vi -@@||freereceivesms.com/ados.js$script,1p +@@||freereceivesms.com^$script,1p @@ -23286,0 +23279 @@ y2down.cc##+js(nowoif) +@@||cdn.fuseplatform.net/publift/$3p,script,xhr,domain=quackr.io diff --git a/filters/privacy.min.txt b/filters/privacy.min.txt index 9f6cb1bf..f65844b3 100644 --- a/filters/privacy.min.txt +++ b/filters/privacy.min.txt @@ -7 +7 @@ -! Last modified: Fri, 13 Oct 2023 12:03:17 +0000 +! Last modified: Wed, 18 Oct 2023 05:46:43 +0000 @@ -514,0 +515 @@ natgeotv.com##+js(set, Visitor, {}) +www.lenovo.com##+js(aost, history.replaceState, injectedScript) diff --git a/filters/quick-fixes.txt b/filters/quick-fixes.txt index 22c0b3af..9912fc03 100644 --- a/filters/quick-fixes.txt +++ b/filters/quick-fixes.txt @@ -4 +4 @@ -! Last modified: Tue, 17 Oct 2023 20:54:25 +0000 +! Last modified: Wed, 18 Oct 2023 03:04:34 +0000 @@ -82,2 +82 @@ plagiarismchecker.co##[class][style*="display"][style*="block"]:has(a img[src^=" -! https://old.reddit.com/r/uBlockOrigin/comments/16lmeri/youtube_antiadblock_and_ads_september_18_2023/k1uf2s1/ -youtube.com#@#+js(json-prune-fetch-response, [].playerResponse.adPlacements [].playerResponse.playerAds [].playerResponse.adSlots playerResponse.adPlacements playerResponse.playerAds playerResponse.adSlots adPlacements playerAds adSlots, , propsToMatch, url:player?key= method:/post/i) +! https://old.reddit.com/r/uBlockOrigin/comments/16lmeri/youtube_antiadblock_and_ads_september_18_2023/k1uf2s1/ - 03c8985d @@ -110 +108,0 @@ youtube.com#@#+js(trusted-replace-fetch-response, /"playerAds.*?gutParams":\{"ta -youtube.com#@#+js(json-prune-fetch-response, [].playerResponse.adPlacements [].playerResponse.playerAds [].playerResponse.adSlots playerResponse.adPlacements playerResponse.playerAds playerResponse.adSlots adPlacements playerAds adSlots, , propsToMatch, url:player?key= method:/post/i bodyUsed:true) @@ -119,6 +116,0 @@ youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?true.*?\}\}\]\, -youtube.com#@#+js(trusted-replace-fetch-response, /\"adPlacements.*?\"\}\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true) -youtube.com#@#+js(trusted-replace-fetch-response, /\"adSlots.*?\}\]\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true) -youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?\}\}\]\,/, , url:player?key= method:/post/i bodyUsed:true) -youtube.com#@#+js(trusted-replace-fetch-response, /\"adPlacements.*?\"\}\}\}\]\,/, , url:player?key= method:/post/i) -youtube.com#@#+js(trusted-replace-fetch-response, /\"adSlots.*?\}\]\}\}\]\,/, , url:player?key= method:/post/i) -youtube.com#@#+js(trusted-replace-fetch-response, /\"playerAds.*?\}\}\]\,/, , url:player?key= method:/post/i) @@ -156,4 +147,0 @@ sankaku.app##+js(no-xhr-if, googlesyndication) -! https://github.com/uBlockOrigin/uAssets/issues/19982 -unsplash.com#@#div[style^="--row-gutter"] > div a[href="/brands"]:upward(div[style^="--row-gutter"] > div):remove() -unsplash.com##div[style^="--row-gutter"] > div a[href="/brands"]:upward(div[style^="--row-gutter"] > div) - ```

In that example diff, a ! Diff-patch version: field would also be changed as a result of applying the patch, so once all assets are iterated, we repeat the cycle until no patching occurs.

So in summary, a library which take as input the above example diff, the name of the resource to lookup in the patch (might be something which requires its own field too, i.e. ! Diff-patch name: /filters/filters.min.txt), and the text on which to apply the patch is what we can share and work on together. Once we have that code, we are both very close to supporting diff-update.


Forgot to mention, using a time-based version schema helps also to preemptively skip over a pointless fetch. As you see, I currently use year.month.day.minute-of-the-day to generate version. A client can use this to decide that since the version correspond to an asset updated only 37 minutes ago (just an example), it will skip over trying to diff-update the asset. This could be part of a spec (or not). As mentioned in previous comment, such version schema could also be used to dynamically infer schedule period.

ameshkov commented 1 year ago

if you want to spec this, it should be that a client needs to be prepared that a diff-patch downloaded from a server may contain subpatches for many assets (again, code suitable for a library).

Good point, a single patch can indeed contain diffs for several assets there, it's on the client to figure out which list it actually corresponds to and it's doable if the client takes into account the files locations (URLs).

I'll draft an updated spec taking this all into consideration. Just one thing to note, I still think that for a generic case we need to have Diff-Expires. Also, the list author may want to be able to control how often the ad blocker downloads diffs.

gwarser commented 1 year ago

Something I think worth checking is how CDNs react on fetching non-existing files - maybe 404 is cached and when someone try updating too soon we will end up with broken link to patch. For sure jsdelivr is caching tags permanently - linking to .../@latest/... will always return tagged version, so every "patch" will need to be tagged.

piquark6046 commented 1 year ago

https://github.com/AdguardTeam/FiltersCompiler/issues/192#issuecomment-1773830466 It seems that the following project can be helpful. https://github.com/List-KR/jsdelivr-purge

ameshkov commented 1 year ago

Something I think worth checking is how CDNs react on fetching non-existing files - maybe 404 is cached and when someone try updating too soon we will end up with broken link to patch.

It is usually cached and follows the same expiration policy as all other CDN resources.

Usually, CDNs provide an option to forcibly purge cache.

gorhill commented 1 year ago

I think diff -n is a better option than diff -U0: the resulting patch is quite smaller as it doesn't list the lines which are deleted, and that is ok for the current purpose, we are not trying to re-create older versions from newer versions of a list, only newer versions from older versions.

diff -n doesn't specify a format for multiple files per patch, but we just define one very simple, lines starting define a block for a specific resource:

f [filename] [expected sha1sum of result] [number of following lines making the RCS diff block]

A patcher just require to find the first instance matching the above, then extract the number of following lines as part of the block, then repeat the process, starting another search after the extracted block. Patching with the instructions using the RCS format is a very simple patching algorithm.


Re. checksum, it's from content which starts at first non-comment line to the end of content.

ameshkov commented 1 year ago

@gorhill please check out the proposal here: https://github.com/ameshkov/diffupdates

Moved it to a separate repo so that I could add examples there, but we can continue discussing it here.

I've made a few changes compared to what we discussed, please let me know what you think.

  1. First, I propose to call the metadata field Diff-Path. Seems to me more "neutral" and flexible than Commit version of Diff-patch version.
  2. Second is regarding this part:

    Re. checksum, it's from content which starts at first non-comment line to the end of content

    Tbh, it looks like a complication to me. Also, if we take into account that the patch actually changes the metadata block by changing the Diff-Path value it looks strange to not validate it with a checksum. In the proposal and its examples I changed it to SHA1 sum of the whole resulting file including the metadata block.

gorhill commented 1 year ago

please check out the proposal here: https://github.com/ameshkov/diffupdates

Nice work.

I changed it to SHA1 sum of the whole resulting file including the metadata block.

You're right, it works with whole file given that the checksum is part of patch information, not in the result file itself -- got lost in my thoughts.

Gave all this more thoughts and taking into account the content of your spec I am thinking of a different header for the patch block:

name:$PATCH_NAME lines:$PATCH_LINE_COUNT checksum:$FILE_CHECKSUM

Since you state the f field is optional, let's just drop it.

I think using named fields is more future-proof. Regarding that checksum I think the spec should be that the provided checksum only require to match the start of calculated checksum client-side (i.e. clientSideChecksum.startsWith(patchProvidedChecksum)), so that we do not need all the 40 characters, only the n first provided by the patch. Personally I think using only the first 10-character is just fine to detect corruption in patched file, but since we only need to match the start, how many characters to use is left to the server.

Regarding:

filename: a name or a relative path of the file to be patched

How would a client-side patcher know whether it's a (arbitrary) name or a relative path? For my use case, a name simplifies a lot because internally all filter lists are handled through a token, not a URL -- the cached content is looked-up from the token internally, since the same filter list can be hosted on different CDNs.

Because of this, the token to expect will be included in the filter list following the ! Diff-Path: line, i.e. something like ! Diff-Name:, but that one is not part of your spec. It could be that if a list contains ! Diff-Name:, use this to locate patch block, if not, use name: in patch as the relative path of the resource to patch.

ameshkov commented 1 year ago

How would a client-side patcher know whether it's a (arbitrary) name or a relative path?

Regarding Diff-Name, it makes sense to me and I thought about adding something like that. We can even make it mandatory for the case of a batch update. I'll just update the spec with one additional security precaution: the list that's getting patched must have this exact patch file specified in Diff-Path.

Handling ambiguity with Diff-Name does not seem necessary apart from one case: when there are two different lists that point to the same patch file and have the same Diff-Name, this should count as an error.

Gave all this more thoughts and taking into account the content of your spec I am thinking of a different header for the patch block

Looks good to me, I'd just add a prefix there, something like that: filter-list-diff name:$PATCH_NAME lines:$PATCH_LINE_COUNT checksum:$FILE_CHECKSUM

Since you state the f field is optional, let's just drop it.

I actually meant that the whole diff header may be optional as it's only required for batch updates OR when we need to verify checksums. Generating a diff without a header is just a single diff -n command while adding a header requires a bit more work.

Here's what I suggest then:

  1. Make the diff header optional. If not set the patch will be evaluated as a single list patch.
  2. Make every named field of the header optional as well.
  3. Make name: always point to Diff-Name which is mandatory in the case of a list supporting batch updates.

Personally I think using only the first 10-character is just fine to detect corruption in patched file, but since we only need to match the start, how many characters to use is left to the server.

I'll think more about this one.

gorhill commented 1 year ago

filter-list-diff name:$PATCH_NAME lines:$PATCH_LINE_COUNT checksum:$FILE_CHECKSUM

I originally thought of using +++ but dropped it -- but if we want to use a prefix, maybe it should be agnostic, since who knows if we would want to support the same format for other assets than list in some future. In that case just diff seems ok to me.

ameshkov commented 1 year ago

Please check out the spec changes: https://github.com/ameshkov/diffupdates/pull/1/files

gorhill commented 1 year ago

Patches need to provide the time at which they were created. When repeatedly applying patches to bring a list up to date, there is a yet unknown number of patches to apply. Eventually the last patch will be fetched, but there is no way for the updater to know that this is the last patch, and this will result in a pointless fetch to the server, which will return with 404 or an empty file (as discussed elsewhere).

However if the creation time of a patch is part of the file itself, then the updater can infer that if (patchCreationTime + diffExpires) is in the future, there is no point to attempt a fetch. Currently on my side I use the patch name which is time-related, but if this is not part of the spec, 3rd-party solutions will choose differently and there won't be a way to prevent the pointless fetch at the end of the sequence.

So I propose a special line, the first one in the file, to provide that information:

patch time:[creation time]

Where [creation time] is the number of seconds since the epoch (we don't need ms resolution, and I prefer a single integer than a full blown date-time format). It could be optional, in which case an updater will end up causing pointless fetches as it won't be able to tell whether there is a good chance of a patch existing on the server, which is what (patchCreationTime + diffExpires) < now allows.

gwarser commented 1 year ago

There is already ! Last modified: in uAssets lists. Used to improve updates. Won't diff patch this too? It's "human-readable".

gorhill commented 1 year ago

Yes ok so when batch-updating, the minimum value of (lastModified + diffExpires) of all fetched lists would tell whether there is likely another patch available if the result is before now. Will have to see how this goes in practice next dev cycle.

gwarser commented 1 year ago

@ameshkov please fix the typo in issue title :)

capabitlities -> capabilities

gorhill commented 1 year ago

Unfortunately ! Last modified: is not part of any spec and the field name varied across lists, if present at all. For uBO lists I just went with what EasyList was using. Spotted in various lists:

! Updated:
! TimeUpdated:
! Last change:
! Last update:

At least if it's part of the spec here, we will know where to find the lastModified information in a reliable manner.

ameshkov commented 1 year ago

Tbh, the idea of relying on creation time does not seem reliable. When the patch is created it is known whether the patch is final or not and it can be somehow indicated in the patch file. The downside is that this approach requires removing this marker from the older patch files.

I propose extending our custom diff directive with a new optional field: final: bool. If every diff directive in a file has final: true, then the ad blocker knows that this is a final patch and there's no need to attempt downloading the next file.

gorhill commented 1 year ago

new optional field: final: bool

This means revisiting all previous patches to change this field to false.

Anyway, on my side it doesn't matter in the end as I found that recursively applying patches does not work, in retrospect it was silly to think this was going to work, and in the end the simplest solution (which I was intending to use anyway before realizing recursion does not work) is that each time a new patch is created, all the existing patches will be revisited and recreated with the diff of all changed files since the tagged release matching the patch vs current state of the repo. This means there will be always only a single patch to apply, and thus final field is not needed, and neither is the creation time.

ameshkov commented 1 year ago

as I found that recursively applying patches does not work, in retrospect it was silly to think this was going to work

What was the problem?

gorhill commented 1 year ago

What was the problem?

In the end, L1 is never updated because its patch pointer never points to P2, updates end up stalled for L1.

Possible solutions:

I am going with second one, since the benefit is one single request to bring all the lists up to date, instead of having to fetch all the patches in between. This also removes the issue about whether the code which fetch patches wonder if there is another more recent patch to apply: every patch always lead to the most current version of the lists.

gorhill commented 1 year ago

By the way, maybe we could dispense requiring a Diff-Name field if we use some notation to add diff name to Diff-Path, for example from what I have so far, it could be something like:

! Diff-Path: ../patches/2023.10.28.1127.patch:ublock-filters

or

! Diff-Path: ../patches/2023.10.28.1127.patch#ublock-filters
ameshkov commented 1 year ago

@gorhill

Possible solutions

Yeah, I was also thinking about that. I think we'll just won't be relying on VCS for generating patches and will be just looking at the list version + the previous list file. So if there're no changes the patch will stay empty. In our case relying on VCS is quite problematic by itself since we're generating different platform-specific filter lists versions and keep only "full" unfiltered one in VCS.

By the way, maybe we could dispense requiring a Diff-Name field if we use some notation to add diff name to Diff-Path, for example from what I have so far, it could be something like

I think we don't save much on this, but make it a little bit harder to quickly figure out if this is a batch-updatable list or not.

ameshkov commented 1 year ago

@gorhill Sorry for the late reply.

Batch updates will be mostly used by uBO filters so we should indeed to design it according to your needs. I'll update the spec accordingly.

Also, for the patch creation time, I also still think it's needed, and again, simpler if stored in the patch

Wouldn't it be better to standardize Last modified? This way it can be used even when the filter list does not support differential updates. I.e. if a list has Last modified then we assume that this is when the list was last updated. If a list does not have it, then we assume that the last modified time is the moment when it was downloaded.

ameshkov commented 1 year ago

@gorhill please check out the PR: https://github.com/ameshkov/diffupdates/pull/2/files

Kishlay-notabot commented 10 months ago

but the OP makes it sound like the delta update process takes place within FiltersCompiler

It's actually unrelated, just a repo to open the issue, we'll create a new one if everything is okay with the spec.

filters-diff-builder is a new tool that will build diffs. The input for this tool is two filter lists versions: the previous and the new one.

@ameshkov

Hi, I am a complete beginner so I apologise in advance for my lack of understanding. I read the diff-update repo in detail which you have created and I superficially understood the concept of the delta change acceptance and removal method of the old content in the filter lists. But here you have stated that the tool has 2 inputs, the old filter list and the new filter list. So I think that how does this make a change in the bandwidth consumed? Because, firstly I have a doubt that where will the delta comparison of the updated-filter-list.txt and the primary-list.txt (the older one) will occur ? And I assume this will happen on the client side obviously. And if this comparison has to happen in this algorithm, we have already fetched the whole filter list from the server in order to accomplish this. So doesn't this cancel out our attempt to reduce requests ? Because already we are downloading the filter list from the server as a whole and just accepting the changes?

105th commented 10 months ago

@Kishlay-notabot The comparison between the old and new filters will be calculated on the server side. When checking for filter updates, if new filters are found, a patch will be created. Subsequently, both the filters and patches are updated in the CDN. For a detailed explanation of how this process works, you can refer to the following link: https://github.com/AdguardTeam/FiltersRegistry?tab=readme-ov-file#how-to-build-filters-and-patches and this: https://github.com/AdguardTeam/FiltersRegistry/blob/master/scripts/auto_build.sh .

Kishlay-notabot commented 10 months ago

@105th Thanks for the links, will check them out!