ameshkov / diffupdates

Filter lists diff updates proposal
17 stars 2 forks source link

Replace Diff-Name with resource name #2

Closed ameshkov closed 9 months ago

ameshkov commented 10 months ago

See the discussion here: https://github.com/AdguardTeam/FiltersCompiler/issues/192#issuecomment-1792291460

Updated the spec according to the discussion:

  1. Diff-Name is replaced with a resource name specified as a part of Diff-Path
  2. Optional timestamp added to the diff directive
ameshkov commented 10 months ago

@105th yes, that was done intentionally so that we could check that the implementation is able to handle the case where there's no new line in the end of file.

gorhill commented 10 months ago

From my experience so far, the timestamp is no longer something I see as needed, it was when I thought patching would be done recursively, but it turned out this didn't work in practice -- best is to patch once, and to ensure this all existing patches a re-generated server-side once a new patch is created. This has the benefit of requiring a single fetch to bring all lists up to date even if the extension had not been running for a few days.

Now regarding ! Diff-Expires:, it tells us when to look again for a new patch, this means we need to remember when was the last patch successfully applied, and thus yet-another piece of information to store client-side. To avoid such need, I decided to encode the time of the patch in the name of the patch, and this works rather well in conjunction with ! Diff-Expires:: No need to store anything more client-side, and the code handling the logic on whether a patch is available or not just derive creation time from the name, adds ! Diff-Expires: and see if the time is in the future.

The timestamp value would accomplish this, but then it's a value to be stored, which I prefer to avoid. What is already stored is ! Diff-Path:, and if we can also store the timestamp in there, then we have all that is needed to get the patch creation time and the name of the list to patch.

ameshkov commented 10 months ago

@gorhill

Well, removing timestamp is not a problem. Generally, I think that the diff directive fields can be extended by any ad blocker the way they see fit, that's why the fields are named (I'd better mention this fact in the spec).

Regarding your approach with encoding additional information in Diff-Path, in my opinion it's too specific to be a part of the generic spec.

it tells us when to look again for a new patch, this means we need to remember when was the last patch successfully applied, and thus yet-another piece of information to store client-side

Tbh, I don't really understand the need to avoid storing and, what's more important, generating it on the client-side. When you generate it on the client-side, you are 100% sure that it's correct. Many things can go wrong on the server, you'll need to take possible unexpected input into account (the simplest example would be having date in the future). Finally, it ties your own hands: you'll have to generate patches exactly this way and you won't be able to change anything on the server-side once you deploy this version to the users (eventually, you will, but doing it without hurting users requires time and effort).

ameshkov commented 10 months ago

All that said, what do you think about standardizing Last modified: and possibly other important fields of a filter list (Title, Homepage, etc)?

I think the fact that there's no place on the Internet that defines what kind of fields can be actually used by maintainers does not help. And it's rather easy to fix just by providing the universally supported guidelines (which we all kind of already do, but never wrote it down).

gorhill commented 10 months ago

It's difficult to discuss with "in theory" scenarios, so let's just be very specific about what's going through my mind. Here is the code that avoid fetching unnecessarily: https://github.com/gorhill/uBlock/blob/1.53.3rc0/src/js/diff-updater.js#L237-L241

That code derives patch time creation from patch name, then add ! Diff-Expires: to get a timestamp value, and if that timestamp is beyond now, no fetch is performed. That code must exist, avoiding pointless fetches to a remote server is important.

I think you are arguing that the patch creation time should be associated and stored with each asset to patch, and that the patch creation time is provided in timestamp field? (which would be the same value for all sub-patch in a batch patch file.) If not provided than make up one by assuming the time at which the patch was fetched is the creation time?

Many things can go wrong on the server, you'll need to take possible unexpected input into account

This already happened and it's really not an issue, it's pretty simple client-side, at the first unexpected outcome for any required step, diff-updater does nothing and in the end all fall back to the usual auto-update with ! Expires field.

ameshkov commented 10 months ago

That code must exist, avoiding pointless fetches to a remote server is important.

Absolutely, that's why Diff-Expires is there: "don't go for new patches until the old one expires". With batch updates the situation is a little bit more complicated as different lists may have different expiration.

Now that I am thinking about it I got another idea, we could make the patch itself the source of all necessary information. What if we add a new directive (single directive for the whole patch file regardless of how many ) that encodes this? We can then drop Diff-Expires, it won't be needed anymore. I think you even mentioned something like that before, right?

Something like this:

head expires:10600 timestamp:1699599870000

I think you are arguing that the patch creation time should be associated and stored with each asset to patch

I had two points actually:

My first point is that it's safer to generate the last-updated timestamp on the client-side because of the possible mistakes that could happen on the server-side. Talking about your code, two scenarios:

  1. The server at night failed to generate new patches and you always have expired patch -> triggers an update every time it's called.
  2. A minor mistake in the server script introduced at some point and the server starts generating dates in the past (UTC-5 for instance) -> you end up with always expired strings.

Is that all a major issue? Not at all, I'd say that it's a matter of preference, but with a client-side generated timestamp you're protected from these scenarios.

  1. My second point is that file name is not a good place for meta information because it will make it very hard to change this approach later on.
gorhill commented 10 months ago

head expires:10600 timestamp:1699599870000

What happens the first time you fetch a list? How do you know when to make a hopefully successful fetch for a patch? You have the Diff-Path field, but not yet the future time at which to look. This is where encoding the time in the name comes to the rescue. Currently I use 4 fields, but it could be a single one, in minutes since epoch (i.e. whatever.28327261.patch) or 5-minute chunk since epoch (i.e. whatever.2360605.patch), and if you want to get rid of Diff-Expires, then dot-separated values in the name, i.e. whatever.2360605.72.patch.

Clearly where we diverge is you have issue with encoding that information in the name, whereas I do not see this as an issue: if it's in the spec, why worry about changing this in the future? What could ever happen that this becomes a burden? If the spec says this is the way, then whoever want to properly implement diff-updating their list will have to just follow the spec or use available tools that does all this for them.

ameshkov commented 10 months ago

@gorhill

What happens the first time you fetch a list? How do you know when to make a hopefully successful fetch for a patch?

Good questions.

It just bothers me that both the expiration period and the timestamp are properties of the patch file and they should be the same in every list that refers to that patch file.

Let me put aside personal preference and analyze advantages and disadvantages, please correct me if I am wrong anywhere.

One important detail: if we start encoding information in the file name, I think we should encode the expiration period there too. As I said, I think it's a property of the patch file, just like the timestamp, so it'd be strange to store one here and one there.

All in all, I'd say that the disadvantage is not that serious since there's still a way to upgrade the spec and avoid issues.

Here's what I propose then:

  1. Encode both the timestamp AND the expiration period into the file name. You're using a humand-readable format now to encode the timestamp, but it's precision is just 1 hour so we'll never be able to have differential updates more often than 1 hour and at times it'll be 2 hours. Let's switch to epoch and seconds instead: ${patchName}-${diffExpires}-${epochTimestamp}. It is not human-readable, but I don't think that it's a big issue to be honest, you'll still have the updated Last modified inside the patch file.
  2. List all edge cases that we already know about in the spec: how to handle invalid format, how to handle dates in the distant future or in the past.
  3. There was a very good comment about distributing the load and avoiding spikes. When using a CDN it's usually not too important, but generally spikes can be avoided by scheduling the "update worker" correctly.

UPD I read this and realized that I basically repeat what you proposed. Minutes since epoch is also okay I guess.

gorhill commented 10 months ago

${patchName}-${diffExpires}-${epochTimestamp}

It's me nitpicking, personally I prefer the expire value to come after the creation time, it reads more naturally in my opinion: this patch was created on epochTimestamp and expires in diffExpires.

Minutes since epoch is also okay I guess.

Could even be chunk of 5, 10 or 15, i.e. whatever resolution is most certainly the minimum ever needed. This would shorten a bit the string. Resolution could be even left to the patch emitter:

${patchName}.${resolutionInMinutes}-${epochTimestamp}-${diffExpires}.patch

Where both epochTimestamp and diffExpires must be multiplied by resolutionInMinutes. I can imagine a patch emitter not caring about emitting at hours interval only:

thelist.60-472216-6.patch

Whereas implicitly in minutes this would be:

thelist.28332968-360.patch
ameshkov commented 10 months ago

Please check the updated spec.

@gorhill I've incorporated most of what we talked about.

Notable changes: