Increase update check efficiency

TheAssassin commented 6 years ago

I didn't find the original description, so I'll describe the idea again below.

I think it was @pamarcos who's requested a "custom .zsync file header" feature for zsync2, which I implemented a while ago. Using this feature, @probonopd and I designed a way to highly increase update checks' efficiency by adding such a custom header.

The idea is to embed some kind of randomly generated (well, the actual constraint is that it doesn't depend on the contents of the AppImage, see <insert link to original discussion>) UID (unique identifier) in the AppImage before building the .zsync file. The same UID is then added as a header field.

When checking for an AppImage, zsync2 has the option to download the file header only. When it sees the header containing the UID, it could be compared to the one embedded in the file (AppImageUpdate would need to read it from the AppImage and pass it to the ZSyncClient).

In case the values differ, a full update is triggered, even if the rest of the files are equal, at least the chunk with the UID needs to be updated.

The chance for collisions (i.e., the same value is used twice, although the files differ) is negligible, similar to the change for hash collisions.

I would suggest to use some "well known" algorithm to generate an identifier, e.g., UUID4. It can be stored efficiently in raw bytes in the AppImage, and stored hexadecimally in the .zsync file for readability.

The performance of the update check will improve a lot for large AppImage files, as hashing these files can be pretty slow, and they need to be hashed completely to ensure the file is up to date.

TheAssassin commented 6 years ago

TODO: open issue in the AppImageKit repo to first define a method how to embed and later read the UIDs from the AppImages. I guess we'll end up adding a section to the ELF header.

probonopd commented 6 years ago

As people want reproducible AppImages, I think we should use some sort of a checksum rather than a random UUID. What do you think?

TheAssassin commented 6 years ago

A checksum that depends on the contents won't work for various reasons. If I only found the original feature description...

pamarcos commented 6 years ago

@TheAssassin is it this one? https://github.com/AppImage/AppImageKit/issues/525

TheAssassin commented 6 years ago

Yes, thanks a lot, @pamarcos. I was looking in the repos of AppImageUpdate and zsync2.

pamarcos commented 6 years ago

Just to give more information about why I wanted to include custom information in zsync header, in my case it was to have the ability to have meta info that can be read in the client in order to distinguish whether we need to update or not. ATM zsync will always try to sync the file itself if it detects it's different (EDIT: added the bold words), which may not always be desired if the client version is newer than the other one.

Also, one could even provide the changelog into the header itself and present that to the user. Basically, it was to give developers the means to extend to whatever they find useful the zsync header ;). Thanks a lot for implementing it.

For instance, Sparkle framework checks the version and will only update if and only if, the client is older.

TheAssassin commented 6 years ago

ATM zsync will always try to sync the file itself, which may not always be desired if the client version is newer than the other one.

For instance, Sparkle framework checks the version and will only update if and only if, the client is older.

That's wrong. zsync2 (the CLI) performs an update check before attempting any syncs. The method is expensive, though, which is why we are looking for alternatives.

pamarcos commented 6 years ago

Let me rephrase it: "ATM zsync will always try to sync the file itself if it is differerent , instead of if it is newer"

The process to check whether it's different or not is what's expensive since it's basically doing the checksums for all the blocks

TheAssassin commented 6 years ago

@pamarcos zsync2 is a file syncer not a file updater, so syncing is what you want.

Not to mention there is a possibility to compare the timestamps of the files, which works great, but which you didn't want back when I introduced it.

pamarcos commented 6 years ago

@pamarcos zsync2 is a file syncer not a file updater, so syncing is what you want.

That's correct, and it serves it purpose really well. However, in the context of AppImageUpdate (since we're using it to update AppImages) and thinking of the UX, using just zsync without anything extra lacks the functionality to have a proper updater. My idea about extending the zsync header was to allow developers to decide what they need to check to know whether to download or not a version or even provide extra information to the user. We introduced the -j parameter in the CLI to check for updates without updating the file itself. That's right for any generic AppImage even though it's expensive.

With the current implementation of the AppImageUpdate lib we can easily do that, since all we need to do is get the zsync file (which has embedded the extra parameters of version and changelog), parse it and retrieve the version. If version is newer than client's version, we can ask the user whether to update or not, providing the new features and bugfixes.

Not to mention there is a possibility to compare the timestamps of the files, which works great, but which you didn't want back when I introduced it.

I didn't like comparing timestamps because I think it's not very realiable. Not many users are going to touch the file, but those who do will lose updating capabilities.

pamarcos commented 6 years ago

@probonopd

As people want reproducible AppImages, I think we should use some sort of a checksum rather than a random UUID. What do you think?

The zsync header already contains a checksum of the whole file:

zsync: 0.6.2
Filename: appimaged-i686.AppImage
MTime: Sun, 25 Mar 2018 01:35:35 +0000
Blocksize: 2048
Length: 332144
Hash-Lengths: 2,2,4
URL: appimaged-i686.AppImage
SHA-1: ea3a2208d1e06c72efc082864a788a94dbe38e20

I can only presume that calculating the SHA-1 checksum may be slightly faster than the adler+md4 checksum for every block, but I don't think that would make a huge performance improvement to what we have now.

antony-jr commented 6 years ago

@pamarcos How about semver in the header and the AppImage to find which is the latest one ?

TheAssassin commented 6 years ago

@pamarcos hashing the entire file will never be performant. Also, we cannot use a hash, see https://github.com/AppImage/AppImageKit/issues/525#issuecomment-344382418.

One thing that is never going to happen is that there will be any kind of version number involved in the update process. Version numbers are not at all reliable. And many projects don't use them. Also, they're bad for continuous releases.

Why do you think it's so hard for a developer to maintain a single static file (or symlink to a file) on a webserver?

pamarcos commented 6 years ago

One thing that is never going to happen is that there will be any kind of version number involved in the update process. Version numbers are not at all reliable. And many projects don't use them. Also, they're bad for continuous releases.

I completely agree. I was not suggesting to do so. I was just justifying why adding custom parameters into the zsync header was useful. I do not have the answer to how reliable check whether a file should be updated or not based only on the file if we're not using a hash. Using a UUID seems logic and fair enough to me, but it makes me a little uneasy since it wouldn't even be based on the file itself but just some random and large enough ID that is statistically unlikely to clash with another one.

For my use case, checking the version is more than enough to know whether there's need to update or not. Using this approach, we don't need to do the whole checksum for the file which is slower. There's no development that needs to be done to accomplish this since we can already do it. I understand this is not the case for a generic AppImage, so we need another solution that I don't have ATM.

Why do you think it's so hard for a developer to maintain a single static file (or symlink to a file) on a webserver?

I don't think is hard to maintain. We do have the zsync file updated for every single release. However, there are a few cons to using the "sync this file with this one" approach from the UX point of view:

Developer, RC and beta versions distributed to a small number of people are still gonna ask to be updated against the official release even if it's oldest. I know this may not be the case for everyone. Another solution could be having your client point to a different webserver for those versions or even disabling the update functionality.
Asking to update the AppImage without providing the user what has changed from one version to another one seems unfair. Users may choose not to download if they don't see the benefit of the new version. The other possibility would be to have a separate mechanism to download the changelog or some extra information for that version, but that would require a separate system to provide that information. If it's already embedded into the zsync file, it's very straightforward to do it, making it similar to what appcast.xml provides in Sparkle.

TheAssassin commented 6 years ago

@pamarcos please open separate issues, as your feature proposals are off topic regarding the new update check method. (We might not implement the method I suggested in the first post after discussing those, though.)

Developer, RC and beta versions distributed to a small number of people are still gonna ask to be updated against the official release even if it's oldest. I know this may not be the case for everyone. Another solution could be having your client point to a different webserver for those versions or even disabling the update functionality.

Hm, okay, I think we need a solution to support "multi channel updates" and ways to switch between these, or maybe something like prioritized lists of update information, which are checked sequentially, and the first that returns a "change" is used for the next update.

The problem, however, is, we need to find a method how this can be implemented in a way so it doesn't depend on any kind of versioning scheme. Suggestions welcome. I still believe the MTime header is the way to go. All these fears that the times might not be synchronized are a bit overstated. Time stamps are a core functionality for Internet dependent technology, and are easy to manage.

The way I would design this: you could specify a list of update information, e.g., zsync|stable-url:zsync|beta-url in a beta client. As long as there's no changes detected for the stable URL, the next URL will be checked, in this case the beta URL. As soon as a new stable release will be available, AppImageUpdate would download the changes from there. The only problem would be that we'd need to patch the version downloaded from the stable URL and append the beta URL again, otherwise a beta version would be "converted" to a stable one, which is undesirable.

I still feel this is overengineering. It should be possible for update system maintainers to publish the stable version's zsync file under the beta one's URL when there's a stable release. This way, no changes would be necessary to the AppImageUpdate code, as this is possible already (although, to be fair, the beta URL would need to be put into that AppImage to make sure beta releases will be downloaded in the future).

Looks like we need a conceptual discussion about the complexity we want in AppImageUpdate, there's a lot of pros and cons to consider. Same goes for the changelog "API". @probonopd once suggested that AppImageUpdate could fetch the changelog for releases on GitHub via the GitHub API without much hassle. If someone comes up with an idea for the other types, I'd work on that. Maybe we can just add a X-Changelog-URL: header or something.

Anyway, that's off topic. We should return to discussing the update check efficiency. Let's open separate issues for the other two topics.

pamarcos commented 6 years ago

@pamarcos please open separate issues, as your feature proposals are off topic regarding the new update check method. (We might not implement the method I suggested in the first post after discussing those, though.)

I think I didn't explain myself properly. I was actually not suggesting any feature, everything I stated should be possible if custom information can be embedded and retrieved into/from the zsync2 header. You said you already implemented that, so nothing else to be done 👍 . I haven't tried it yet, though.

I just justified the need for that and explained how I think it can be used to check for updates without running a hash check provided your app has a proper versioning scheme. In case someone else may find it useful. I should probably have given that piece of information in a different place because I see it caused confusion.

I still believe the MTime header is the way to go. All these fears that the times might not be synchronized are a bit overstated. Time stamps are a core functionality for Internet dependent technology, and are easy to manage.

Every single time I have worked with times in different time zones it's been a true headache with lots of corner cases to cover. Computerfile has a funny video dedicated to this. Even if you do get to get it working perfectly, mtime does not ensure the file is the same in any case. I actually prefer relying on an UUID rather than on that, but that's my personal opinion.

Anyway, that's off topic. We should return to discussing the update check efficiency. Let's open separate issues for the other two topics.

Agree, sorry for the confusion.

TheAssassin commented 6 years ago

@pamarcos I see, well, I think we need a better API to access the headers, I'll open an issue for that.

I actually prefer relying on an UUID rather than on that, but that's my personal opinion.

Cool. Now we just need to fix the issue @probonopd brought up regarding reproducibility of AppImages. I recently changed some minor detail in the way the squashfs image is generated, so that now AppImages that are built from the exact same AppDir will be bitwise equivalent. We should not break with that feature by using randomly generated UUIDs.

I think we might really get away with a hash sum that does not depend on the AppImage but only the squashfs image. Potential caveats involve that changes of the runtime or ELF header fields won't be recognized by AppImageUpdate any more. We could mitigate them to some point by combining this check with the MTime based method, i.e., even if the hash value is the same, if the MTime of the AppImage on the server is newer, an expensive update check is performed. That way, at least a majority of runtime changes will be noticed.

AppImageCommunity / AppImageUpdate

Increase update check efficiency #83