probonopd commented 8 years ago

AppImage is all about easy and secure software distribution for Linux right from the original upstream application author directly to the end user, without any intermediaries such as Linux distributions. It also supports block-based delta binary updartes using AppImageUpdate, allowing for AppImages that can "update themselves" by using information embedded into them (like Sparkle Framework for macOS). Consistent with this vision, we would like to enable peer-to-peer based software distribution, so that we would not need central hosting (such as GitHub Releases, etc.) while ideally maintaining some notion of a "web of trust" in which it is clear who is the original author of the software, and that the AppImage is distributed in the way the original author wants it to be distributed.

In this ticket, let's collect and discuss various peer-to-peer approaches, that could ideally be woven into the AppImageUpdate system as well.

"IPFS is the Distributed Web. A peer-to-peer hypermedia protocol to make the web faster, safer, and more open." https://ipfs.io

Should we use it to distribute AppImages?

davidak commented 8 years ago

That would be a really cool feature. When someone on my local network has downloaded the app already, i can download it from him. But it needs to be verified. Is there something like a has for every AppImage from upstream? Otherwise cheap IoT devices from china could send you infected AppImages.

probonopd commented 8 years ago

Like, a GPG signature? Currently these are separate files (outside of the AppImage), but we could also append them to the AppImage (=make them part of the AppImage).

davidak commented 8 years ago

I tried IPFS out the last two days and read a lot about. It has hashes integrated to find the right content. So you get the content you request.

Downloading files is easy. Here you get the lastest official Subsurface AppImage: ipfs get QmUH4SZVdBPekZXkE77ntLknAtAjuiKsHgEW6eJzioyQyD (you need to have the IPFS daemon running)

There is also ipget which included a IPFS node. ipget QmUH4SZVdBPekZXkE77ntLknAtAjuiKsHgEW6eJzioyQyD -o Subsurface-4.5.6-x86_64.AppImage https://github.com/ipfs/ipget

probonopd commented 7 years ago

Also check Ethereum https://www.ethereum.org/

davidak commented 7 years ago

@probonopd How would that help to distribute AppImages?

Another technology like IPFS is WebTorrent. You seed while you are on the website.

probonopd commented 7 years ago

@davidak not sure yet; didn't check it in detail yet.

Regarding WebTorrent, who stays on a single webpage for so long? Probably more suited to video distribution than apps.

probonopd commented 7 years ago

Check the Keybase filesystem: Public, signed directories for everyone in the world. https://keybase.io/docs/kbfs, very promising.

Every file you write in there is signed. There's no manual signing process, no taring or gzipping, no detached sigs. Instead, everything in this folder appears as plaintext files on everyone's computers. You can even open /keybase/public/yourname in your Finder or Explorer and drag things in.

And

Keybase can't be coerced to lie about your public keys, because each one needs to be announced, using a previous device or paper key. Together, these announcements form a chain that is announced in the bitcoin block chain.

But:

We're giving everyone 10 gigabytes. (...) There is no paid upgrade currently. The 10GB free accounts will stay free, but we'll likely offer paid storage for people who want to store more data.

probonopd commented 6 years ago

Also see https://twitter.com/taoeffect/status/925875220795219968

probonopd commented 6 years ago

Also see the Dat project https://datproject.org/ and the Beaker Browser https://beakerbrowser.com/ built on top of it. Also see https://twitter.com/probonopd/status/925106318796578818

pfrazee commented 6 years ago

A few thoughts from the Beaker Browser team

Dat uses pubkey-addressed archives, so anything distributed with it is signed. (There's also potential support for static hash-addressed archives.)
Dat also maintains a changelog in its archive metadata, so it's good for tracking history & versions.
Dat does not currently use differential diffs in its own updates, but there are plans to investigate that in the future.
We use Dat in Beaker to act as a website, but it can be any form of data storage. In the next version (0.8) we will have a built in user identity concept which will use Dat archives to represent users. That will eventually be a foundation for webs of trust in the application layer -- but it will take some time for the WoT networks to mature.
We're switching over to electron-updater in Beaker right now, which is great because it has pluggable transports. For us, the main reason we haven't distributed Beaker over dat is the need for auto-updates. I think now that it'd be fairly trivial to write a Dat transport for electron-updater and get all the behaviors we need.

TheAssassin commented 6 years ago

@pfrazee sounds promising. If you want to investigate binary delta updating, you can check out zsync(2), which is based on the same algorithms that rsync uses. It calculates a meta file for an existing file, containing a set of hashes (calculated by chunking a file into blocks with a specified blocksize and hashing the blocks using a specified hashing algorithm).

I'm sure it's possible for you to make use of the functionality in this library. Heck, I could even imagine zsync2 supporting Dat as a URL scheme.

TheAssassin commented 6 years ago

We use Dat in Beaker to act as a website, but it can be any form of data storage. In the next version (0.8) we will have a built in user identity concept which will use Dat archives to represent users. That will eventually be a foundation for webs of trust in the application layer -- but it will take some time for the WoT networks to mature.

I've thought a lot about Web of Trusts recently for application deployment (AppImage related), and they'll apply to generic content as well.

Often, PGP's WoT is used as a reference for working Web of Trusts. Their trust model works like, "I trust user A, and user A, B and C trust user Z, so I can trust Z, too, I guess." However, this trust model is only used to verify the authenticity of a key a mail you receive is signed with, the crypto itself does not depend on it, and even if a key has no third party signatures, it doesn't mean much to the security of the communication itself. In most cases, the users know each other anyway, and trust the keys in their mail clients by validating the keys' fingerprints manually. It's a nice idea, but isn't used by many people. Nowadays, you'd rather put your key ID into all mails you send, send them over a second channel (like a chat service or phone), or put it on your website, where people can get it and download and trust the key before writing and after receiving mails.

When building a WoT from scratch, one can use pretty much the same methods and structures PGP established. Sure, it'll take a while to get people to use it, and build a large base of trusted users so that a certain level of security is reached. The algorithms and structures are proven in the real world, and despite they haven't ever reached the majority of email users, they are secure and work fine.

However, no WoT is really immune against malicious attacks. It's fairly easy to manipulate a WoT. Let me give you an example: By creating a few thousands of keys who then sign each others' keys (not everyone's, that'd be too obvious) and keys of all the other users (that'll make them look even more valid), you can create accounts appearing trustworthy, but have been created by some software. Time's not a factor here, the software could've been running for weeks or months. The problem is that it is really hard to detect those as being malicious (attackers are pretty good at finding flaws in your code, especially when it's open source), and once they're in the network, there is no chance to get rid of them unless you have some central "blacklist" (which undermines the decentralization aspects of a WoT). Even if you'd support some decentralized "anti trust" feature (like some second kind of signatures which discredit a key rather than making it look trustworthy), 10 minutes of an attack could be enough to do a lot of harm in dependent systems.

Transferring those thoughts to application distribution, as said, 10 minutes can be enough for an attack to do a lot of harm for your users. As research in the field of anti virus shows, 10 minutes can be enough for something like ransomware or computer worms to spread across a lot of computers. This is similar to zero days, they can be fixed within the same 10 minutes, and even if the fix would be deployed immediately, the ransomware can have infected 100s of 1000s of computers and thus have dealt a lot of damage. I could provide a list of references, but as we've all heard of it before, I don't think it's necessary.

Therefore, I am trying to construct some more secure trust models for the AppImage ecosystem. For AppImage's updating mechanism specifically, we could inspect the key the old AppImage is signed with, and then check whether the new AppImage's key matches the old one. In that case, we can trust it this time, and perform the update. Otherwise, we can either reject the update, or show a big yellow warning and have the user decide on it. As long as the key won't change, everything will work smoothly, but if there should be an issue, we can protect the user from any kind of attacks.

For the desktop integration via appimaged or (even better) the desktop environment itself, I'd imagine a trust model similar the one PPAs on Ubuntu established. We'd allow users to trust keys AppImages are signed with by adding them to a separate user specific key ring. (Distributions could even ship with a global keyring, such as openSUSE with it's openSUSE build service, which builds AppImages and signs them with the OBS key). Whenever it finds a new AppImage with an unknown key, it could ask the user about whether they want to trust the key or not. AppImages provided by the same developer would then be trusted automatically, however, new AppImages (i.e., the ones not marked as executable already) could show a "first use" warning, asking the user whether they want to run the AppImage when they double click it. When implementing the trust model I suggested, an additional security layer is put on top of this very basic security mechanism. Whenever an unknown key is encountered, the pop-up could also ask whether you'd want to trust the key. If you e.g., check a checkbox, it'd suppress further warnings for this specific key, otherwise the AppImage would still be executable, but the DE could still spawn a warning that the key cannot be trusted.

AppImageUpdate could eventually implement the same idea, by issuing a warning for unknown keys, and once they are trusted and the new file's matches the old key, the upgrade will just be performed. On a key change, it should clearly state that the new key differs from the old one, and ask the user whether to trust the new one, and whether the old one should be removed.

I think that's a fairly secure trust model for AppImage, using some established structures, being not too complicated and easy to implement by users with our existing zsync based infrastructure.

TL;DR: Coming back to Dat, I don't think a web of trust will provide any real security to your users, for the reasons stated above. People should not ultimately rely on it, and for application deployment, where foreign code is supposed to be executed on others' machines, I would never ever rely solely on a Web of Trust. For static websites and other harmless contents, it might work to some extent, but thinking of a browser, when it comes to JavaScript, things get problematic again.

So, if you design a Web of Trust which is not subject to any of those issues, please make sure to notify us, because I'm really interested in the topic. If it'd fit our needs, I'll consider using it for AppImageUpdate, too!

pfrazee commented 6 years ago

We need to redefine the WoT away from how PGP defined it. The pure "human friends only" model is way too slow-moving, and the measure of transitive trust was a fairly limited form of graph analysis.

The new definition should be based on a set of features:

Pubkeys are used to identify all agents
App activity is published in signed cryptographic networks (BitTorrent, Dat, SSB, IPFS)
Rather than relying on in-person signatures, we bootstrap from existing CA-secured channels and rely on multiple overlapping signals to provide reasonable confidence

Cryptographic networks like Dat give a richer dataset to analyze. All interactions are logged in the network, and become signals for graph analysis. So, inconsistencies should be more detectable.

For instance, if multiple "Paul Frazees" start getting followed, a crawler should be able to notify me and I can react by flagging them. Then, as with any graph analysis, the computed trust is a matter of choosing good starting nodes (and doing good analysis).

For bootstrapping trust, we use auditable key distribution nodes, which ought to be the job of orgs and services. We can use auditable contract systems like nodevms to back these kinds of servers. They will then use CAs to identify themselves. So, again: a combination of CA-secured channels and app-generated trust signals.

Direct in-person signatures could still be used, perhaps initially only for high-risk tasks like software distribution. That would be the sort of thing where the user accounts of the org and devs have published special "trust" objects on Dat, which are in turn used by software-installers.

But-- that question is basically pushed into application space, since any app can decide how to do its trust analysis on top of the crypto networks. So, perhaps instead of calling it a Web of Trust, we need to think of it as a "Trust Platform," because we're putting trust signals into the application space as a primitive to work with.

Regarding the risk of the attack window, with any automated decision based on trust, such as installing software, there's always the option of putting in a time delay. "This software must be published for 24 hours with no 'withdrawal' signals from X set of users before being installable."

probonopd commented 6 years ago

What I mean with "web of trust" is really not specific to applications but I guess has been/needs to be solved for a peer-to-peer Web browser as well. After all, an AppImage is just a file, like a HTML file. In both cases I want to have certainty that what claims to be coming from, e.g., @pfrazee (just standing in as an example here), is actually coming from @pfrazee and has not been altered in between - be it a HTML page or an AppImage. The more difficult question is whether @pfrazee can be trusted - be it with information or software originating from him. An indication may be who else is following him.

So in summary, I believe a peer-to-peer Web browser needs to address the very same questions somehow, and if they are adequately solved for Web browsing, then we can also use the very same concepts for software distribution.

Agree?

pfrazee commented 6 years ago

I believe a peer-to-peer Web browser needs to address the very same questions somehow, and if they are adequately solved for Web browsing, then we can also use the very same concepts for software distribution.

@probonopd I think that's exactly right.

TheAssassin commented 6 years ago

You're right, it's probably better to avoid calling this "Web of Trust", as I guess many people associate PGP's model with that term. I have to admit I'm not too much into blockchain technology or stuff like smart contracts which are built on top of it.

All this sounds quite interesting, but also far from being mature right now, unfortunately. Is there a roadmap, set of definitions or specifications or any other data where interested people could get informed about your plans?

I'll be thinking about what you said about the trust model that an application scenario like "app update distribution" could put on top of it. I see what you mean with the withdrawal signals, but I couldn't imagine how to realize that, since there's a paradox: You don't want to publish updates until a "crowd-sourced" trust has been reached, but how would that be possible without pushing updates to at least some users? A/B like testing might work, but the e.g., 10% of users, who would receive the update right away are put at an unacceptable risk of getting malware on their systems (they might not even be able to push a withdrawal request into the network, depending on the effects of the malware).

Right now I'm not 100% sold of the concept, but I'm confident that a constructive discussion might lead to a working model. If you could point me to a place where you discuss those things, I'll have a look as soon as possible.

By the way, I think it might be worth to talk to some bigger projects like openSUSE, too, who provide trustworthy AppImages (they sign the AppImages they publish with their pubkeys, so they might be a reasonable institution to "seed trust" in the network.

All in all, Dat and Beaker sound interesting for distribution right now, but I'd leave aside its web of trust when implementing it in AppImageUpdate, I'd rather continue to use a more conservative trust model like the one I suggested.

TheAssassin commented 6 years ago

By the way, I'd like to invite you into our IRC channel, #AppImage on Freenode.

probonopd commented 6 years ago

What establishes trust today?

An "official" domain. Downside: anyone can go register getbeakerbrowsernow.org - which user checks the whois records?
https certificates. Downside: in reality, anyone who was able to register getbeakerbrowsernow.org will get https certificates for it
Google rank. Downside: SEO experts can game the system and get getbeakerbrowsernow.org in the top spot
GitHub stars. Downside: Given enough evil intentions, these can likely be faked too, but it takes much more effort because we can check out who starred a project. This has "web of trust" aspects
GPG or other types of signatures. Downside: Do I really know whether the signature belongs to who claims to be the person? Should I trust a GPG signature belonging to getbeakerbrowsernow.org?
Software in a distro repository. Downside: Does not scale. By far not every software on the planet will be in a distro repository in all versions, including continuous builds. For enterprise stable distributions, there is only outdated applications in the repositories. Hence, most software ends up in additional third-party repositories like PPAs or personal repositories on OBS. Who really checks their integrity?

What might establish trust in the future?

pfrazee commented 6 years ago

All this sounds quite interesting, but also far from being mature right now, unfortunately. Is there a roadmap, set of definitions or specifications or any other data where interested people could get informed about your plans?

No, this is just a set of ideas we're forming as we build with dat & beaker. I agree that it's too early to go into production-mode with using a new trust model on top of Dat. I think Dat's a great protocol to distribute images, but I'd still use existing code signature techniques on top of using Dat.

You don't want to publish updates until a "crowd-sourced" trust has been reached, but how would that be possible without pushing updates to at least some users?

That's not I'm suggesting there. You'd already have a trust network established for the release: that is, the pubkeys you trust to publish or withdraw a release. The purpose of the delay would be to give the owning orgs a chance to notice and react to a compromise in those trusted actors.

So, a simple example scenario that could work right now: you have an app you build, and the .appimage is signed by your dev laptop (1 sig). Somebody steals your laptop and publishes a compromised version. If there was a 24 hour delay before clients auto-download the update, that'd give you time to access the .appimage host and take down the bad version.

Same idea here.

By the way, I'd like to invite you into our IRC channel, #AppImage on Freenode.

Joined!

pfrazee commented 6 years ago

I wrote an article a while back, when I was working on SSB, that tried to summarize a lot of reading I did on trust and reputation analysis. It's overly dense, but the research I linked to was good http://ssbc.github.io/docs/articles/using-trust-in-open-networks.html

Reacting to some of your points @probonopd

HTTPS & DNS do have the problem you mention -- you can phish using "close enough" domain names. It happens pretty frequently.

Graph & reputation analysis - The issue of "SEO gaming" is true. The Advogadro project (see my article) had decent success. It depends on the usecase; if false positives/negatives are dangerous, then you can use graph analysis more as a suggestion.

Stars & user signals - If you filter the stars/signals by "people you follow" or "people in your network" or some similar tool, you improve the value of that signal, but lose potentially good source that you're just not connected. This is why you might want a single node to try to globally crawl and rank everybody -- they can potentially tell you which stars to trust and which ones not. How? Basically, what you're doing is having the crawler try to define the "best people in my network," and then use that set to filter signals such as stars (and therefore cut out the spam). Again, check out advogadro or pagerank (in my article). Graph analysis is a way to expand your network of trust without having to manually evaluate each new connection.

probonopd commented 6 years ago

Someone already had mentioned this idea 12 years ago in an article about klik (AppImage predecessor):

it's a good idea to integrate a p2p network on it, such as bittorrent, so that once it's popular, the servers aren't down because of too much people downloading, or you start getting problems of connection. It would be nive to kind of force people using p2p in this case.

https://dot.kde.org/comment/44508#comment-44508

probonopd commented 6 years ago

User stories

As a user in a local network, I want downloads (and ideally delta updates!) to be fast and efficient because not everything gets downloaded from the Internet, but instead as much as possible is downloaded from the local network. Ideally, I don't notice anything special when downloading and/or updating, except that things are fast even though my Internet line is not. This may be extremely important for users with local networks but slow network connections
As a user in a country from where access to GitHub Releases and similar locations is slow, I want to still have fast downloads and updates
As a creator of an AppImage, I want to have the option of just sharing it with friends without needing to upload it to some server. I want to be able to share AppImages without having to sign up somewhere, and without having to make the AppImage public to the world (i.e., only who knows the download link/hash/... should be able to download)
As an author of a popular software, I want to make it available to millions of users without having to worry about server cost or being billed or blocked from GitHub Releases and the like for over-usage
As the user, all of this needs to be super simple for me. I should not have to do anything else but to switch on the "Use p2p" switch
As another user, I don't want to use p2p for whatever reason; for me, the whole system should also function without it. I am fine with downloading from p2p-to-http gateways like https://ipfs.io/ipfs/ though because that is normal traffic for me
As a LibreOffice tester, I want to go from LibreOffice nightly to the next nightly without having to re-download the whole thing every day - just the few parts that have changed... (note: for this we have a working solution, currently using zsync2 and HTTP Range Requests)

This implies:

The same AppImage should always result in the same hash, so that if two users share the same AppImage (without knowing from each other) the network should be intelligent enough to treat the shared file as one and the same (ipfs does that)
The p2p mechanism should work with AppImageUpdate (the local and public ipfs daemons appear to support HTTP Range Requests so much of the existing logic could stay in place)
The zsync2 file needs to be stored at a mutable location, i.e., the author must be able to update the content of the same URI. https://github.com/ipfs/faq/issues/232#issuecomment-285661053

Option 1: ipfs

Written in Golang, which means one single binary runs without much hassle pretty anywhere.

There is even a C implementation: https://github.com/Agorise/c-ipfs

To be investigated: Just running the ipfs daemon without using it seems to significantly slow down other download traffic on the machine/in the network.

Setting up ipfs

'/home/me/Downloads/go-ipfs/ipfs' init
'/home/me/Downloads/go-ipfs/ipfs' daemon

Adding an AppImage to ipfs

Of course, appimaged would do this automatically if it detects ipfs is on the $PATH and/or is a running process.

/home/me/Downloads/go-ipfs/ipfs add -q '/isodevice/Applications/AppImageUpdate-8199a82-x86_64.AppImage' | tail -n 1
QmZKVvm9jdF7TTfg8LEWMMsoinDxJEFMVybfzGUfs3dkKB

# Everyone who would add the exact same version of `AppImageUpdate-8199a82-x86_64.AppImage` would get the exact same hash
# TODO: Find out how the hash is calculated

Download this through the browser

http://localhost:8080/ipfs/QmZKVvm9jdF7TTfg8LEWMMsoinDxJEFMVybfzGUfs3dkKB

https://ipfs.io/ipfs/QmZKVvm9jdF7TTfg8LEWMMsoinDxJEFMVybfzGUfs3dkKB <-- global link

Works! But only as long as the machine is online. To change that:

http://ipfsstore.it/submit.php?hash=QmZKVvm9jdF7TTfg8LEWMMsoinDxJEFMVybfzGUfs3dkKB

This will store it for 30 days and longer if someone sends BTC to the address displayed.

Now, to make this into a redundant cluster, we could set up https://github.com/ipfs/ipfs-cluster/ - since one can set up redundancy and automatic replication, we could probably use the cheapest hosting we can find...

Range requests are apparently supported: https://ipfs.io/ipfs/QmXoypizjW3WknFiJnKLwHCnL72vedxjQkDDP1mXWo6uco/wiki/Byte_serving.html

See web interface

http://localhost:5001/webui

ZeroConf

_ipfs-discovery._udp is already implemented for looking up other ipfs daemons on the local network, https://github.com/ipfs/go-ipfs/issues/520. Code: https://github.com/libp2p/go-libp2p/blob/e4966ffb3e7a342aaf5574d9a5c0805454c07baa/p2p/discovery/mdns.go#L24

It is not used to announce files on the LAN, however (we would need to do this ourselves).

Delta updates

https://ipfs.io/blog/17-distributions/ says:

It may also make downloading new versions much faster, because different versions of large binary files often have lots of duplicated data. IPFS represents files as a Merkle DAG (a datastructure similar to a merkle tree), much like Git or BitTorrent. Unlike them, when IPFS imports files, it chunks them to deduplicate similar data within single files. So when you need to download a new version, you only download the parts that are new or different - this can make your future downloads faster!

So, it looks like that while we can continue to use zsync it may not even be needed?

Deduplication between different AppImage files

Asked for opinions re. intelligent chunking for better deduplication on the IPFS forum, https://discuss.ipfs.io/t/ipfs-for-appimage-distribution-of-linux-applications/1553

On IRC #ipfs, someone pointed out:

probono > Could we have IPFS do the chunking of the Live ISO's squashfs based on the individual files that make up a Linux Live ISO? (Or AppImage) whyrusleeping > kinda like the tar importer probono > whyrusleeping: with the tar importer, can i get the "original tar" back out of the system? probono > which a matching checksum? whyrusleeping > probono: yeah, with the tar export command

Potential AppImage workflow

User installs ipfs using whatever method he wants (e.g., we could also bundle it in the appimaged AppImage)
User opts into p2p sharing
We could optionally check the AppImage for metadata (e.g., license information) that allows p2p sharing
appimaged execs ipfs add -q 'Some.AppImage' if it is on the $PATH
appimaged gets back QmZKVvm9jdF7TTfg8LEWMMsoinDxJEFMVybfzGUfs3dkKB
For LAN: appimaged announces QmZKVvm9jdF7TTfg8LEWMMsoinDxJEFMVybfzGUfs3dkKB on the local network with Zeroconf (probably in a JSON feed together with some metadata such as the filenames etc.)
For WAN: zsyncmake2 calculates QmZKVvm9jdF7TTfg8LEWMMsoinDxJEFMVybfzGUfs3dkKB as well and puts it into a custom header like X-ipfs-hash
For WAN: zsnyc2, when seeing X-ipfs-hash and when having ipfs on the $PATH, tries downloading from http://localhost:8080/ipfs/QmZKVvm9jdF7TTfg8LEWMMsoinDxJEFMVybfzGUfs3dkKB; else downloads as usual; if that fails, downloads from https://ipfs.io/ipfs/QmZKVvm9jdF7TTfg8LEWMMsoinDxJEFMVybfzGUfs3dkKB

https://github.com/ipfs/faq/issues/59

Option 2: Hook libp2p into zsync2

Viable? Pros? Cons? https://github.com/TheAssassin/zsync2/issues/15

Option 3: dat

To be written

Written in nodejs, which means npm and friends are needed to set it up. A C library is still in a very early stage: https://github.com/mafintosh/libdat

Pros and cons

From https://docs.datproject.org/faq:

How is Dat different than IPFS?

IPFS and Dat share a number of underlying similarities but address different problems. Both deduplicate content-addressed pieces of data and have a mechanism for searching for peers who have a specific piece of data. Both have implementations which work in modern Web browsers, as well as command line tools.

The two systems also have a number of differences. Dat keeps a secure version log of changes to a dataset over time which allows Dat to act as a version control tool. The type of Merkle tree used by Dat lets peers compare which pieces of a specific version of a dataset they each have and efficiently exchange the deltas to complete a full sync. It is not possible to synchronize or version a dataset in this way in IPFS without implementing such functionality yourself, as IPFS provides a CDN and/or filesystem interface but not a synchronization mechanism.

Dat has also prioritized efficiency and speed for the most basic use cases, especially when sharing large datasets. Dat does not make a duplicate of the data on the filesystem, unlike IPFS in which storage is duplicated upon import (Update: This can be changed for IPFS too, https://github.com/ipfs/go-ipfs/issues/3397#issuecomment-284337564). Dat's pieces can also be easily decoupled for implementing lower-level object stores. See hypercore and hyperdb for more information.

In order for IPFS to provide guarantees about interoperability, IPFS applications must use only the IPFS network stack. In contrast, Dat is only an application protocol and is agnostic to which network protocols (transports and naming systems) are used.

For investigation

Deduplication between different packages

Wouldn't it be cool if e.g., all AppImages containing Qt could deduplicate data? Check https://github.com/ipfs/notes/issues/84 where it talks about deduplication.

AppImageHub data

Coud probably be decentralilzed in a database as well, e.g., using https://github.com/orbitdb/orbit-db/blob/master/API.md

whyrusleeping commented 6 years ago

@probonopd the chunking we talked about in irc could be pretty useful here. As a quick hack I would be interested to see what sort of deduplication you get across different images using rabin fingerprinting: ipfs add -s=rabin. This uses content defined chunking and should ideally produce a better layout that the default fixed width chunking (at least for this usecase).

If you add two different files with the rabin fingerprinting, you could do ipfs refs -r <hash> on each (which lists each block) and see how many hashes are the same between each file.

KurtPfeifle commented 6 years ago

@whyrusleeping: If that indeed would work, it would be a pretty cool feat!

whyrusleeping commented 6 years ago

@KurtPfeifle do you have a list of images somewhere I could download and try this out on?

KurtPfeifle commented 6 years ago

@whyrusleeping:

A list of crowd-sourced AppImages and their respective download locations is here:

https://appimage.github.io/apps/

KurtPfeifle commented 6 years ago

...and here are LibreOffice AppImages in quite a few different combos: old releases, recent releases, daily/nightly builds -- all with various localizations enabled and combined:

whyrusleeping commented 6 years ago

Downloaded a random sampling of images:

why@whys-mbp ~/appimagetest> ls -l images/
total 4408472
-rw-rw-rw-@ 1 why  staff   25189240 Dec  3 19:08 KeePassXC-2.2.2-2-x86_64.AppImage
-rw-rw-rw-@ 1 why  staff  209290936 Dec  3 19:29 LibreOffice-6.0.0.0.beta1-x86_64.AppImage
-rw-rw-rw-@ 1 why  staff  333555384 Dec  3 19:28 LibreOffice-6.0.0.0.beta1.full-x86_64.AppImage
-rw-rw-rw-@ 1 why  staff  248202936 Dec  3 19:27 LibreOffice-6.0.0.0.beta1.standard-x86_64.AppImage
-rw-rw-rw-@ 1 why  staff  372242104 Dec  3 19:29 LibreOffice-6.0.0.0.beta1.standard.help-x86_64.AppImage
-rw-rw-rw-@ 1 why  staff  206177976 Dec  3 19:26 LibreOfficeDev-6.0.0-x86_64.AppImage
-rw-rw-rw-@ 1 why  staff  206177976 Dec  3 19:27 LibreOfficeDev-6.1.0.0.alpha0_2017-11-26-x86_64.AppImage
-rw-rw-rw-@ 1 why  staff  206177976 Dec  3 19:26 LibreOfficeDev-daily-x86_64.AppImage
-rw-rw-rw-@ 1 why  staff   27469496 Dec  3 19:08 Qt_DAB-x86_64.AppImage
-rw-rw-rw-@ 1 why  staff   28583608 Dec  3 19:09 Woke-de0a968-x86_64.AppImage
-rw-rw-rw-@ 1 why  staff   11702008 Dec  3 19:09 XChat_IRC-5d0dbe1-x86_64.AppImage
-rw-rw-rw-@ 1 why  staff   63766528 Dec  3 19:08 alduin-2.0.1-x86_64.AppImage
-rw-rw-rw-@ 1 why  staff   68878336 Dec  3 19:22 draw.io-7.7.3-x86_64.AppImage
-rw-rw-rw-@ 1 why  staff   52363264 Dec  3 19:21 vessel-0.0.9-x86_64.AppImage
-rw-rw-rw-@ 1 why  staff  109641728 Dec  3 19:22 wire-3.0.2816-x86_64.AppImage

Then added them to a clean ipfs directory, and got the following results:

why@whys-mbp ~/appimagetest> ipfs add -r -s=rabin images/
... ipfs adding things ...
why@whys-mbp ~/appimagetest> du -h -d0 ipfs/
1005M   ipfs/
why@whys-mbp ~/appimagetest> du -h -d0 images/
2.1G    images/

Cutting the 'average block size' target in half to 128k results in slightly better results, but the cost you pay will be in slightly higher transfer times.

why@whys-mbp ~/appimagetest> ipfs add -r -s=rabin-128000 images/
... ipfs adding things ...
why@whys-mbp ~/appimagetest> du -h -d0 ipfs2
947M    ipfs2

The results were probably better than average due to the number of libreoffice images i included, but still pretty nice.

whyrusleeping commented 6 years ago

Also, for context, the default ipfs chunker:

why@whys-mbp ~/appimagetest> du -h -d0 ipfs4
1.4G    ipfs4

KurtPfeifle commented 6 years ago

Looks like these three LibreOffice downloads could even be exactly the same files, despite their different names:

 -rw-rw-rw-@ 1 why  staff  206177976 Dec  3 19:26 LibreOfficeDev-6.0.0-x86_64.AppImage
 -rw-rw-rw-@ 1 why  staff  206177976 Dec  3 19:27 LibreOfficeDev-6.1.0.0.alpha0_2017-11-26-x86_64.AppImage
 -rw-rw-rw-@ 1 why  staff  206177976 Dec  3 19:26 LibreOfficeDev-daily-x86_64.AppImage

Which would skew the results even more. But still pretty good!

whyrusleeping commented 6 years ago

@KurtPfeifle ah, great catch. let me remove those from the sample.

whyrusleeping commented 6 years ago

Two of them were the exact same file, the other was likely very similar:

88d17b863625b08eda45723dae81a866020b7615  images/LibreOfficeDev-6.0.0-x86_64.AppImage
5d785c15ab1e989e8b2605f67768527578ace031  images/LibreOfficeDev-6.1.0.0.alpha0_2017-11-26-x86_64.AppImage
88d17b863625b08eda45723dae81a866020b7615  images/LibreOfficeDev-daily-x86_64.AppImage

In #ipfs IRC, I mentioned that you could implement a custom chunker for AppImage files that would intelligently break the file up on known internal file boundaries as a way of maximizing deduplication. The internal ipfs interface for this looks like this which basically wraps a stream of data and provides a way for the caller to read it a chunk at a time (with whatever underlying logic you want). If you don't fancy writing go, you can write your chunker in whatever language you like, and build the ipfs graph manually via the api. Can someone link me to documentation on the AppImage format?

KurtPfeifle commented 6 years ago

The AppImageSpec is here: https://github.com/AppImage/AppImageSpec.

Its main content/payload is a SquashFS-compressed AppDir structure of files, which is prepended by a small binary called "AppRun".

You can invoke type 2 AppImages like any.Appimage --appimage-help, which will tell you that --appimage-extract will extract the payload into the original AppDir structure (currently this will land in a directory named squashfs-root).

KurtPfeifle commented 6 years ago

@whyrusleeping:

"Two of them were the exact same file, the other was likely very similar"

I'm wondering that the chunker did not realize that the complete files were identical and report it somehow?

What are the respective numbers for the tests (after removing copies) you reported about in comment further up?

whyrusleeping commented 6 years ago

It did realize, I just didnt notice. See here:

why@whys-mbp ~/appimagetest> ipfs add -r  images/
added QmSxi8GJRZeA5U9g4KZvJLAtyZ9iRMGtu1zjC2rPMYFrQ3 images/KeePassXC-2.2.2-2-x86_64.AppImage
added QmQBBEbx32uEAgTKLA1aXQJKFTVmkv1qdeuAEtwuDfgVbT images/LibreOffice-6.0.0.0.beta1-x86_64.AppImage
added QmNdXpac5RysqLYj51Px2ihFBSVZF7yaUop7f8vBKZZNMy images/LibreOffice-6.0.0.0.beta1.full-x86_64.AppImage
added QmaoEbTvrnzTgZCJU9c5CMvnKSUeDFGmEsV7cMciXvs4ue images/LibreOffice-6.0.0.0.beta1.standard-x86_64.AppImage
added QmXA8HarASPeEEWtn6Xq6MRzrRuqHjkqcGtvGcQhRQJMJB images/LibreOffice-6.0.0.0.beta1.standard.help-x86_64.AppImage
added QmdvcixcM86TePK8BQNQeA4Qfd9fnosaJpUwPaBPifkUmY images/LibreOfficeDev-6.0.0-x86_64.AppImage
added Qmatc8i6sqfEfSXqzowGComcbSi3rXjvAfRGP2pVxwH2rQ images/LibreOfficeDev-6.1.0.0.alpha0_2017-11-26-x86_64.AppImage
added QmdvcixcM86TePK8BQNQeA4Qfd9fnosaJpUwPaBPifkUmY images/LibreOfficeDev-daily-x86_64.AppImage
added QmaLZdCrxgtDmosad5RLVyNkhWXkETFaQBjSinyoG8x9Y2 images/Qt_DAB-x86_64.AppImage
added QmVyBQiDmozp49h3F2yVC5AHJcPTfywyeaUXFQPmhuzq4C images/Woke-de0a968-x86_64.AppImage
added QmdDD7VUdmraqrCsrJokDGLVDMD1SCRDakUinGfAmAzCim images/XChat_IRC-5d0dbe1-x86_64.AppImage
added QmWFzH1vuZpbZa8Qr38Pzt8AyakifthBByKd68UYtgbFev images/alduin-2.0.1-x86_64.AppImage
added QmbacmHuFDccQ2emSAKhYKQzwGrYeFSLXiUn6ydLuj81yi images/draw.io-7.7.3-x86_64.AppImage
added QmabkdT6QaWVEcNc2VQXGHKr2fVwsZRtppy7XGow29v5BZ images/vessel-0.0.9-x86_64.AppImage
added QmbbsV2VSCPjuS8AGRC9Po2n6sAreEQAkjBv43vEsUF69e images/wire-3.0.2816-x86_64.AppImage
added QmcWoJBRet9rvZKGHen9WCrRTRdUKZKvMSLLZfr4PnQMM5 images

The hashes of those two files are the same (ending in UmY). which means they are only stored once.

probonopd commented 6 years ago

@whyrusleeping

In #ipfs IRC, I mentioned that you could implement a custom chunker for AppImage files that would intelligently break the file up on known internal file boundaries as a way of maximizing deduplication.

Indeed, this sounds like the right thing to do.

The internal ipfs interface for this looks like this which basically wraps a stream of data and provides a way for the caller to read it a chunk at a time (with whatever underlying logic you want). If you don't fancy writing go, you can write your chunker in whatever language you like, and build the ipfs graph manually via the api.

An AppImage is basically a squashfs filesystem image prepended by a small ELF executable (which mounts the AppImage using FUSE when it is executed, and runs the application contained in the squashfs filesystem).

As for the squashfs filesystem itself, we currently are using Exportable Squashfs 4.0 filesystem, gzip compressed, data block size 131072. That is probably also not ideal because the blocks do not fall on the boundaries of individual files.

We are not married to that, in fact we could but could use other squashfs variants using zlib, lz4, or xz and have been considering a switch to Zstandard compression for the squashfs filesystem.

To complicate things a bit, we are also already doing binary delta updates over HTTP(S) using zsync, which also works by chunking and checksumming the chunks. And uses different block sizes...

Reminds me of this:

Handling for compressed files. rsync is ineffective on compressed files, unless they are compressed with a patched version of gzip. zsync has special handling for gzipped files, which enables update transfers of files which are distributed in compressed form.

We have never come around to understand and/or use this so far in zsync2.

So in short, we could really need the help of someone who understands these things better than us.

Can someone link me to documentation on the AppImage format?

AppImage spec (Type 2 image format is the most recent one, but if beneficial we could do a type 2): https://github.com/AppImage/AppImageSpec/blob/master/draft.md#type-2-image-format
AppImageUpdate uses the zsync protocol, our implementation is at https://github.com/TheAssassin/zsync2

probonopd commented 6 years ago

Answer above updated with details @whyrusleeping.

shouhuanxiaoji commented 6 years ago

wish a solution support http direct download and p2p download as well , so if when p2p speed download slowly, it can auto-switch to http download

probonopd commented 6 years ago

Point in case:

Cross-reference: https://github.com/probonopd/uploadtool/issues/28

probonopd commented 6 years ago

On IRC:

"We now have four node servers in Hongkong, Shanghai, Beijing, Singapore, to speed up the download, and it can become a IPFS node immediately."

So, we need to find a way to tell these servers which AppImages to download and pin. What is the best way to do this?

First Idea: As part of the automated quality control we do on AppImageHub, calculate the IPFS hash. Then the cluster could pin these hashes. Not the best idea. Because that way we would have to run this on every version of every AppImage.

Second idea: So we need to find a way for the person who generates an AppImage (or a new version of it) to submit a permalink (=IPNS hash that always points to the latest version) to AppImageHub. AppImageHub would then get the AppImage from there, and if it passes validation, store the IPNS hash in a list that the IPFS cluster could pin.

Third idea: Can we use a p2p database for this and replace the central checking at AppImageHub with something distributed?

hsanjuan commented 6 years ago

@probonopd The difference between First Idea and Second Idea is that ipfs-publishing is just opt-in in the second Idea, while it happens for everyone on the other case, or is it more subtle? Perhaps you mean that only certain versions (say marked as stable) are meant to be pinned?

If I get it correctly, it would seem that AppImageHub needs to fetch the AppImage and validate in any case.

Maybe you can use ipfs pubsub (https://ipfs.io/blog/25-pubsub/) so that everyone can whisper new appimage hashes mean to be pinned. Once validated you can run ipfs-cluster to pin things in multiple servers. You can do it without ipfs-cluster too, but it provides a nice layer for maintaining a pinset in multiple locations.

Or you could run something similar to our IRC pin-bot like we do (https://github.com/ipfs/pinbot-irc) as the interface for people to submit new hashes.

probonopd commented 6 years ago

@probonopd The difference between First Idea and Second Idea is that ipfs-publishing is just opt-in in the second Idea, while it happens for everyone on the other case, or is it more subtle?

Well, in theory we would like to have an entirely peer-produced, web-of-trust based solution for publishing "known good" AppImages. As long as such a solution does not exist, we would run AppImages though a (centralized) test and publish a list of known-good AppImages (or their hashes) from there.

Thanks for pointing to ipfs pubsub.

probonopd commented 6 years ago

On a related note,

Cloudflare goes InterPlanetary - Introducing Cloudflare’s IPFS Gateway

https://blog.cloudflare.com/distributed-web-gateway/

probonopd commented 6 years ago

Endless OS (a Debian-based distribution) is using a combination of OSTree, Flatpak, and Avahi to realize "peer-to-peer updates": https://github.com/endlessm/eos-updater

probonopd commented 5 years ago

Played a bit with ipfs today and it seems rather resource intensive, and a lot of steps are needed like:

go-ipfs/ipfs add -w -n --nocopy -q -s=rabin-128000 -r -- /some/directory # High CPU usage for a long time

QmWv...

# Need to pin them
go-ipfs/ipfs pin add QmWv...

# Need to find a way that new files that get added to the directory
# get shared automatically. Unfortunately I am not an IPFS expert at all
# (need to read some docs)

This may be important if you want to share your whole Applications directory, including new files that may be added to it all the time:

https://lwn.net/Articles/763492/

With dat, to share data, you basically only need to call dat share and you're done: that creates a magic URL and no data is moved around. History can be kept using an external archiver, which means data is duplicated, but it's not the out of the box behavior.

With IPFS, to share data, you would call ipfs add which will copy each file (or its chunks, I don't quite remember) to ~/.ipfs to be globally reference by the ipfs daemon. There's the filestore extension to workaround that, but it's not enabled by default. Furthermore, it's not clear to me changes in the original dataset are automatically tracked the same way they are in dat.

probonopd commented 5 years ago

Galacteek does something scarily clever, it's the first self-seeding AppImage:

galacteek

Hats off to you @eversum. You've beaten us to it ;-)

Now imagine if we had a mechanism for this properly built into the AppImage ecosystem... for all applications (that allow sharing due to their license terms)...

pinnaculum commented 5 years ago

@probonopd Thanks!

The idea came up after distributing the AppImages "the standard way" and finding it inadequate. Bundling go-ipfs in the AppImage gave me other ideas including using the filename as CID to enable self-seeding by using IPFS pinning.

I've just updated the AppImage.

probonopd commented 5 years ago

Investigate https://lbry.tech/ as a means for distributing software in a decentralized way.

probonopd commented 3 years ago

@antony-jr has a working proof-of-concept implementation ready :+1:

https://github.com/AppImage/zsync2/issues/24

How can we best take it from there?
What remains to be solved?

Reference: https://twitter.com/probonopd/status/1320054292275933184

probonopd commented 3 years ago

Was pointed to IPFS vs WebTorrent: What the value of using IPFS instead of torrent files?.

AppImage / AppImageKit

Investigate peer-to-peer AppImage distribution #175

User stories

Option 1: ipfs

Setting up ipfs

Adding an AppImage to ipfs

Download this through the browser

See web interface

ZeroConf

Delta updates

Deduplication between different AppImage files

Potential AppImage workflow

Option 2: Hook libp2p into zsync2

Option 3: dat

Pros and cons

For investigation

Deduplication between different packages

AppImageHub data