AppImageCommunity / zsync2

Rewrite of https://github.com/AppImage/zsync-curl, using modern C++, providing both a library and standalone tools.

Investigate libp2p #15

Open probonopd opened 6 years ago

probonopd commented 6 years ago

Given that zsync2 is already doing hashing of blocks, how complicated would it be to add libp2p into the mix?

https://github.com/libp2p/specs

(Not asking to implement this yet, just asking to have a look at it to get a rough understanding of what would be needed and what we would gain from it, especially compared to just using IPFS as-is.)

TheAssassin commented 6 years ago

@probonopd please elaborate. libp2p is just the base of the actual IPFS implementation, and seems pretty low-level. Also, you linked to the specs, which would be relatively uninteresting for an actual implementation of IPFS.

probonopd commented 6 years ago

Well, the thought goes like this: AppImageUpdate already uses zsync2 which already chunks and hashes files. "All that is missing" is a p2p transport instead of using HTTP Range Requests. Can libp2p give us just that piece, and wouldn't that be a lightweight way to add p2p capability into the system?

As an alternative, we can leave things as they are and use IPFS, which can serve HTTP Range Requests. But it kinda feels awkward to have two chunking and hashing mechanisms.
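To make the "missing piece" concrete, here is a rough Python sketch of the request side that a p2p transport would have to replace: mapping missing block indices onto HTTP Range headers. The helper name and structure are hypothetical, not zsync2's actual internals.

```python
def blocks_to_range_header(missing_blocks, block_size):
    """Coalesce consecutive missing block indices into inclusive byte
    ranges and format them as one HTTP Range header value.
    Hypothetical helper; zsync2's real internals differ."""
    if not missing_blocks:
        return None
    runs = []
    start = prev = missing_blocks[0]
    for b in missing_blocks[1:]:
        if b == prev + 1:
            prev = b
        else:
            runs.append((start, prev))
            start = prev = b
    runs.append((start, prev))
    # Each run of consecutive blocks becomes one inclusive byte range.
    ranges = [f"{s * block_size}-{(e + 1) * block_size - 1}" for s, e in runs]
    return "bytes=" + ", ".join(ranges)

# e.g. blocks 0-2 and 5 missing, 1 KiB blocks:
print(blocks_to_range_header([0, 1, 2, 5], 1024))
# bytes=0-3071, 5120-6143
```

A p2p transport would take the same (block index, hash) pairs and fetch them from peers instead of from a byte-range-capable HTTP server.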

probonopd commented 6 years ago

For a p2p mechanism, we may also want our chunking mechanism to be content-aware, that is, not to use fixed block sizes over which the checksums are calculated, but something per-file. This way, a library that appears in more than one AppImage would get the same hash and would be stored on the p2p network just once, greatly increasing its availability... and speed, right?
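As a rough illustration of what content-aware (content-defined) chunking buys: boundaries depend only on the bytes in a small trailing window, so inserting data shifts the first chunk but leaves most later chunks identical, letting them deduplicate across files. This is a toy sketch, not anything zsync2 implements; real systems use Rabin fingerprints or FastCDC rather than hashing the window with MD5.

```python
import hashlib, random

def cdc_chunks(data, window=16, mask=0x3F, min_size=32):
    """Content-defined chunking: cut a boundary whenever the hash of the
    trailing `window` bytes has its low 6 bits zero (~64-byte chunks).
    Illustrative only; real systems use Rabin fingerprints or FastCDC."""
    chunks, start = [], 0
    for i in range(window - 1, len(data)):
        if i - start + 1 < min_size:
            continue  # enforce a minimum chunk size
        win = data[i + 1 - window:i + 1]
        h = int.from_bytes(hashlib.md5(win).digest()[:4], "big")
        if h & mask == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks

# A prefix insertion shifts every fixed-size block, but most
# content-defined chunks come out identical and could be deduplicated:
random.seed(0)
data = bytes(random.randrange(256) for _ in range(4000))
orig = set(cdc_chunks(data))
shifted = set(cdc_chunks(b"PREFIX" + data))
print(f"{len(orig & shifted)} of {len(orig)} chunks reused")
```

The same effect applies across different AppImages embedding the same library: identical content yields identical chunks regardless of where it sits in the file.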

TheAssassin commented 6 years ago

Re being content-aware: I've finally gotten around to creating https://github.com/AppImage/AppImageUpdate/issues/53, which discusses content-aware chunking as a way to improve the overall efficiency of our update system. I've been meaning to suggest this for at least a month; your issue reminded me of it.

Actually, squashfs should generate the same set of blocks for an identical file (adding some padding to fill up the remaining bytes in a block before inserting the next file). But with the current parameters used for zsync2 and mksquashfs, it doesn't seem to perform as efficiently as it should. That's why I want to invest some time, measure the efficiency, and tune the parameters to our well-defined needs to get as close as possible to the maximum achievable efficiency.

Re. IPFS, I'd have to have a closer look; I've heard of the system but don't know how it works. It doesn't seem inefficient at all, but instead of a block-wise distribution approach it uses a file-based one, i.e., it can synchronize and share a single file, but you can't reuse the blocks of previous files.

That kind of deduplication is what zsync2 provides. If you want to "marry" both approaches, as you say, the only addition we could bring to IPFS is making use of existing data on a system, and I can't think of a way to implement that efficiently. Downloading with zsync2 and afterwards sharing via IPFS actually seems to provide that kind of functionality already. But as mentioned previously, further investigation is needed anyway.

Adding low priority for now. I actually think this issue is blocked by https://github.com/AppImage/AppImageUpdate/issues/53 anyway.

antony-jr commented 6 years ago

@probonopd since we don't require HTTP in p2p (in p2p, the client is also a server), couldn't we implement rsync directly, without zsync? Correct me if I'm wrong...

Also note that we would need a central indexing server for the peers, which makes it hard for a normal user to set up updates.

From the zsync technical paper:

There are alternative download technologies like BitTorrent, which break up the desired file into blocks, and retrieve these blocks from a range of sources [[BitT2003]]. As BitTorrent provides checksums on fragments of file content, these could be used to identify content that is already known to the client (and it is used for this, to resume partial downloads, I believe). But reusing data from older files is not a purpose of this data in BitTorrent — only if exactly matching blocks could be identified would the data be any use.
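The paper's point about fixed blocks can be demonstrated in a few lines of Python: insert a single byte at the front of a file and no fixed-offset block hash matches anymore, because every block's content has shifted. (A zsync-style rolling checksum would still find the old blocks at their shifted offsets.)

```python
import hashlib

def fixed_block_hashes(data, block_size=1024):
    # Hash fixed-size blocks at fixed offsets, BitTorrent-style.
    return [hashlib.sha1(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

old = bytes(range(256)) * 16        # 4096 bytes of periodic data
new = b"\x00" + old                 # one byte inserted at the front
shared = set(fixed_block_hashes(old)) & set(fixed_block_hashes(new))
print(len(shared))  # 0: every block's offset shifted, nothing matches
```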

But we could avoid zsync entirely and switch to Dat; Dat can update files too. So why use zsync with Dat?

With Dat we could even avoid embedding the update information, since it has an update history like git.

TheAssassin commented 6 years ago

@probonopd since we don't require HTTP in p2p (in p2p, the client is also a server), couldn't we implement rsync directly, without zsync? Correct me if I'm wrong...

You asked to be corrected, so here it is: the zsync algorithm doesn't have anything to do with HTTP; HTTP is just one possible transport. zsync provides a way of combining chunks of one or more existing files into a new file, described by a control file, removing the need for a server-side component (which rsync does require: there is no rsync without a remote rsync process acting as the server).

So, that point doesn't really make sense...
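To make the matching idea above concrete, here is a minimal Python sketch of an rsync/zsync-style weak rolling checksum locating blocks of an old file at arbitrary offsets in a new one. All names and parameters are illustrative, not zsync2's actual code; zsync additionally verifies candidates with a strong hash, which this sketch approximates by comparing bytes directly.

```python
import random

def weak_sum(block):
    # rsync/zsync-style weak checksum: two 16-bit running sums.
    a = sum(block) & 0xFFFF
    b = sum((len(block) - i) * c for i, c in enumerate(block)) & 0xFFFF
    return (b << 16) | a

def find_reusable(old, new, bs=64):
    """Slide a window over `new` one byte at a time, rolling the weak
    checksum in O(1), and confirm candidate hits byte-for-byte.
    Returns (offset_in_new, offset_in_old) pairs. Sketch only."""
    table = {weak_sum(old[i:i + bs]): i
             for i in range(0, len(old) - bs + 1, bs)}
    a = sum(new[:bs]) & 0xFFFF
    b = sum((bs - j) * c for j, c in enumerate(new[:bs])) & 0xFFFF
    matches, i = [], 0
    while True:
        s = (b << 16) | a
        if s in table and new[i:i + bs] == old[table[s]:table[s] + bs]:
            matches.append((i, table[s]))
        if i + bs >= len(new):
            break
        # Roll: drop the leftmost byte, pull in the next one.
        a = (a - new[i] + new[i + bs]) & 0xFFFF
        b = (b - bs * new[i] + a) & 0xFFFF
        i += 1
    return matches

random.seed(1)
old = bytes(random.randrange(256) for _ in range(512))
new = b"hdr" + old  # identical content, shifted by three bytes
print(find_reusable(old, new)[:2])  # old blocks found at shifted offsets
```

Nothing here depends on where the bytes come from: the missing ranges this identifies could be fetched over HTTP, from peers, or anywhere else, which is the sense in which HTTP is just one possible transport.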

probonopd commented 6 years ago

With Dat we can even avoid embeding the update information since it has update history like git.

Dat vs. IPFS: I still don't understand the differences well enough to decide between them. Why can't they just merge... :-)