AppImageCommunity / zsync2

Rewrite of https://github.com/AppImage/zsync-curl, using modern C++, providing both a library and standalone tools.

Using bittorrent protocol with web seeding for possible P2P support for zsync2 #24

Closed antony-jr closed 6 years ago

antony-jr commented 6 years ago

Conversation moved from #22, to stay on topic.

antony-jr commented 6 years ago

@TheAssassin Kademlia DHT lookup complexity is O(log(n)) (where n is the number of nodes), so the performance should not be that bad (for example, with a million nodes a lookup only needs on the order of log2(10^6) ≈ 20 hops).

TheAssassin commented 6 years ago

I doubt the entire complexity of a distributed hash table can be described with a single Big-O formula.

Anyway, the problem I meant is the loss of nodes in the DHT. In the scenario you described, nodes get lost after finishing their update (this is referred to as leeching). Every time this happens, a part of the DHT gets lost, and must be rebuilt and synced from the other nodes. This is fairly inefficient. Also, the ratio of data fetched from peers can't be very high due to that.

IPFS, Dat etc. are run as external services on the host, i.e., they run permanently, not just temporarily. This allows for building a real distributed network that is both efficient and a lot easier to use for the client (no torrent files, but simply an IPFS URI that needs to be known, which can be generated with a few simple steps).

antony-jr commented 6 years ago

Anyway, the problem I meant is the loss of nodes in the DHT. In the scenario you described, nodes get lost after finishing their update (this is referred to as leeching). Every time this happens, a part of the DHT gets lost, and must be rebuilt and synced from the other nodes.

@TheAssassin Yes, that needs to be taken into account, but I will find a solution for this (there is a method to reduce this overhead; I'm not quite sure where I read it). Also, keep in mind that we have a stable HTTP(S) web server that can back the swarm when the torrent is not popular, keeping it alive for as long as the HTTP(S) web server is available, so we don't really need to worry about leechers.

IPFS, Dat etc. are run as external services on the host, i.e., they run permanently, not just temporarily. This allows for building a real distributed network that is both efficient and a lot easier to use for the client (no torrent files, but simply an IPFS URI that needs to be known, which can be generated with a few simple steps).

Even IPFS uses a DHT to find peers, so I think using the bare minimum will give us more performance, whereas IPFS wraps all of that in a lot of extra layers.

Anyway, I will ping you when I have a working experiment that you can test to see how it performs (best case and worst case).

TheAssassin commented 6 years ago

Also, keep in mind that we have a stable HTTP(S) web server that can back the swarm when the torrent is not popular, keeping it alive for as long as the HTTP(S) web server is available, so we don't really need to worry about leechers.

This statement is pretty contradictory. I think you meant to keep the torrent alive until new peers connect.

Even IPFS uses a DHT to find peers, so I think using the bare minimum will give us more performance, whereas IPFS wraps all of that in a lot of extra layers.

That's not the point. What I mean is that there's an ipfs daemon, in contrast to a "bittorrent daemon", which stays alive in the background even after an update has finished.

Just because they use a DHT doesn't mean it's exactly the same as BitTorrent. There are fundamental differences, which make it much more suitable for our needs.

antony-jr commented 6 years ago

This statement is pretty contradictory. I think you meant to keep the torrent alive until new peers connect.

I'm talking about web seeding here. With web seeding we don't really need to worry about seeders: when the swarm is inactive the web server serves the content, and when it's popular the swarm takes over. It's pretty slick compared to IPFS.

That's not the point. What I mean is that there's an ipfs daemon, in contrast to a "bittorrent daemon", which stays alive in the background even after an update has finished.

Just because they use a DHT doesn't mean it's exactly the same as BitTorrent. There are fundamental differences, which make it much more suitable for our needs.

They are very similar to BitTorrent. Check out Filecoin (a service that pays you to keep a file on your system, which helps IPFS keep files available in the worst case, i.e. just like seeding for money in BitTorrent); it is built on top of IPFS by the same organization. What I'm trying to say is that you need seeders to download files from IPFS just like BitTorrent, except that IPFS invented Filecoin to turn seeding into a business trade, just like how Bitcoin rewards miners. I don't know whether IPFS has anything like web seeding (I think it doesn't); point it out if this feature is available.

Also read up on WebTorrent (the technology is really interesting and shows the power of web seeding; you can see the live demo there and how fast it is).

antony-jr commented 6 years ago

I'm choosing the BitTorrent protocol over IPFS because of how it can be layered on top of our current client-server model, always making the best of that model by balancing the load when there are too many peers.

TheAssassin commented 6 years ago

Sure, go ahead, I guess at this point we need to see some example code.

probonopd commented 6 years ago

Looking forward to seeing something we could test.

antony-jr commented 6 years ago

As @TheAssassin said, a torrent cannot be completely 'trackerless'; it is virtually impossible to discover peers without a central server. But the tracker will only be used for bootstrapping the network (as a bootstrapping node), and after that it is not used (and the torrent becomes 'trackerless'). The good part is that torrent trackers are cheap and freely available (and yes, it is legal; we can use archive.org, which provides legal torrents, for free). We will not overload the tracker, since we will only use it to bootstrap (most torrent clients do this automatically). Any comments on this?

In my opinion this is far better than using IPFS and paying for Filecoin to keep our files alive in the swarm (not to mention running a daemon in the background). It just works like any other torrent, and it's all free. We have a win-win situation here: even if the trackers go into the void in the near future, we can just buy a cheap server from DigitalOcean or use a free one from Heroku (since the tracker is just a bootstrap node). But I'm pretty sure archive.org will not go away that soon, since it has been around for a long time.

Even if all trackers fail, the client automatically falls back to the HTTP source and downloads directly from the server (this does not happen unless the trackers are targeted intentionally).

probonopd commented 6 years ago

Can we please make a comparison regarding the differences and pros/cons of at least Bittorrent, Dat, and IPFS?

Bittorrent is associated in public perception with IP infringements, something the other two don't currently seem to be suffering from. Also, the latter two appear to be better suited for distributed databases.

TheAssassin commented 6 years ago

In my opinion this is far better than using IPFS and paying for Filecoin to keep our files alive

You probably don't know, but this won't be an issue at all. There is an AppImage fan base in China who'd like to provide free IPFS mirroring for AppImages. Also, I am pretty sure you don't need to pay for IPFS traffic if you have your own peers serving the files.

While talking to @probonopd earlier, I also remembered another reason I don't like Bittorrent. As soon as trackers are involved, users' privacy is broken. This is by design: anybody can basically track who's downloading which files. Other P2P systems have solutions to avoid such problems, which make it way harder to gather such metadata. Privacy is a huge point to me.

The daemon in the background is IMO not a problem but actually a big advantage, especially for initial distribution. By sharing files over such a daemon, they're automatically seeded. This means that zsync2 could directly start seeding files in the P2P network, and all it has to do is interact with the IPFS tools. We don't have to manage anything; we just need to tell IPFS to serve the files and get the URI for the file in the IPFS network. We can then put this URI into the .zsync file. This would then be a true P2P solution, without any hacks like web seeding.
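Roughly, the publishing side would look something like this sketch (just an illustration, not existing zsync2 code; it assumes a running ipfs daemon and the CLI's `-Q` flag, which prints only the resulting hash):

```cpp
// Sketch only: publish a file to IPFS from C++ by shelling out to the ipfs CLI.
// Assumes a running ipfs daemon; "ipfs add -Q" prints only the resulting hash.
#include <array>
#include <cstdio>
#include <iostream>
#include <memory>
#include <stdexcept>
#include <string>

std::string ipfsAdd(const std::string &path) {
    const std::string cmd = "ipfs add -Q \"" + path + "\"";
    std::unique_ptr<FILE, decltype(&pclose)> pipe(popen(cmd.c_str(), "r"), pclose);
    if (!pipe) throw std::runtime_error("failed to run ipfs");
    std::array<char, 128> buf{};
    std::string cid;
    while (fgets(buf.data(), buf.size(), pipe.get()) != nullptr) cid += buf.data();
    // Trim the trailing newline from the CLI output.
    while (!cid.empty() && (cid.back() == '\n' || cid.back() == '\r')) cid.pop_back();
    return cid;
}

int main() {
    // The resulting URI would be written into the .zsync file when it is generated.
    std::cout << "ipfs://" << ipfsAdd("MyApp-x86_64.AppImage") << "\n";
}
```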

I'm not sure what you mean with "legal torrents". Torrents aren't illegal at all per se. And regarding the infrastructure, I could just provide a tracker if I wanted to, or we could rent a (really cheap) Hetzner cloud server (see https://www.hetzner.com/cloud?country=us). Might come in handy for some experiments even.

Oh, today I've been browsing IEEE Xplore a bit, to get the full text of the eBay paper, which I'll be reading later. I also found a couple of other related papers. If there's anything interesting in them, I'll let you know.

probonopd commented 6 years ago

I'm not sure what you mean with "legal torrents". Torrents aren't illegal at all per se.

I think he is referring to the fact that the Bittorrent protocol is associated in public perception with IP infringements.

TheAssassin commented 6 years ago

That's not IP but copyright. Damn, I always think of "Internet Protocol" when someone talks about intellectual property....

probonopd commented 6 years ago

Intellectual property rights include Copyright ;-)

TheAssassin commented 6 years ago

IP includes copyright, yes, but also a lot of other things like patents, design rights, trademarks etc.

From all these rights, it's only the copyright which a natural person needs to care about. And it is what companies try to enforce when talking about torrents.

So, let's rather say "copyright infringement". It's a lot more precise.

antony-jr commented 6 years ago

@probonopd Here is a short summary:

Bittorrent/DHT

Nowadays BitTorrent clients use DHT by default to make torrents 'trackerless', and thus truly decentralised. But a tracker is still needed as a bootstrapping node (i.e. to initially discover peers and join the swarm; after that the client does not need the tracker). Obviously, it is virtually impossible to join the DHT without knowing at least a single peer, and that is what the tracker is used for here.

This is not new and is already supported by a lot of torrent libraries (e.g. libtorrent-rasterbar), which makes the job easier at the programmatic level.

Web seeding is fairly cheap: you just have to add the web source as a URL seed in the torrent metadata. Since web seeding is supported by most torrent libraries, this part is also done.

All we have to do is combine zsync with libtorrent-rasterbar, as sketched below.
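For illustration, here is a rough sketch of what the libtorrent-rasterbar side could look like (the .torrent path and the web-seed URL are just placeholders, and error handling is omitted):

```cpp
// Minimal sketch: download a release over BitTorrent while keeping the existing
// HTTP(S) release URL attached as a BEP 19 web seed (libtorrent-rasterbar).
#include <libtorrent/add_torrent_params.hpp>
#include <libtorrent/session.hpp>
#include <libtorrent/torrent_handle.hpp>
#include <libtorrent/torrent_info.hpp>
#include <chrono>
#include <iostream>
#include <memory>
#include <thread>

namespace lt = libtorrent;

int main() {
    lt::session ses;

    lt::add_torrent_params params;
    params.ti = std::make_shared<lt::torrent_info>("MyApp-x86_64.AppImage.torrent");
    params.save_path = ".";
    // The HTTP(S) mirror acts as a web seed: even with zero peers the file can
    // still be fetched from the web server via HTTP range requests.
    params.url_seeds.push_back("https://example.com/releases/MyApp-x86_64.AppImage");

    lt::torrent_handle handle = ses.add_torrent(params);

    // Poll until the download (from peers and/or the web seed) is complete.
    while (!handle.status().is_seeding) {
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    std::cout << "download complete\n";
}
```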

The pros and cons of BitTorrent still apply here, since we do not make any core changes to the protocol.

An example of BitTorrent + web seeding is WebTorrent, which uses it to stream a video as a demo on their site (WebTorrent is an implementation of BitTorrent for the web, and it's open source too).

libp2p/DAT

Dat is fairly similar to BitTorrent (pairing libp2p with Dat does not make much sense, i.e. Dat is already peer-to-peer). Dat made the idea more secure by using strong hashes for the files and also adds versioning just like git. Dat also uses a DHT.

Dat links are Ed25519 public keys, which are 32 bytes long. For the security part, the links are not made public; if you want to access the content you have to have the public key. During peer discovery only the BLAKE2b hash of the public key is used.

For peer discovery (i.e. bootstrapping), Dat uses DNS-SD and multicast DNS; after bootstrapping it uses the DHT just like the BitTorrent protocol.

I don't know if DAT supports web seeding.

IPFS

IPFS is fairly similar to Dat: it adds security and versioning just like Dat, but the difference is that it completely eliminates peer discovery through DNS-SD, multicast DNS, or trackers. Instead, IPFS depends entirely on the DHT kept alive by the swarm (you need to know a single stable peer to connect to in order to join the swarm). So once the file is not seeded by any peer, the file goes into the void.

That's why Filecoin was invented as the second stage of their vision of a truly decentralised web: using Filecoin, users can get money for seeding a file, just like Bitcoin miners make Bitcoin more secure by mining (they say this in their video on YouTube: https://www.youtube.com/watch?v=EClPAFPeXIQ).

I know IPFS has a lot of features, but we don't really need that fancy stuff.

EDIT: Filecoin is very similar to Siacoin, which does the same for cloud storage. If you read the YouTube comments on the Filecoin video, users are already disappointed about this idea...

You probably don't know, but this won't be an issue at all. There is an AppImage fan base in China who'd like to provide free IPFS mirroring for AppImages. Also, I am pretty sure you don't need to pay for IPFS traffic if you have your own peers serving the files.

@TheAssassin We can't depend on fans, and they can't hold all the AppImages on their PCs. The seeding has to be done on a per-file basis, so small developers would have to seed their AppImages themselves. I think I would rather ask them to host a private BitTorrent tracker; it won't be that costly, and there won't be much load on the tracker since it's just used as a bootstrapping node.

While talking to @probonopd earlier, I also remembered another reason I don't like Bittorrent. As soon as trackers are involved, users' privacy is broken. This is by design: anybody can basically track who's downloading which files. Other P2P systems have solutions to avoid such problems, which make it way harder to gather such metadata. Privacy is a huge point to me.

Everybody cares about privacy. Using private trackers can give some privacy, but if you want some serious security then we can just use the same solution that Dat uses, i.e. DNS for peer discovery and then the DHT. (I've been considering using DNS, but trackers seemed cheaper and more freely available.)

The daemon in the background is IMO not a problem but actually a big advantage, especially for initial distribution. By sharing files over such a daemon, they're automatically seeded. This means that zsync2 could directly start seeding files in the P2P network, and all it has to do is interact with the IPFS tools. We don't have to manage anything; we just need to tell IPFS to serve the files and get the URI for the file in the IPFS network. We can then put this URI into the .zsync file. This would then be a true P2P solution, without any hacks like web seeding.

I don't really know if this is an advantage: the files are cached on the user's computer and take up space (this all happens without any notification to the user; once they download something it is cached, making IPFS more stable, which only benefits IPFS). And we can't really depend on this: once the nodes holding the file go offline, the file goes into the void with them; it's exactly like a normal torrent without any seeds in the swarm. I would not call web seeding a hack, since it is much more stable than the IPFS approach. Web seeding is BitTorrent Enhancement Proposal 19 (see BEP 19 for the full design of web seeding), and it will be standardized soon.

I'm not sure what you mean with "legal torrents". Torrents aren't illegal at all per se. And regarding the infrastructure, I could just provide a tracker if I wanted to, or we could rent a (really cheap) Hetzner cloud server (see https://www.hetzner.com/cloud?country=us). Might come in handy for some experiments even.

I meant the legal side, because there are so many free BitTorrent trackers associated with illegal content, such as The Pirate Bay, isoHunt, etc. (archive.org is a legal one among them, so we can recommend users to use it).

Oh, today I've been browsing IEEE Xplore a bit, to get the full text of the eBay paper, which I'll be reading later. I also found a couple of other related papers. If there's anything interesting in them, I'll let you know.

I won't stop you, but if the paper costs anything I would not recommend buying it, since it does not explain much about our idea. I would love to see some related papers.

Conclusion

I think the BitTorrent protocol covers everything we need compared to the other two alternatives. Since the BitTorrent protocol is not secure, we can create a hybrid that uses DNS as the peer discovery method instead of trackers (which I plan to do; WebTorrent is a hybrid too).

In our situation we don't need versioning or a filesystem; we just need to balance the load on the server, which BitTorrent with web seeding fits correctly.

IPFS is too good to be true: it requires nodes to seed all the time, and if every node holding a copy of the file disappears, the file goes into the void with them. One way to solve this is to download the file from an HTTP server and then re-introduce it to the swarm, which requires downloading the entire file (thanks to the content integrity checks), and even then the file disappears again once it is no longer popular.

Dat might be a good alternative, but it also introduces new things we don't really need, and it cannot fall back on an HTTP source, which is what we aim for.

In conclusion, we need to balance the load on the server and nothing else; Dat and IPFS give more than what we want, which rules them out. Using web seeding (BEP 19) and the BitTorrent protocol we can use both the peers and the web server, which is novel compared to Dat and IPFS. The security concerns about the BitTorrent protocol can be addressed by creating a new hybrid that uses DNS-SD and multicast DNS for peer discovery. Also, using zsync with the BitTorrent protocol improves security, because every block is checked against checksums from a trusted source, namely the central web server that hosts the zsync meta file (see the sketch below). Thus, to the best of my knowledge, a hybrid of BitTorrent and the zsync algorithm should solve our problem.
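To make that last point concrete, here is a purely illustrative sketch (not zsync2's actual code; FNV-1a stands in for zsync's real strong checksum) of accepting peer-supplied blocks only when they match the checksums from the trusted .zsync metadata:

```cpp
// Illustrative only: blocks received from untrusted peers are accepted only if
// they match the checksum list taken from the .zsync file fetched from the
// trusted server.
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Stand-in strong checksum (64-bit FNV-1a); real zsync uses rolling + strong sums.
uint64_t strongChecksum(const std::vector<uint8_t> &block) {
    uint64_t h = 1469598103934665603ull;
    for (uint8_t b : block) {
        h ^= b;
        h *= 1099511628211ull;
    }
    return h;
}

bool acceptBlock(std::size_t blockIndex,
                 const std::vector<uint8_t> &blockData,
                 const std::unordered_map<std::size_t, uint64_t> &trustedSums) {
    auto it = trustedSums.find(blockIndex);
    // Reject any block whose checksum does not match the trusted metadata.
    return it != trustedSums.end() && strongChecksum(blockData) == it->second;
}
```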

probonopd commented 6 years ago

cc @pfrazee - interested in your thoughts regarding the above.

DAT uses DNS-SD and DNS-Multicast

Isn't that alone a huge plus? Imagine a school or university, where everyone wants to have the latest OpenOffice AppImage on the day it is released. One user downloads it from the web. All other users on the LAN will discover, using DNS-SD and DNS-Multicast, that they can get it from a local peer on the LAN rather than from the slow and costly WAN.

pfrazee commented 6 years ago

@probonopd Yeah the discovery-code is in a bit of flux right now. We stopped using the Bittorrent DHT because results were just awful, but we're going to replace it with our own DHT. We currently use a tracker server that uses some custom DNS messages (also temporary, will be replaced with the DHT). The one permanent thing is our use of multicast DNS, which yes has the benefit you describe.

Regarding the seeding reliability, we have "public peer" software (homebase, hashbase) and a spec for interacting with those with a standardized API (pinning API). Public peer services are the solution we're using for Dat uptime, rather than a coin like IPFS.

probonopd commented 6 years ago

Thanks @pfrazee for chiming in here. Somehow my gut feeling seems to tell me that DAT is more suitable than Bittorrent for distributing AppImages but I haven't really understood the different p2p technologies well enough yet to really nail it down in a couple of bullet points.

antony-jr commented 6 years ago

@probonopd I would definitely recommend Dat if it supported web seeding (it's the only missing piece).

I think it will give DAT a huge boost.

antony-jr commented 6 years ago

@probonopd On the other hand, if you want to try BitTorrent with web seeding in action, try this: I created a torrent for the Insight GDB Debugger AppImage. Currently it does not have any seeds (I'm not seeding either), but you can still get the file, and if any other peers are detected the web server is not used. The GitHub release is used for web seeding (https://github.com/antony-jr/insight/releases/download/continuous/Insight-continuous-x86_64.AppImage).

Make sure to use qbittorrent or any torrent client which supports web seeding.

The torrent file is directly published in the releases(Yes it does use a public tracker) -> https://github.com/antony-jr/insight/releases/download/continuous/Insight-continuous-x86_64.AppImage.torrent

EDIT : This is so cheap that we can embed this into our current system.

probonopd commented 6 years ago

Why is web seeding important if we embed the p2p mechanism in, say, appimaged or app center apps?

TheAssassin commented 6 years ago

It's quite obvious that, first of all, not everyone wants to be permanently seeding AppImages, a.k.a. listing all their AppImages in a public location everybody can query (data protection etc.), and second, there are times at which nobody might seed (or only some people with a bad connection who might then no longer be able to use the Internet).

antony-jr commented 6 years ago

It's quite obvious that, first of all, not everyone wants to be permanently seeding AppImages, a.k.a. listing all their AppImages in a public location everybody can query (data protection etc.), and second, there are times at which nobody might seed (or only some people with a bad connection who might then no longer be able to use the Internet).

@TheAssassin Just took the words out of my mouth.

@probonopd You should know that anyone can create a new protocol X, but can it exist alongside our current model, which is efficient? Yes, our model does have a single point of failure, but that is a worst-case scenario that does not happen often, compared to a seedless P2P swarm, which occurs quite often. Even the new protocols (Dat and IPFS) use HTTP to connect to the rest of the world, which shows that these protocols are not as widely used as HTTP. Even their downloads go through HTTP, and they don't have a single mechanism to work alongside the current model. Combining the two protocols gives us a new approach in which each covers the other's quirks.

In conclusion, we just need to make the BitTorrent protocol more secure, and that's not very hard.

EDIT: We can also use DAT and introduce web-seeding as a DEP.

pfrazee commented 6 years ago

Even the new protocols (Dat and IPFS) use HTTP to connect to the rest of the world, which shows that these protocols are not as widely used as HTTP.

I mean, we use HTTP only for the Pinning API, and we do that because it's the best fit for that use-case. It's also true that Dat is less widely used than HTTP, but there's no causal relationship there.

I'm fine with whatever you choose to do, but there's no implication in our use of HTTP. Dat doesn't replace HTTP for all use-cases and doesn't try to.

TheAssassin commented 6 years ago

I don't think it's much of an issue to require people to set up servers running Dat to keep their files alive. The only advantage of bittorrent is that someone wrote a web seeding protocol that fetches blocks with range requests. Other than that, I believe Dat is a much superior solution.

But after all, it's up to the person who implements the feature to make design decisions. @antony-jr wants to implement a Bittorrent-based approach, so we should take this seriously, wait for him to provide us with some code, then discuss this code and, if it works for us, merge it eventually. We'll see whether people will use it then. We just have to make sure to inform users about the advantages and disadvantages, as well as the privacy-related issues of Bittorrent.

antony-jr commented 6 years ago

@TheAssassin Let's see how the end result comes out; I'm confident that the new hybrid can solve our current problem without compromising security.

TheAssassin commented 6 years ago

This is not a security but a privacy issue, @antony-jr. Don't mix those two terms up. The flaw is in the Bittorrent design. I don't think you can fix it client-side.

antony-jr commented 6 years ago

The flaw is in the Bittorrent design. I don't think you can fix it client-side.

Who said that the change is only on the client side? You will see once it's done...

TheAssassin commented 6 years ago

Because, by definition, it's a peer-to-peer network, i.e., there are only clients in the network (aside from the HTTP web seed stuff). As soon as somebody connects to a tracker, their privacy is compromised. This is, however, a necessary step. You could perhaps improve some things by hosting our own tracker; in trackerless (DHT) mode, if we don't share IPs, at least the tracker won't leak data (well, as secure as it can get). But the DHT is in the end only a list of IPs associated with specific files. If I downloaded the entire DHT for a single torrent, I could easily tell who is in possession of which file at the moment. Repeat that for every torrent that is out there, and you can probably generate a list of AppImages a user has on their computer, especially if @probonopd continues with his idea to put this into appimaged.

antony-jr commented 6 years ago

You are free to do anything you like (and Dat is a good solution if you want to go full-blown P2P; that is a perfect solution if you are going to embed it in a daemon). I just proposed a possible approach, and it might take a long time since, as I said before, it's worth a scientific paper. So until then I'd like to keep this issue low priority. :+1: I was just thinking about what we can do, and progress is still at 0%. I would like you to look into other possibilities too... I will reopen this when the project is stable enough for AppImage.

Thank you for your time :hourglass: :heart: !

probonopd commented 6 years ago

@antony-jr if you have some spare time, why not experiment a bit with the technologies and make some proof-of-concepts... would be awesome!

antony-jr commented 6 years ago

I will try my best to do that...

antony-jr commented 4 years ago

CC: @probonopd @TheAssassin

As discussed earlier at IRC, here is the POC on updating AppImages via Bittorrent protocol -> https://youtu.be/Sqq3AeiFh3U

The GUI application shown in the video is only made for testing.

In the video I update an outdated AppImage. The new version has a .torrent file, so the updater updates (with permission from the user) via BitTorrent. I used qBittorrent to demonstrate that we are indeed sharing data with peers.

Since the updater uses the zsync algorithm, the data synced from the old file is also shared with all peers. You can see in the video that qBittorrent downloads 95% of the file once we start getting the required blocks over BitTorrent.

You can also see in the video that we only seed as long as we download the update.

qBittorrent alone will not be able to download the AppImage, because the uploaded torrent file does not contain web seeds; only the updater finds the link to the target file mentioned in the zsync file and uses it as the web seed. So once a peer has data, it shares it with other peers, which is why qBittorrent was able to download once the updater started.
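In other words, something along these lines happens inside the updater (just a sketch of the idea, not the actual updater code; the helper names are made up):

```cpp
// Sketch of the idea: the published .torrent has no web seeds; the updater reads
// the "URL:" header from the .zsync file and attaches it as a web seed at runtime.
#include <libtorrent/torrent_handle.hpp>
#include <fstream>
#include <string>

namespace lt = libtorrent;

// Hypothetical helper: extract the "URL:" value from a .zsync control file header.
std::string targetUrlFromZsyncFile(const std::string &zsyncPath) {
    std::ifstream in(zsyncPath);
    std::string line;
    while (std::getline(in, line) && !line.empty()) {  // header ends at a blank line
        if (line.rfind("URL: ", 0) == 0)
            return line.substr(5);
    }
    return {};
}

// Hypothetical helper: attach the HTTP(S) target file as a BEP 19 web seed.
void attachWebSeed(lt::torrent_handle &handle, const std::string &zsyncPath) {
    const std::string url = targetUrlFromZsyncFile(zsyncPath);
    if (!url.empty())
        handle.add_url_seed(url);  // known only to the updater, not to plain clients
}
```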

Link to 2.x release (Pre-Alpha): https://github.com/antony-jr/AppImageUpdater/releases/tag/continuous-2.x

Link to Test AppImage(Old version): https://github.com/antony-jr/ShareMyHost/releases/tag/1

The new version has the torrent file uploaded to the releases. If you want to test it out like I did in the video then here is the link to the torrent file of the latest version of the test appimage -> https://github.com/antony-jr/ShareMyHost/releases/download/continuous/ShareMyHost-a0fb13b-x86_64.AppImage.torrent

probonopd commented 4 years ago

This is very exciting @antony-jr. When I tried, it asked me about Bittorrent, I said yes, but then got

[screenshot of the error]

I am not sure whether it was actually downloading using Bittorrent or not. It would be nice if one could see this during the download process.

Update: When I tried a second time, it worked.

Conceptual question: If users are seeding only during the short period of time when they are updating, will there be enough seeders? I would assume that for this to add real benefit, the updater would have to keep seeding for some time (with the permission of the user) after the update has completed?

antony-jr commented 4 years ago

I am not sure whether it was actually downloading using Bittorrent or not. It would be nice if one could see this during the download process.

It's a test GUI; there will be improvements later. Let's only talk about the BitTorrent usage, to stay on topic with the issue.

If users are seeding only during the short period of time when they are updating, will there be enough seeders? I would assume that for this to add real benefit, the updater would have to keep seeding for some time (with the permission of the user) after the update has completed?

If you have a lot of users updating, then I would say there would be enough seeders. Even if there are no seeders, we use the HTTP target file as a web seed. If you really want users to have the ability to keep seeding, then yes, I think we can come up with something.

I posted the results in this issue to finally show that decentralized updates are economically possible using the BitTorrent protocol. We can also easily integrate this into our current ecosystem, and it does not break the current zsync update format.

Also see https://github.com/antony-jr/MakeAppImageTorrent, which is used to create the torrent files to upload with new releases.
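For reference, creating such a torrent with libtorrent-rasterbar boils down to something like this (a sketch, not MakeAppImageTorrent's actual code; the tracker URL and file names are placeholders):

```cpp
// Sketch: build a .torrent for an AppImage release with a bootstrap tracker.
#include <libtorrent/bencode.hpp>
#include <libtorrent/create_torrent.hpp>
#include <fstream>
#include <iterator>
#include <vector>

namespace lt = libtorrent;

int main() {
    lt::file_storage fs;
    lt::add_files(fs, "MyApp-x86_64.AppImage");

    lt::create_torrent t(fs);
    // The tracker is only needed for bootstrapping into the DHT/swarm.
    t.add_tracker("udp://tracker.example.org:1337/announce");

    // Hash the file's pieces; the second argument is the parent directory.
    lt::set_piece_hashes(t, ".");

    std::vector<char> buf;
    lt::bencode(std::back_inserter(buf), t.generate());

    std::ofstream out("MyApp-x86_64.AppImage.torrent", std::ios::binary);
    out.write(buf.data(), buf.size());
}
```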

probonopd commented 3 years ago

Hi @antony-jr, I think this is great and, as long as it is opt-in, could see wide usage.

Why opt-in?

  1. Not all AppImages may contain freely shareable open source software
  2. Not all users may want to use their bandwidth (e.g., mobile users not having a flatrate), expose their IP address, and/or reveal which AppImages they have

How can we best make it widely used with opt-in?