NixOS / nix

Nix, the purely functional package manager
https://nixos.org/
GNU Lesser General Public License v2.1

Nix and IPFS #859

Open vcunat opened 8 years ago

vcunat commented 8 years ago

(I wanted to split this thread from https://github.com/NixOS/nix/issues/296#issuecomment-200603550 .)

Let's discuss relations with IPFS here. As I see it, the main thing that would be appreciated is a decentralized way to distribute nix-stored data.

What we might start with

The easiest usable step might be to allow distribution of fixed-output derivations over IPFS. Those are paths that are already content-addressed, typically by a (truncated) sha256 over either a flat file or a tar-like dump of a directory tree; more details are in the docs. These paths are mainly used for compressed tarballs of sources. This step by itself should avoid lots of problems with unstable upstream downloads, assuming we could convince enough nixers to serve their files over IPFS.
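
For illustration, a fixed-output fetch looks roughly like this (a sketch only; the sha256 is a placeholder, not a real hash). Nix only checks that whatever gets downloaded matches the declared hash, so the bytes could just as well arrive over IPFS:

# Sketch: fetchurl produces a fixed-output derivation;
# the sha256 below is a placeholder, not the real hash of this tarball.
src = fetchurl {
  url = "https://ftp.gnu.org/gnu/hello/hello-2.10.tar.gz";
  sha256 = "0000000000000000000000000000000000000000000000000000";
};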

Converting hashes

One of the difficulties is that we use different kinds of hashing than IPFS does, and I don't think it would be good to require converting the many thousands of hashes in our expressions. (Note that it's infeasible to convert between those hashes unless you have the whole content.) IPFS people might best suggest how to work around this. I imagine we want to "serve" a mapping from the hashes we use to IPFS's hashes, perhaps realized through IPNS. (I don't know the details of IPFS's design, I'm afraid.) One advantage is that the nix-style hash can easily be verified at the end, after the paths have been obtained by whatever means.
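
Purely as a sketch (all hashes below are made up), such a mapping could be as small as a published attribute set from our fixed-output hashes to IPFS paths, with an IPNS name acting as the mutable pointer to its latest version:

# Hypothetical mapping, published e.g. under an IPNS name; hashes are made up.
{
  "sha256:0a1b2c..." = "/ipfs/QmSourceTarball1...";
  "sha256:3d4e5f..." = "/ipfs/QmSourceTarball2...";
}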

Non-fixed content

If we get that far, it shouldn't be too hard to manage distributing everything via IPFS, as for all other derivations we use something we could call indirect content addressing. To explain that, let's look at how we distribute binaries now – our binary caches. We hash the build recipe, including all its recipe dependencies, and we inspect the corresponding narinfo URL on cache.nixos.org. If our build farm has built that recipe, that file contains various information, mainly the content hashes of the resulting outputs of that build and crypto-signatures over them.
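
For reference, a .narinfo is a small text file of roughly this shape (values abbreviated/made up here):

StorePath: /nix/store/abc123...-hello-2.10
URL: nar/def456....nar.xz
Compression: xz
FileHash: sha256:def456...
FileSize: 41234
NarHash: sha256:789abc...
NarSize: 205824
References: abc123...-hello-2.10 xyz789...-glibc-2.24
Sig: cache.nixos.org-1:base64signature...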

Note that this narinfo step just converts our problem to the previous fixed-output case, and the conversion itself seems very reminiscent of IPNS.

Deduplication

Note that nix-built stuff has significantly greater than usual potential for chunk-level deduplication. Very often we rebuild a package only because something in a dependency has changed, so only very minor changes are expected in the results, mainly just exchanging the references to runtime dependencies as their paths have changed. (On rare occasions even the lengths of the paths would change.) There's great potential to save on that during distribution of binaries, which would be exploited by implementing the section above, and even potential for saving disk space in comparison to our way of hardlinking equal files (the next paragraph).

Saving disk space

Another use might be to actually store the files in an FS similar to what IPFS uses. That seems a more complex and tricky thing to deploy; e.g., I'm not sure anyone trusts the implementation of that FS enough yet to have the whole OS running off it.

It's probably premature to speculate too much on this use ATM; I'll just note that I can imagine having symlinks from /nix/store/foo to /ipfs/*, representing the locally trusted version of that path. (That works around the problems related to making /nix/store/foo content-addressed.) Perhaps it could start as a per-path opt-in, so one could move only the less vital paths out of /nix/store itself.


I can personally help with bridging the two communities in my spare time. Not too long ago, I spent many months researching various ways to handle "highly redundant" data, mainly from the point of view of theoretical computer science.

knupfer commented 7 years ago

Well, the question is how much redundancy is needed. The somewhat-guaranteed last resort would be the plain URL, for example https://ftp.gnu.org/gnu/hello/hello-2.10.tar.gz. But considering that there are generous people with a lot of disk space and that there are more than 1000 nix users, I think there won't be any issue.

If we're talking only about source, I'd guess a TB.

arianvp commented 7 years ago

Are people still working on this? It sounds interesting

vcunat commented 7 years ago

I'm not aware. I've been a bit overloaded lately and thus neglecting "larger" issues.

CMCDragonkai commented 7 years ago

Yes, see https://github.com/MatrixAI/Forge-Package-Archiving

We are currently working on deep integration with IPFS (reading about libp2p). That is, we need something closer to the storage than the HTTP API.


mguentner commented 7 years ago

Please have a look at #1167 and let me know what you think. It adds IPFS support to binary cache generation. When a binary cache is generated (nix copy), each .nar is added to IPFS and the resulting hash is written into the corresponding .narinfo. When the .narinfo is retrieved, a signed IPFSHash will be found, and IPFS can be used instead of downloading the .nar from the same cache.
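
Conceptually (the exact field name and format are whatever #1167 defines), the .narinfo would then carry one extra line, covered by the signature, pointing at the NAR inside IPFS, along the lines of:

IPFSHash: QmExampleHashOfTheNarFile...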

vcunat commented 7 years ago

@mguentner: I wondered why you decided to add *.nar files into IPFS. I would find it much more practical to add the /nix/store/xxx subtree as it is, because that would be (almost) directly usable when mounted at /ipfs/. (The only remaining step is to add a symlink /nix/store/xxx -> /ipfs/yyy.)

mguentner commented 7 years ago

@vcunat: Currently the unixfs implementation of IPFS lacks an execute bit, which is quite useful for the store, so I opted for .nar distribution until the new unixfs implementation (using IPLD) is done. Then IPFS contents can be symlinked/bind-mounted into the store as you describe. However, this requires ipfs to be running on the system, while the .nar method also works through a gateway. While the concept of an almost fully decentralized distribution is awesome, it requires that each instance of Nix(OS) also runs an IPFS daemon, which not only impacts the memory footprint but is also a security concern, among other things. Don't get me wrong, I really like the idea of using IPFS at the FS level, but for some use cases it might not be the ideal choice.

Basically there are two scenarios.

Scenario A:

[Machine 1] ----|
[Machine 2] ----|---------HTTP--------[ IPFS Gateway ] -------- IPFS
[Machine 3] ----|

Scenario B:

[Machine 1] ----|
[Machine 2] ----|---------IPFS
[Machine 3] ----|

In A, a local IPFS gateway fetches/distributes content, and local Nix(OS) machines fetch their content from this gateway over HTTP. This gateway is not necessarily a dedicated machine but can also be some form of container (e.g. nixos-container). You just need to manage IPFS on the gateway, e.g. setting storage and networking quotas and limiting the resources IPFS uses (memory, CPU, IO). The distribution method should be uncompressed .nar files.
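
A minimal sketch of the gateway side of Scenario A, assuming the NixOS services.ipfs module (option names may differ between releases):

# Sketch only: run the IPFS daemon on the gateway machine/container and expose
# its HTTP gateway to the LAN; the Nix(OS) machines then fetch over plain HTTP.
{ config, pkgs, ... }:
{
  services.ipfs = {
    enable = true;
    gatewayAddress = "/ip4/0.0.0.0/tcp/8080";   # serves /ipfs/<hash> over HTTP
  };
  networking.firewall.allowedTCPPorts = [ 8080 4001 ];   # gateway + swarm port
}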

In B you need to manage IPFS on all machines with the upside that IPFS can be used at the FS level, i.e. mounting /ipfs to /nix/store.

A is better suited for laptops and servers since your machine will not start distributing files when you don't want it to. B is nice for desktops and/or machines where IO and bandwidth can be donated.

We should focus on a distribution of uncompressed nar files using IPFS and later on directly symlinking/mounting IPFS contents to /nix/store.

vcunat commented 7 years ago

Gateways

I really like the idea of gateways, and *.nar is a very good fit there. For now it truly seems better if most NixOS instances don't serve their /nix/store directly and instead upload custom-built stuff to some gateway(s). People could contribute by:

Together this ecosystem might (soon) offer some properties that we don't have with our current solution (centralized farm + standard CDN).

mguentner commented 7 years ago

@vcunat Have a look: https://github.com/mguentner/nix-ipfs/blob/master/ipfs-gateway.nix This gateway currently accepts all requests to /ipfs, while this is the original config that is whitelist-only: https://github.com/mguentner/nix-ipfs-gateway/blob/master/containers.nix The config still lacks the means to compile the whitelist in a sane way (i.e. checking for duplicates, including older hashes that are not in the latest binary cache, etc.). This script could be extended for that: https://github.com/mguentner/nix-ipfs/blob/master/ipfs-mirror-push.py

mguentner commented 7 years ago

@vcunat And I really like your idea of distributing/decentralizing the actual build process. The most critical part here is the web of trust, which is currently missing in Nix(OS). Other package managers have integrated GPG, and each package is signed by its respective maintainer (Arch comes to mind).

All this could possibly be achieved using the IPFS ecosystem. Have you looked at IPLD yet?

vcunat commented 7 years ago

Currently: our build farm signs the results and publishes that within the *.narinfo files; nix.conf then contains a list of trusted keys in binary-cache-public-keys.
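
Concretely, that corresponds to nix.conf entries like these (old-style option names, as mentioned above; the key is the published cache.nixos.org signing key):

# Sketch of the relevant nix.conf settings (Nix 1.x option names).
binary-caches = https://cache.nixos.org/
binary-cache-public-keys = cache.nixos.org-1:6NCHdD59X431o0gWypbMrAURkbJ16ZPMQFGspcDShjY=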

With IPFS: I don't remember the details from studying IPFS anymore :-) but I remember IPNS seemed the best fit for publishing the mapping: signing key + derivation hash -> output hashes (+ signature).

cleverca22 commented 7 years ago

My old idea for IPFS+Nix was to store whole NAR files in IPFS, and to use https://github.com/taktoa/narfuse to turn a directory of NAR files into a /nix/store.

then the IPFS daemon can be started/stopped, and serve the raw nar files as-is

but you would need a mapping from store path (hash of build scripts) to IPFS path (multihash of the NAR)

The main downside to this plan was that it had to store the entire NAR uncompressed in the IPFS system, and on the end users' systems, though normal users pay the same cost once it's unpacked to /nix/store.

CMCDragonkai commented 7 years ago

The mapping problem is also an issue for forge package archiving. In this case we would like to map arbitrary upstream source hashes to the IPFS path. We're hoping to do this without the need for a separate moving part, for example if there were a way to embed extra hashes into an IPFS object. But are there other ways?

Ericson2314 commented 7 years ago

@CMCDragonkai https://github.com/ipld/cid would be exactly what you want I think, but that spec sadly seems to be stalled. The basic idea is allowing IPFS-links to point to more things than IPFS-objects as long as the "pointing" is via content-addressing.

cleverca22 commented 7 years ago

The original idea I had to solve the mapping problem was for Hydra to multihash every NAR and include that in the .narinfo file, but to leave the "ipfs add" as an optional step anybody can do to contribute bandwidth.

The main downside is that you still need cache.nixos.org for the narinfo files; it just stops being a bandwidth issue.

CMCDragonkai commented 7 years ago

The Haskell code we have currently streams multihashes of HTTP resources, so that could be integrated into Hydra. But the CID project looks interesting; we will check it out in depth soon.

mguentner commented 7 years ago

@cleverca22 Nice idea with narfuse! That solves the problem of duplicate storage. If you leave the ipfs add step optional, you still need some authority that does the mapping between .nar hashes and IPFS hashes. A user that does ipfs add still needs to inform other users that the .nar is now available under that IPFS hash.

Just an idea how it could work (the code is already finished for that, see #1167): (That is Scenario A in https://github.com/NixOS/nix/issues/859#issuecomment-269922805)

A Hydra will build a jobset (e.g. nixpkgs), create a binary cache afterwards, and publish the resulting IPFS hashes to a set of initial IPFS nodes (initial seeders, in BitTorrent language). These seeders will download everything from the Hydra, and once this is finished the Hydra can (in theory) stop distributing that jobset, since from this moment the initial seeders and everyone else running IPFS on their Nix(OS) machine will start distributing. Have a look at this script, which is a basic implementation of what I describe.

How to distribute the .narinfo files is open for debate. Either use the traditional HTTP method (a .narinfo hardly generates any traffic) or also put the information inside some IPFS/IPLD structure.

The upside of distributing using HTTP is that there is a single authority that does the mapping between .nar files and IPFS hashes and no IPFS daemon needs to be installed on the "client" side since .nar files can also be fetched using a gateway (e.g. one of the initial seeders, some local machine or the one running @ https://ipfs.io).

I am confident that IPFS could revolutionize the way we distribute things, but I don't consider it mature enough to be running on every machine out there. We need to find pragmatic solutions and come up with some sort of road map for Nix and IPFS. Starting to distribute .nar files using IPFS could be the first step, mapping .nar files from a mounted IPFS to /nix/store the second, making all sources (fetchgit, fetchFromGithub) available through IPFS the third (what @knupfer started), and the utopia of building everything from IPFS to IPFS the last one. :rocket: :arrow_right: :new_moon:

nbp commented 7 years ago

Apart from the fact that /nix/store can contain files which are not safe for sharing because of other issues, I want to raise security concerns about any P2P & Nix integration.

The biggest issue here is how to guarantee the anonymity of both peers. To highlight the issues, let's suppose we have 2 peers, Alice (A) and Bob (B) as usual, and that A requests a package P from B.

In both cases we might think that the issues can be avoided by faking whether or not we have a package P, by forwarding the content from someone else. But this suffers from timing attacks and might increase the DoS surface.

What these examples highlight is that we either need to trust the peers, or we need to provide anonymity between the peers, such that neither A nor B knows the IP of the other.

mguentner commented 7 years ago

@nbp Thanks for mentioning this!

That is very true and will be something that needs to be addressed once it makes sense to run an IPFS daemon on the same system that requests the /nix/store paths using IPFS (as nar or by directly pinning them).

For now, IPFS itself is the biggest security concern on a system, followed by the information about the system that it potentially leaks.

However, currently every NixOS user who uses https://cache.nixos.org leaks information about the installed versions to a central entity (Amazon CloudFront) and to all systems in between (via file sizes).

It depends on your scenario, but running a local IPFS gateway might even improve security by reducing the ability to fingerprint your system, since many Nix installations potentially share this gateway. But that's just guesswork, plus the security is based partly on obscurity :)

cleverca22 commented 7 years ago

@nbp another factor to consider, is that IPFS will advertise the multi-hash of every object you can serve

even if you never advertise locally built things with secrets like users-groups.json, you are still going to advertise that you have a copy of hello-2.10 built with a given nixpkgs, and then an attacker could make use of that

knupfer commented 7 years ago

You could serve only store paths which could be garbage collected. So you'll only leak information when you download from IPFS, but not by serving.

cleverca22 commented 7 years ago

but now you will never contribute bandwidth towards current build-products, only out of date things

knupfer commented 7 years ago

Or build products which you've uninstalled, or brought into your system only via nix-shell

cleverca22 commented 7 years ago

yeah, that would limit its usefulness while giving security; feels more like something the end user should decide on via a config option

knupfer commented 7 years ago

Agreed. Don't forget that newer versions of sources normally have a lot of untouched files, so it would even help with old garbage (this is obviously less often the case with binaries).

cleverca22 commented 7 years ago

The main issue I can spot with adding raw uncompressed NARs to the IPFS network is the lack of compression and the lack of file-level dedup within the NAR, but the IPLD stuff I've heard about could add the NAR in file-sized chunks, inspecting the contents of the NAR as it goes, at the cost of having a different hash from a plain "ipfs add".

vcunat commented 7 years ago

I think you do get file-level dedup within and across NARs, as IPFS is supposed to do chunking based on content IIRC.

Ericson2314 commented 7 years ago

@copumpkin's comment https://github.com/NixOS/nix/issues/520#issuecomment-275666718 sketching a possible implementation of non-deterministic dependencies shares a lot of characteristics with IPNS.

equalunique commented 7 years ago

Will this IPFS enhancement help the Nix community overcome another AWS S3 outage? (Like the one which just happened recently)

CMCDragonkai commented 7 years ago

As long as the IPFS nodes have their content hosted outside S3.


CMCDragonkai commented 6 years ago

I'm wondering if the new Nix 2.0 store abstraction would help with adding an IPFS store.

vcunat commented 6 years ago

For reference, the experiments around https://github.com/NixIPFS found that IPFS isn't able to offer reasonable performance for the CDN part, at least not yet.

CMCDragonkai commented 6 years ago

Are there benchmarks?


vcunat commented 6 years ago

I don't remember any definite results, except that it wasn't usable. @mguentner might remember more.

mguentner commented 6 years ago

@CMCDragonkai No runnable benchmark, just personal experience. Here you can read about the last deployment:

https://github.com/NixIPFS/infrastructure/blob/master/ipfs_mirror/logbook_20170311.txt

I have no idea how IPFS behaves currently, but I assume that the DHT management traffic is still a problem. Without a DHT you have to manually connect the instances. Please note that IPFS itself works fine for smaller datasets (<= 1 GiB) but does not compare well against old-timers like rsync (which we used in a second deployment of nixipfs-scripts).

davidak commented 6 years ago

@whyrusleeping is aware of these things

He wrote in some issue at the end of 2017:

In general, with each update we've had improvements that reduce bandwidth consumption.

So it might already be "usable" for this use case?

It is still not fixed completely. Here are some related issues to follow.

https://github.com/ipfs/go-ipfs/issues/2828 https://github.com/ipfs/go-ipfs/issues/3429 https://github.com/ipfs/go-ipfs/issues/3065

parkan commented 5 years ago

would love to revive this, anyone on the nix side actively involved as of now?

davidak commented 5 years ago

@parkan I don't think so. The linked IPFS issues in my last comment are still open, so we have to wait for fixes (or get involved there and help resolve them).

vcunat commented 5 years ago

@parkan: as written, there were severe performance problems with IPFS for our use case. I haven't heard of them being addressed, but I haven't been watching IPFS changes...

parkan commented 5 years ago

gotcha, thanks for the TLDR 😄

there's ongoing work on improving DHT performance, but the most effective approach will likely involve non-DHT based content routing -- I'll review the work in @NixIPFS to see if there's anything obvious we can do today

are there stats on things like total number of installed machines, cached binaries, etc somewhere?

vcunat commented 5 years ago

@parkan: there's a list of binary packages for a single snapshot (~70k of them). We have that amount roughly thrice at a single moment (stable + unstable + staging), and we probably don't want to abandon older snapshots before a few weeks/months have passed, though subsequent snapshots will usually share most of the count (and size). Overall I'd guess it might be on the order of hundreds of gigabytes of data to keep up at once (maybe a couple terabytes, I don't know).

I suppose the publishing rate of new data in GB/day would be interesting for this purpose (DHT write throughput), but I don't know how to get that data easily. And also the "download" traffic: I expect there will be large amounts, given that a single system update can easily cause hundreds of MB in downloads from the CDN, and GitHub shows roughly a thousand unique visitors to the repo each day (even though by default you download source snapshots from the CDN directly instead of via git).

I'm sure I did see some stats at a NixCon, but I can't find them, and the numbers might have doubled by now. @AmineChikhaoui: any idea if it's easy to get similar stats from Amazon, or who could know/do that?

mguentner commented 5 years ago

@parkan

The project is dead at the moment because no one showed interest. I decided that I won't force something if the community is happy with the AW$ solution.

The @NixIPFS project was also an attempt to free the NixOS project from the AW$ dependency, which seemed really silly and naive to me.

Since a simple rsync mirror already fulfills that requirement, I went ahead with that. However, I found nobody who wanted to commit server(s) and time. The idea would have been a setup with mirrorbits (made redundant with Redis Sentinel) and optional geo-DNS. Old issue

Ping me if you need assistance.

Warbo commented 5 years ago

I appreciate the scaling issues with serving NARs, etc. over IPFS, but it looks like this "full-blown" approach has derailed the orthogonal issue of making external sources more reliable (described under "What we might start with" in the first comment).

I've certainly encountered things like URLs and repos disappearing (e.g. disappearing TeXLive packages, people deleting their GitHub repos/accounts after the Microsoft acquisition, etc.), which has required me to search the Web for the new location (if it even exists elsewhere...) and alter "finished" projects to point at these new locations. This is especially frustrating for things like reproducible scientific experiments, where experimental results are tagged with a particular revision of the code, but that revision no longer works (even with everything pinned) because the old URLs no longer resolve.

As far as I can see, there are two problems that look like low-hanging fruit:

The first is to make a fetchFromIPFS function which doesn't require hardcoding a HTTP gateway. This could be as simple as e.g.

# Proposed helper: fetch a fixed-output file via an IPFS HTTP gateway.
fetchFromIPFS = { contentHash, sha256 }: fetchurl {
  inherit sha256;
  url = "https://ipfs.io/ipfs/${contentHash}";
};

This prevents having HTTP gateways scattered all over Nix files, and allows a future implementation to e.g. look for a local IPFS node, which would (a) remove the gateway dependency, (b) use the local IPFS cache and (c) have access to private nodes e.g. on a LAN.

The second issue is that personally, I would like to use a set of sources, a bit like metalink files or magnet links. The reason is that upstream HTTP URLs might be unreliable, but so might IPFS! At the moment, fixed-output derivations offer a false dichotomy: we must trust one source (except for the hardcoded mirrors.nix), so we can either hope that upstream works or force ourselves to reliably host things forever (whether through IPFS or otherwise). Whilst I don't trust upstreams to not disappear, I trust my own hosting ability even less!

I'm not sure how this would work internally, but I would love the ability to say e.g.

src = fetchAny [
  (fetchFromIPFS { inherit sha256; contentHash = "abc"; })
  (fetchurl { inherit sha256; url = http://example.org/library-1.0.tar.lz; })
  (fetchurl { inherit sha256; url = http://chriswarbo.net/files/library-1.0.tar.lz; })
];

The same goes for other fetching mechanisms too, e.g.

src = fetchAny [
  (fetchFromGitHub { inherit rev sha256; owner = "Warbo"; repo = "..."; })
  (fetchgit { inherit rev sha256; url = http://chriswarbo.net/git/...; })
  (fetchFromIPFS { inherit sha256; contentHash = "..."; })  # I also mirror repos to IPFS/IPNS
];
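
As a partial workaround for the plain-download case, nixpkgs' fetchurl already accepts a list of alternative URLs for a single fixed-output hash, so something like the following sketch works without a new fetcher (an IPFS gateway URL is just another mirror here):

# Sketch: several mirrors for the same fixed-output hash; the first one that
# responds and matches the declared sha256 wins.
src = fetchurl {
  inherit sha256;
  urls = [
    "https://ipfs.io/ipfs/${contentHash}"
    "http://example.org/library-1.0.tar.lz"
    "http://chriswarbo.net/files/library-1.0.tar.lz"
  ];
};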

Whilst all of the hash conversion, Hydra integration, etc. discussed in this thread would be nice, simple mechanisms like the above would be a great help to me, at least. I could have a go at writing them myself, if there were consensus that I'm not barking up the wrong tree? ;)

vcunat commented 5 years ago

I don't think it's orthogonal at all. Sources are cached in the CDN as well. (Once in a longer while, IIRC.) EDIT: maybe only fetchurl-based sources ATM, I think, but that's the vast majority and not a blocker anyway, as it's only store paths again. Current example: https://github.com/NixOS/nixpkgs/pull/46202

vcunat commented 5 years ago

I must admit it's difficult to compete with these CDNs, as long as someone pays for/donates them. My systems commonly update at 100 Mb/s with replies in under 5 ms. I'm convinced this WIP has taken lots of effort to get to this stage, but to make it close to the CDN would surely take many times more. I personally am "interested" in this, but it's a matter of priorities, and I've been overloaded with nix stuff that works much worse than the content distribution...

Warbo commented 5 years ago

@vcunat Just to be clear (can't tell if you were replying to me or not) my thoughts above were mostly concerned with custom packages (of which I have a lot ;) ) which have no CDN, etc. rather than "official" things from nixpkgs.

vcunat commented 5 years ago

OK, in this whole issue I've only been considering sources used in the official nixpkgs repository, plus the binaries generated from it by hydra.nixos.org. The ability to seamlessly go beyond that would be nice, but it feels like overstretching my wishlist.

Warbo commented 5 years ago

Whoops, never mind; it looks like https://github.com/NixOS/nixpkgs/tree/master/pkgs/build-support/fetchipfs basically does what I described (fetch from a local IPFS node, with a HTTP fallback)!

CMCDragonkai commented 5 years ago

Just a note: proprietary sources are not cached in the CDN, and I find these tend to break the most. In one instance the source link is not even encoded in nixpkgs (cuDNN) and you're expected to log in to NVIDIA to get them. I did, however, find an automatable link for acquiring cuDNN.

My original goal here was to have transparent IPFS fetching, so you don't need to special-case the fetches; it just works reproducibly, because the first time a fetch is applied, the result gets put into an IPFS node.

AmineChikhaoui commented 5 years ago

@vcunat I think @edolstra has a script that generates a few stats/graphs from the binary cache; if I'm not wrong, the latest was shared at https://nixos.org/~eelco/cache-stats/. I believe it should be possible to generate that again. Is that what you're looking for?