
Nix and IPFS #859

Open vcunat opened 8 years ago

vcunat commented 8 years ago

(I wanted to split this thread from https://github.com/NixOS/nix/issues/296#issuecomment-200603550 .)

Let's discuss the relationship with IPFS here. As I see it, what would be most appreciated is a decentralized way to distribute nix-stored data.

What we might start with

The easiest usable step might be to allow distribution of fixed-output derivations over IPFS. Those are paths that are already content-addressed, typically by a (truncated) sha256 over either a flat file or a tar-like dump of a directory tree; more details are in the docs. These paths are mainly used for compressed source tarballs. This step alone should avoid lots of problems with unstable upstream downloads, assuming we can convince enough nixers to serve their files over IPFS.
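
For readers less familiar with the term, here is a minimal sketch of such a fixed-output fetch as it appears in nixpkgs (the URL and hash are placeholders, not a real package):

# Fixed-output: Nix only checks that the result matches the declared sha256,
# so the bytes may come from any mirror, or, in principle, from IPFS.
with import <nixpkgs> {};
fetchurl {
  url = "https://example.org/foo-1.0.tar.gz";                       # placeholder upstream
  sha256 = "0000000000000000000000000000000000000000000000000000";  # placeholder base32 sha256
}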

Converting hashes

One of the difficulties is that we use different kinds of hashing than IPFS does, and I don't think it would be good to require converting the many thousands of hashes in our expressions. (Note that it's infeasible to convert among those hashes unless you have the whole content.) IPFS people might best suggest how to work around this. I imagine we want to "serve" a mapping from the hashes we use to IPFS's hashes, perhaps realized through IPNS. (I don't know the details of IPFS's design, I'm afraid.) One advantage is that the nix-style hash can easily be verified at the end, no matter how the path was obtained.
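
To make the idea a bit more concrete, one possible shape for such a mapping (purely hypothetical; the key and hashes are placeholders) would be a directory published under an IPNS name, keyed by the nix-style hash, where each entry contains the corresponding IPFS address:

# /ipns/<index-key>/sha256/<nix-base32-hash> would be a tiny file containing "/ipfs/<cid>"
ipfs cat /ipns/<index-key>/sha256/<nix-base32-hash>    # prints the CID to fetch with ipfs get

Whatever the exact shape, the nix-style hash can still be verified after the download, as noted above.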

Non-fixed content

If we get that far, it shouldn't be too hard to manage distributing everything via IPFS, as for all other derivations we use something we could call indirect content addressing. To explain that, let's look at how we distribute binaries now: our binary caches. We hash the build recipe, including all of its recipe dependencies, and we inspect the corresponding narinfo URL on cache.nixos.org. If our build farm has built that recipe, that file contains various information, mainly the hashes of the contents of the build's resulting outputs and cryptographic signatures over them.
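
For concreteness, such a narinfo is a small text file, fetched from https://cache.nixos.org/<hashpart>.narinfo (where <hashpart> is the hash part of the store path), looking roughly like this (all names, hashes and sizes below are placeholders):

StorePath: /nix/store/<hashpart>-hello-2.10
URL: nar/<filehash>.nar.xz
Compression: xz
FileHash: sha256:<hash of the compressed nar>
FileSize: 12345
NarHash: sha256:<hash of the uncompressed nar>
NarSize: 67890
References: <hashpart>-glibc-2.27 <hashpart>-hello-2.10
Sig: cache.nixos.org-1:<signature>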

Note that this narinfo step just converts our problem to the previous fixed-output case, and the conversion itself seems very reminiscent of IPNS.

Deduplication

Note that nix-built artifacts have a significantly greater than usual potential for chunk-level deduplication. Very often we rebuild a package only because something in a dependency has changed, so only very minor changes are expected in the results, mainly just the references to runtime dependencies being exchanged as their paths have changed. (On rare occasions even the lengths of the paths change.) There's great potential to save on that during distribution of binaries, which would be utilized by implementing the section above, and even potential for saving disk space in comparison to our current way of hardlinking equal files (the next paragraph).

Saving disk space

Another use might be to actually store the files in a FS similar to what IPFS uses. That seems a more complex and tricky thing to deploy; e.g. I'm not sure anyone yet trusts the implementation of the FS enough to have the whole OS running off it.

It's probably premature to speculate too much on this use ATM; I'll just note that I can imagine having symlinks from /nix/store/foo to /ipfs/*, representing the locally trusted version of that path. (That works around the problems related to making /nix/store/foo content-addressed.) Perhaps it could start as a per-path opt-in, so one could move only the less vital paths out of /nix/store itself.


I can help personally with bridging the two communities in my spare time. Not too long ago, I spent many months researching various ways to handle "highly redundant" data, mainly from the point of view of theoretical computer science.

vcunat commented 6 years ago

I think that's exactly the link I had seen. It's data until December 2017, but that should still be good enough for a rough picture.

Unfree packages aren't cached as a matter of policy, in some cases even distribution of sources isn't legally allowed by the author. Yes, switching to IPFS would make it possible to decentralize that decision (and the legal responsibility), which might improve the situation from your point of view. But... you can use fetchIPFS for those already ;-) (and convince people to "serve" them via IPFS) – I don't expect anyone would oppose switching the undownloadable ones to fetchIPFS in upstream nixpkgs.

cleverca22 commented 6 years ago

@Warbo https://github.com/NixOS/nixpkgs/blob/082169ab029b4a111309f7d9a795b88e6429222c/pkgs/build-support/fetchurl/default.nix#L38-L43

pkgs.fetchurl already supports a list of URLs and will try each one in order until one returns something, so it's just a matter of generating a call to fetchurl that knows the IPFS hash, the sha256, and the original upstream URL.
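
For illustration, a sketch of what such a generated call could look like (the gateway, CID, upstream URL and hash are placeholders; the urls are tried in order):

fetchurl {
  urls = [
    "https://ipfs.io/ipfs/<cid-of-the-tarball>"      # any IPFS gateway, or a local one
    "https://example.org/releases/foo-1.0.tar.gz"    # original upstream as a fallback
  ];
  sha256 = "<nix sha256 of the tarball>";            # verified no matter which URL answered
}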

Warbo commented 6 years ago

@cleverca22 Wow, now that you point it out it's obvious; I've looked through that code so many times, but the ability to give multiple URLs didn't stick in my mind, maybe because I've not used it (because I forgot it was possible... and so on) :P

I've moved my other thoughts to #2408 since they're not specific to IPFS.

CMCDragonkai commented 5 years ago

Unfree packages aren't cached as a matter of policy, in some cases even distribution of sources isn't legally allowed by the author. Yes, switching to IPFS would make it possible to decentralize that decision (and the legal responsibility), which might improve the situation from your point of view. But... you can use fetchIPFS for those already ;-) (and convince people to "serve" them via IPFS) – I don't expect anyone would oppose switching the undownloadable ones to fetchIPFS in upstream nixpkgs.

I want to also get these standard deep learning weights into Nixpkgs as well: https://github.com/fchollet/deep-learning-models/releases

But they are large fixed-output derivations. Weights effectively are source code now that more and more deep learning applications are coming out, for example libpostal.

Someone on IRC mentioned it shouldn't be cached by Hydra, or something like that. In any case, I want to make use of Nix for scientific reproducibility, and the only way to truly make Nix usable for all of these use cases without bogging down the official Nix caching systems with all our large files is to decentralise the responsibility. So that's another reason IPFS would be important here.

I was wondering if anyone considered Dat?


On another note, I had some work previously involving attempting to get Hydra integrated with IPFS. To do that we had to look more deeply into IPFS functionality, specifically its libp2p framework. We have moved on to other things for now, but we have some knowledge of this particular area. For deeper integration between Nix and IPFS beyond just fetchIPFS, feel free to put up issues in https://github.com/MatrixAI?utf8=%E2%9C%93&q=libp2p&type=&language=.

LnL7 commented 5 years ago

I might be missing something, but I'm not sure what's particularly large about that.

vcunat commented 5 years ago

I see two downloads over 300 MiB each, so perhaps that. (I don't know particulars at all.)

timokau commented 5 years ago

The IPFS team apparently made package managers their top priority for 2019 :tada:

davidak commented 5 years ago

They made a nice blueprint graphic of how people mirror packages, which includes Nix, but they need help with the details. I think nobody has done that except @mguentner, and they also need more general details, like how our "package registry" is git-powered and how build products relate to it. One point is that you don't have to mirror the whole package cache to use a specific version of nixpkgs. You actually don't need it at all: in the worst case you can build the whole system from nixpkgs + source files. https://github.com/ipfs/package-managers/issues/86

They also talked about Nix in the meeting yesterday. https://github.com/ipfs/package-managers/issues/1#issuecomment-525384207

They are also working on performance issues. So we might be able to share packages soon.

(I like their way of organising and outlining use cases and wish we had that too, so we could focus on solving real problems like reproducible environments for universities and HPC, without the struggle it currently is to get started with Nix...)

mguentner commented 5 years ago

Also keep in mind that one unique design assumption of Nix / Hydra is that the resulting binary cache grows without much structure (at least none that I know of). That means each evaluation of a jobset piles on more binaries (.nars + narinfo) without any containment like folders, which makes garbage collection based on closures / jobsets / evaluations quite hard. If a binary cache is stored in a limited environment (not Fastly / S3 / a CDN with attachable storage), you will run into problems. Most of my work over at https://github.com/NixIPFS/nixipfs-scripts is mapping Hydra outputs (.nars + narinfo) to folders using symbolic links. All symbolic links point back into a global store, so if you delete the folder of an evaluation, you only delete symbolic links. To garbage collect the global store you only need to check for files that no longer have any links from the individual folders. The output was first put into IPFS but later synced using rsync for performance reasons.
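
A rough sketch of that garbage-collection idea (this is not the actual nixipfs-scripts code; the directory names are made up):

# evaluations/<id>/ contains symlinks into store/; an object is garbage
# once no evaluation links to it any more.
find evaluations -type l -exec readlink -f {} \; | sort -u > referenced.txt
find "$PWD/store" -type f | sort > present.txt
comm -23 present.txt referenced.txt | xargs -r rm --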

Also note that the approach taken was the naive one, as it only leverages the distributed filesystem-sharing part of IPFS, using a central build service and a central exporter. That's also the reason why you can simply use rsync or even scp for the network transfers. IPFS has many more features, like a pub/sub service and support for linked data structures (see https://github.com/ipld/ipld), which could be really useful for building something that integrates much deeper into the architecture of Nix (if desired).

andir commented 4 years ago

I just noticed that the year has rotated and was a bit let down by the fact that we still don't have any progress on this. (Can't really blame anyone :))

Started reading into what @mguentner said last and found a potentially relevant discussion on Hacker News about IPLD: https://news.ycombinator.com/item?id=13441305 (https://web.archive.org/save/https://news.ycombinator.com/item?id=13441305)

Ericson2314 commented 4 years ago

@andir we'll get to it. The trick is getting both sides to agree on how to hash data to avoid indirection. Steps building up to that would be CA-derivations and hashing data using something common like git tree hashes instead of nar. Guess what I was thinking of (after code quality) with https://github.com/NixOS/nix/pull/3455 ? :)

ohAitch commented 4 years ago

Ideally something still rolling-hash chunked like bup rather than pure git, to be able to store binary outputs without overmuch duplication?

parkan commented 4 years ago

@andir @Ericson2314 I have some news that might be helpful! We've recently launched the IPFS Grant Platform (not announced publicly just yet 🤫), and Nix <> IPFS integration work seems like an ideal grant candidate.

this could take the shape of either a direct grant to someone in the Nix community to work on the problem or a jointly formulated call for proposals from 3rd parties

would this be helpful?

parkan commented 4 years ago

ok, I'm seeing a lot of 🎉 -- who is the best person in the Nix community to chat with to make this happen?

Ericson2314 commented 4 years ago

@parkan Well I don't want to flatter myself as the single best person, but given the work I've been doing on adjacent issues, I'd be happy to kick off the conversation. How should I reach you?

parkan commented 4 years ago

@parkan Well I don't want to flatter myself as the single best person, but given the work I've been doing on adjacent issues, I'd be happy to kick off the conversation. How should I reach you?

dropped an email to the address in your github bio 🙂

nixos-discourse commented 4 years ago

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/obsidian-systems-is-excited-to-bring-ipfs-support-to-nix/7375/1

Ericson2314 commented 4 years ago

https://discourse.nixos.org/t/obsidian-systems-is-excited-to-bring-ipfs-support-to-nix/7375 the fruit of @parkan's and my discussion! [edit I guess the bot beat me to it :)]

Ericson2314 commented 3 years ago

I guess I should keep this thread up to date with the major highlights.

Milestone 1 is released!

kamadorueda commented 3 years ago

Adhoc Nix over IPFS

I've found a way we could make Hydra populate an IPFS binary cache on each build and then allow users to consume it and help with the distribution by serving their local portion/copy of the Hydra IPFS binary cache.

Hydra current state

By default a nix.conf is like:

substituters = https://cache.nixos.org
trusted-public-keys = <content of hydra.public>

This makes Nix look for NAR files on https://cache.nixos.org.

If a NAR file matches and its signature can be verified with any of the trusted-public-keys, it's used as a cache for $ nix-build, $ nix-shell, etc.

Hydra required additions (option 1)

The extra steps we need Hydra to perform are:

Hydra required additions (option 2)

User required additions

# Replace the var $HYDRA_IPNS with the static IPNS hash provided by Hydra

substituters = file:///path/to/ipns/$HYDRA_IPNS https://cache.nixos.org
trusted-public-keys = <content of hydra.public>

That's it! I've tested it and it works; the required steps on the user side are short and effective, and the new Hydra steps should be technically possible.

kevincox commented 3 years ago

I could be wrong, but I think having the entire cache sitting on a single machine is probably infeasible. I think you are right that you could publish via IPNS, but I think you would have to simply add a file entry to the existing directory.

This is technically possible but the last time I looked into the IPFS tools it was a complete mess. It seems that you have the following options:

The last one isn't pretty, but probably required. The better solution is probably to fix up go-ipfs to have a nice interface for that and use it.

zimbatm commented 3 years ago

I wonder how much ram it would take to load all the derivation hashes from the cache.

For IPFS to scale it will be necessary to split the announcement from the actual storage. Basically, have one or more hosts announce the store paths and when they get the request for the actual file, they would retrieve it from the binary cache. A bit like what https://github.com/ipfs/go-ds-s3 does.

kevincox commented 3 years ago

I wonder how much ram it would take to load all the derivation hashes from the cache.

If I had to guess, this is probably feasible, but citation needed. However, you don't even need to do that: with directory sharding and similar techniques you can do partial updates with only a subset of the directory tree locally. (There is no logical tree, but the sharding adds one.) Just make sure you pick a sharding algorithm that allows this. Of course, this is also only in theory; I don't know whether any current implementation actually supports this type of operation.

For IPFS to scale it will be necessary to split the announcement from the actual storage

I agree. It is far too high a maintenance cost for us to run our own storage servers, so we would want to farm it out to something like S3 that manages it for us. I'm not sure it actually "doesn't scale" if we wanted to run our own servers; I guess it depends on how much overhead running the publishing logic on the storage servers would add. At the end of the day you do need some machines with storage attached.

As noted with the project you linked this is entirely possible with IPFS.

kamadorueda commented 3 years ago

Following these concepts: https://github.com/NixOS/nix/issues/859#issuecomment-718355215

I wrote a whitepaper describing the exact steps we need to follow in order to create software that mirrors any binary cache over IPFS.

The benefits are the same as the ones described in this issue.

In short, the implementation allows users to become peers of the distribution network just by using Nix normally (after launching some easy magic commands) and, of course, to get the benefits of fetching data over IPFS instead of HTTP.

The implementation is serverless: every user launches a small server locally, and there is nothing we have to modify in Hydra or the core of Nix; we've always had it all!

Please read it here! https://github.com/kamadorueda/nix-ipfs/blob/latest/README.md

And let me know your thoughts!

mohe2015 commented 3 years ago

@kamadorueda I really like the approach but I have one question: Wouldn't it make more sense to write the fuse file system in Rust or C for better performance? Depending on network speed I could imagine that python isn't that fast. But maybe you already considered that and have a longer explanation of your decision.

iavael commented 3 years ago

@kamadorueda looks nice, but I have one question: why fuse filesystem? Wouldn't it be easier (and more scalable) to use HTTP interface?

kevincox commented 3 years ago

I agree the HTTP interface sounds like a much cleaner solution. Then the user can just use {gateway}/ipns/{key} as the cache, where {gateway} can be a local IPFS gateway (http://localhost:8080) or a public gateway (https://cloudflare-ipfs.com).

Furthermore this allows configuring multiple IPFS caches to trust trivially, instead of needing to run another fuse filesystem locally for each IPFS cache you want to support. In fact this works today with no additional configuration necessary.

# /etc/nix/nix.conf
substituters = https://cloudflare-ipfs.com/ipns/{ipns-key}
trusted-public-keys = {nix-key}

iavael commented 3 years ago

@kevincox as far as I understand, the problem with IPNS is that you have to populate all nix cache keys and repopulate them on each cache update. So @kamadorueda proposed essentially a proxy which translates nix hash to ipfs hash. And my question was: why use a FUSE interface for this proxy instead of HTTP?

kamadorueda commented 3 years ago

@kamadorueda I really like the approach but I have one question: Wouldn't it make more sense to write the fuse file system in Rust or C for better performance? Depending on network speed I could imagine that python isn't that fast. But maybe you already considered that and have a longer explanation of your decision.

@kamadorueda looks nice, but I have one question: why fuse filesystem? Wouldn't it be easier (and more scalable) to use HTTP interface?

Both of you are right: an HTTP interface could work and be simpler to implement! I still need to do some tests and update the README.

I wrote the examples in Python because that's the language I know best. The truth is that, given this is an input/output-bound problem, a language with low concurrency costs would be the most performant.

At the end of the day, whatever the community knows best is better, as it allows the project to receive more contributions! Network bandwidth is the bottleneck anyway.

I agree the HTTP interface sounds like a much cleaner solution. Then the user can just use {gateway}/ipns/{key} as the cache, where {gateway} can be a local IPFS gateway (http://localhost:8080) or a public gateway (https://cloudflare-ipfs.com).

Furthermore this allows configuring multiple IPFS caches to trust trivially, instead of needing to run another fuse filesystem locally for each IPFS cache you want to support. In fact this works today with no additional configuration necessary.

# /etc/nix/nix.conf
substituters = https://cloudflare-ipfs.com/ipns/{ipns-key}
trusted-public-keys = {nix-key}

The good thing about serving the substituter as a FUSE filesystem or on some localhost:1234 is that there is no need to deal with trust (take into account that adding a bad substituter can yield a full host takeover and lots of damage to the information in your system). With a local substituter, all you have to trust is yourself and the upstream binary cache (cache.nixos.org, your own trusted Cachix, etc.).

If we use the implementation as it is, the only ipfs command needed is ipfs get <hash>, which is automatically protected by cryptography.

kevincox commented 3 years ago

So @kamadorueda proposed essentially a proxy which translates nix hash to ipfs hash.

An IPFS directory is a translation of filename -> IPFS hash. I guess this does sidestep the current issue of incremental IPFS directory update. However for the long term it is probably best just to fix that issue.

If it's not but it's available on a binary cache, stream it from the binary cache to the user AND add it to the user IPFS node.

I missed this bit. Currently this can't be done by hitting the gateway directly. However, I wonder whether it would just be easier to have a cron job that adds the current store to IPFS every once in a while instead of a proxy? Either solution would be good, though.

Also if we are doing this how does the user publish this info? Just uploading the nar isn't enough to let other people use it.

iavael commented 3 years ago

@kevincox I don't know how many keys there are in the nix binary cache, but preloading many PB of data with a cron job sounds impractical. I think even creating an IPNS directory of all the keys in the cache (and regularly updating it) is a bit too much by itself.

kamadorueda commented 3 years ago

If it's not but it's available on a binary cache, stream it from the binary cache to the user AND add it to the user IPFS node.

I missed this bit. Currently this can't be done by hitting the gateway directly. However, I wonder whether it would just be easier to have a cron job that adds the current store to IPFS every once in a while instead of a proxy? Either solution would be good, though.

Also if we are doing this how does the user publish this info? Just uploading the nar isn't enough to let other people use it.

For the moment users are just Mirroring binary caches over IPFS

In other words, users download the data they need from (the upstream binary cache / the nearest ipfs node that has it)

The upstream binary cache is the one that has the .narinfo files (small metadata files), and the distributed IPFS swarm (other people) has the nar.xz files (potentially big content files).

Users CAN'T announce store paths at their discretion (security and trust problems; it's hard to implement, but possible).

Users can only announce and receive from peers store paths that are in the upstream binary cache (cache.nixos.org, etc). If it exists in the upstream binary cache then it's trusted.

I think we can start implementing this read-only proxy; the benefits are HUGE, mainly in cost savings for all involved parties and speed of transfer. This benefits both the NixOS/Cachix infrastructure and end users.

Implementing a write-proxy is possible (hard, but possible); I just think it's better to go step by step, solving problems and adding value every day. Start with the smallest possible change that changes things for the better.

kevincox commented 3 years ago

It sounded like this was being done by the client right? This is just things that you have on your disk already. And since the nix store is immutable IPFS doesn't even need to copy the data.

I think even creating ipns directory of all keys in cache (and regulary update it) is a bit too much by itself.

I doubt it. The amount of work per narinfo is based on the depth of the IPFS directory tree. IPFS can easily store thousands of directory entries in a single block, so the depth is logarithmic with that base. This means that while the amount of work will grow over time, it will still be relatively small.

The slightly more concerning aspect may be that the NixOS project may want to host all of those narinfo files forever. This will likely require something slightly more complicated than just pinning the tree; however, we currently pay for all of the narinfo and nar files on S3, so I can't imagine that it would be much worse.

I would love to see info on the total size of narinfo files in the current cache.nixos.org.

kevincox commented 3 years ago

The upstream binary cache is the one that has the .narinfo files

Ah, so this is just proxying the narinfo requests? The doc isn't very clear on the difference between how the narinfo vs the nar are handled.

If you are just proxying the narinfo you can do something very cool. You can just transform the url parameter to point at the user's preferred gateway. (I'm assuming that that field supports absolute URLs; if not, it shouldn't be that hard to add.)

Then your proxy doesn't even see the nar requests. (And performance becomes mostly a non-issue).
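
Concretely, the proxy would only have to rewrite the URL line of each narinfo it passes through, e.g. (placeholder hash and CID, assuming a local gateway on port 8080):

Before: URL: nar/<filehash>.nar.xz
After:  URL: http://127.0.0.1:8080/ipfs/<cid-of-that-nar.xz>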

Furthermore, if this becomes widespread then we can at some point start publishing all the narinfos (pointing to IPFS) directly and remove the need for the proxy altogether. This also allows people to publish their own caches via IPFS without needing to serve HTTP at all.

kamadorueda commented 3 years ago

From the NixOS team's perspective, they pay for S3 storage + data transfer.

If we implement the proxy as I propose, the NixOS team would spend the same on S3 storage but less on data transfer, because some objects would be fetched by clients from other clients in the IPFS network instead of from S3 (or CloudFront).

Basically, users become a small CDN for the derivations they use, care about, and have locally.

There is no need for pinning services; $0 cost for that.

Users benefit from speed and binary caches benefit from cost savings: a win-win. The only added cost is the time it takes us (the volunteers) to create such software: https://github.com/kamadorueda/nix-ipfs

iavael commented 3 years ago

@kevincox wouldn't the IPNS approach require listing all keys of the binary cache for every cache update? I don't think there are merely thousands of them; most likely there are much, much more. And I haven't even touched on the properties of IPFS and its scalability. First of all, is it practical to create a listing of millions of keys (or even dozens/hundreds of millions) for every cache update with a cron job?

kamadorueda commented 3 years ago

The upstream binary cache is the one that has the .narinfo files

Ah, so this is just proxying the narinfo requests? The doc isn't very clear on the difference between how the narinfo vs the nar are handled.

Is it more clear now?

This is because nar.xz files are content-addressed, but narinfos are not. IPFS is content-addressed, and that's why this is possible with nar.xz files but not with narinfos.

kevincox commented 3 years ago

wouldn't the IPNS approach require listing all keys of the binary cache for every cache update

No, you can do incremental updates. It is just a tree, and you don't need to recompute unchanged subtrees. (Although currently the implementations that do this are not the best. I think we can use the go-ipfs mutable filesystem API, as the scale of narinfos is small. In the future we may need to implement something new, but that shouldn't be that hard.)
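
For example, a hedged sketch of such an incremental update using go-ipfs's mutable files (MFS) API (the key name and paths here are made up):

# add or replace a single narinfo without rebuilding the rest of the tree
cat <hash>.narinfo | ipfs files write --create --parents --truncate /cache/<hash>.narinfo
# only the blocks along the touched path change; republish the new root
root=$(ipfs files stat --hash /cache)
ipfs name publish --key=cache "/ipfs/$root"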

kevincox commented 3 years ago

nix requests for the narinfo files go to the upstream binary cache always

However IIUC we need to proxy the request so that we can modify it to point the url field at the proxy. (Although I guess since most caches use relative URLs we don't actually change anything, but in theory we would need to for non-relative URLs).

nix requests for the nar.xz file go to the upstream binary cache OR another peer that has such a nar.xz file; then the user becomes a peer for that nar.xz file

That makes sense. One thing to be aware of here is the timeout for when the file isn't on IPFS yet. This may result in more fetches but otherwise the user could be left there waiting forever.

kamadorueda commented 3 years ago

That makes sense. One thing to be aware of here is the timeout for when the file isn't on IPFS yet. This may result in more fetches but otherwise the user could be left there waiting forever.

Sure, this one is easy! thanks

lordcirth commented 3 years ago

Yes, you'll want a short timeout on the IPFS lookup. If something doesn't exist, it can take a long time for IPFS to decide that by default - you can't really prove it doesn't exist, you just have to decide when to give up. Since you have a good fallback, the best user experience is to give up much more quickly than normal. However, if I understand correctly, fetching the file from cache.nixos.org still results in adding the file to IPFS for future users, right?

kamadorueda commented 3 years ago

I just updated the document taking into account everything you guys said! The change has so many deltas that I think it's faster to read it all again

https://github.com/kamadorueda/nix-ipfs/blob/latest/README.md


Yes, you'll want a short timeout on the IPFS lookup. If something doesn't exist, it can take a long time for IPFS to decide that by default - you can't really prove it doesn't exist, you just have to decide when to give up. Since you have a good fallback, the best user experience is to give up much more quickly than normal. However, if I understand correctly, fetching the file from cache.nixos.org still results in adding the file to IPFS for future users, right?

yes, that's right! you may want to read this section (added a few minutes ago) https://github.com/kamadorueda/nix-ipfs/blob/latest/README.md#implementing-the-local-server

kevincox commented 3 years ago

We turn this FileHash into an IPFS CID by calling a remote translation service

I'm pretty sure there is no need for a translation service. You can just decode and re-encode the hash.

The only other nit is that you hardcode the assumption that nars live at nar/* which I don't think is required.

kamadorueda commented 3 years ago

We turn this FileHash into an IPFS CID by calling a remote translation service

I'm pretty sure there is no need for a translation service. You can just decode and re-encode the hash.

Man I did the math trying to translate the nix-sha256 into the IPFS CID and couldn't :(

I think I couldn't do it because the CID stores the hash of the merkle-whatever-techy-thing-composed-of-chunked-bits-with-metadata-and-raw-data-together instead of the nix-sha256 of the raw-data only

So nix_sha256_to_ipfs_cid(nix_sha256_hash_as_string) is not possible in terms of math operations alone. It's possible in terms of OS/network commands, if we download the entire data in order to ipfs add it and get the merkle-whatever hash (but this defeats the purpose of the entire project).

If you have any idea on this, please tell us! Of course that translation service is something I'd prefer not to develop (and pay for), but it seems needed for now.

The only other nit is that you hardcode the assumption that nars live at nar/* which I don't think is required.

That's true, although nothing to worry about for now, I think. If we follow the URL field of the .narinfo everything will be OK.

kamadorueda commented 3 years ago

We turn this FileHash into an IPFS CID by calling a remote translation service

I'm pretty sure there is no need for a translation service. You can just decode and re-encode the hash.

If it's not possible in terms of math only (I wish I were wrong), something really helpful that would save us the translation service would be a new field for the IPFS CID in the .narinfo.

In that case I think nix-copy-closure should be modified to add: IPFSCID = $(ipfs add -q --only-hash <.nar.xz>) (this just hashes; it stores nothing on the host).
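
With that change, a narinfo entry would just grow one extra field, something like (the field name is only this thread's suggestion; the values are placeholders):

URL: nar/<filehash>.nar.xz
FileHash: sha256:<filehash>
IPFSCID: <cid printed by ipfs add -q --only-hash>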

kamadorueda commented 3 years ago

$ nix-hash --type sha256 --to-base16 17g1n8hxhq7h5h4jh0vy15pp6l1yyy1rg9mdq3pi60znnj53dzzz
ffff368ab4f60313efc0ada69783f73e50736f097e0328092cf060d821b2e19d

$ sha256sum 17g1n8hxhq7h5h4jh0vy15pp6l1yyy1rg9mdq3pi60znnj53dzzz.nar.xz 
ffff368ab4f60313efc0ada69783f73e50736f097e0328092cf060d821b2e19d  17g1n8hxhq7h5h4jh0vy15pp6l1yyy1rg9mdq3pi60znnj53dzzz.nar.xz

$ ipfs add -q 17g1n8hxhq7h5h4jh0vy15pp6l1yyy1rg9mdq3pi60znnj53dzzz.nar.xz
QmPW7pVJGdV4wkANRgZDmTnMiQvUrwy4EnQpVn4qHAdrTj

https://cid.ipfs.io/#QmPW7pVJGdV4wkANRgZDmTnMiQvUrwy4EnQpVn4qHAdrTj

base58btc - cidv0 - dag-pb - (sha2-256 : 256 : 1148914FBEEBDBB92D2DEC92697CFA76D7D36DA30339F84FCE76222941015BA2)

ipfs sha256: 1148914FBEEBDBB92D2DEC92697CFA76D7D36DA30339F84FCE76222941015BA2
nix  sha256: ffff368ab4f60313efc0ada69783f73e50736f097e0328092cf060d821b2e19d

The IPFS hash is the hash of a data structure composed of metadata and linked chunks; the Nix hash is just the hash of the raw content.

kevincox commented 3 years ago

Ah shoot, you are right. The file will at least have the proto wrapper, and it gets more complicated if the file is multiple blocks in size (which it probably is). I think I was confused by the IPFS git model because it has isomorphic hashes. However, it appears that that doesn't really work either; it just breaks for files larger than a block. I guess I'll sleep on it and see if there is something clever we can do.

In that case I think nix-copy-closure should be modified to add: IPFSCID = $(ipfs add -q --only-hash <.nar.xz>) (this just hashes; it stores nothing on the host).

Of course this forces the chunking strategy to be the current default. It would probably be better to use variable-length (rolling hash) chunking (this is probably something worth adding to the current design). But either way, encoding the CID without actually pinning the file to IPFS or somehow indicating the chunking method will probably result in issues down the line.

Ericson2314 commented 3 years ago

I do still hope my idea at the bottom of https://discuss.ipfs.io/t/git-on-ipfs-links-and-references/730/24 will work. It could work for nars too (modern IPFS under the hood cares more about the underlying multihash than about the multicodec part of the CID).

kamadorueda commented 3 years ago

In that case I think nix-copy-closure should be modified to add: IPFSCID = $(ipfs add -q --only-hash <.nar.xz>) (this just hashes; it stores nothing on the host).

Of course this forces the chunking strategy to be the current default.

this one can be specified

-s, --chunker string - Chunking algorithm, size-[bytes], rabin-[min]-[avg]-[max] or buzhash. Default: size-262144.

so maybe adding another field to the .narinfo: IPFSChunking = size=262144, could work

this way the ipfs add can be reproduced on any host, past or future
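
For example (hypothetical file name; the flags are the ones documented above):

ipfs add -q --only-hash --chunker=size-262144 <filehash>.nar.xz    # should yield the same CID on any host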

from a user perspective the ipfs get will work for any chunking strategy

It would probably be better to use variable-length (rolling hash) chunking (this is probably something worth adding to the current design). But either way, encoding the CID without actually pinning the file to IPFS or somehow indicating the chunking method will probably result in issues down the line.

Maybe, yes, someone who reads the .narinfo can be tempted to think the file is pinned/stored somewhere on the ipfs swarm and then discover it's not

At the end of the day I think this is kind of the intended behaviour: everyone knows data can be available, and then not! Only data that people care about remains over time.

kevincox commented 3 years ago

so maybe adding another field to the .narinfo: IPFSChunking = size=262144, could work

Yeah, I think that would be a necessary addition if we are going to do that.

from a user perspective the ipfs get will work for any chunking strategy

Yes, but my understanding is that this proposal relies on users uploading the nar. And if they can't upload the nar and end up with the same hash, no one will ever be able to download it from IPFS.

At the end of the day I think this is kind of the intended behaviour: everyone knows data can be available, and then not! Only data that people care about remains over time.

This is an okay guarantee if we want to keep the fallback forever. However, it would be nice if this were a solution that could potentially replace the current cache. (Of course an HTTP gateway would be provided for those not using IPFS natively.)


I'm starting to wonder if this is the best approach. What about something like this:

  1. Publish a feed of narinfo files published to cache.nixos.org
  2. Write a service that consumes this feed and:
    1. Implements an IPFS store that is backed by HTTP and uses this to advertise the hash.
    2. Publishes its own narinfo files that point at IPFS (the URL field is modified to /ipfs/{hash}).
    3. For now this could be any sort of storage.
    4. Eventually it would be nice to use a directory published in IPNS.

The obvious downside is that the service itself will use more bandwidth as it needs to upload the nar files (hopefully only occasionally). It also requires writing an IPFS Store HTTP backend that doesn't yet exist (AFAIK).

The upsides are:

  1. The cache could transparently be made self-standing. By hosting the nar files directly we can remove s3 from the equation (if one day this becomes the most popular solution).
  2. No extra software to run on the client. The client only needs to run an IPFS node. (Or use a public gateway, but it's probably best to encourage running your own node.)
  3. Could be run "fully decentralized" with the IPNS directory. (Although we need to have someone publishing it)
  4. The "uploader" is the one writing the hash so there is no concern with chunking. It can be changed at any time and be fine.

I think the thing I like about this is that it is simple to the user. It just looks like we are hosting a cache over IPFS. They don't need to worry about proxies and explicitly uploading files that they downloaded.


It is probably also worth pointing out that the long-term solution is probably to drop nar files altogether and just upload the directories themselves to IPFS. I think all of the proposals could work fine with this; you just need to add a field to the narinfo saying that it is a directory rather than an archive. However, this would require much bigger client-side changes and would not be directly accessible over HTTP gateways. So I think that is a long way off.