Store derivation src data in a json file

jlesquembre commented 1 year ago

Issue description

There are many different tools to generate updates for nixpkgs (nixpkgs-update, update-source-version, custom script in passthru.updateScript, nurl, ... ) Many of this tools are scanning nix files, and doing string substitution, which is quite fragile.

Goal

To make it easier to consume and manipulate the src data, I propose to move the fetchers data to a json file. Moving from this:

{ stdenv }:
stdenv.mkDerivation {
  src = fetchFromGitHub {
    owner = "owner";
    repo = "repo";
    rev = "1.0";
    hash = "sha256-B3Yd4lxdwqfCnfmZdp+i/Mzwn/aEuZ0ovagDxuR6lxo=";
  }; 
}

to this:

{ stdenv }:
stdenv.mkDerivation { 
  src = stdenv.srcFromJSON ./src.json;
}

{
  "fetcher": "fetchFromGitHub",
  "fetcherArgs" : {
    "owner": "owner",
    "repo": "repo",
    "rev": "1.0",
    "hash": "sha256-B3Yd4lxdwqfCnfmZdp+i/Mzwn/aEuZ0ovagDxuR6lxo="
  }
}

Details about the implementation are up for discussion, the main idea is to store fetcher's data as data in a JSON file.

jtojnar commented 1 year ago

We did this in GNOME package set in the past and it made doing updates manually annoying, having to change multiple files at the same time.

And tools would still need to scan Nix files to find stdenv.srcFromJSON (in case multiple sources are used, or if stdenv.srcFromJSON did not expose the file path in src attribute).

Or we would need to rely on a fragile convention that source is always stored in src.json. And if we added support for multiple sources, JSON would require it to be denormalized, which would make it easier to use inconsistent sources for packages like https://github.com/NixOS/nixpkgs/blob/bb16729664944850472d40b1c30012c17d45134e/pkgs/tools/graphics/gmic/default.nix#L53.

If we are to fix the fragility, the best way forward is probably making the updating tools AST-based.

jlesquembre commented 1 year ago

@jtojnar Can you expand on why it was more annoying to do manual updates to JSON files vs nix files?

I don't think that the convention of storing the source data in a src.json file is fragile. JSON advantage is its universality, every language supports it, while only a few languages (only Rust and Haskell, I think) have libraries to parse Nix files.

jtojnar commented 1 year ago

Can you expand on why it was more annoying to do manual updates to JSON files vs nix files?

Because I would need to update version and hash in a JSON file, and then do other changes in another file. Also it was not possible to see what version the package is with just a glance at the Nix expression.

I don't think that the convention of storing the source data in a src.json file is fragile.

It is fragile in the sense that the file name is not enforced so we would still need to parse the Nix file to find the file name. Well, I guess we could add assert (builtins.baseNameOf path) == "src.json" to stdenv.srcFromJSON.

infinisil commented 1 year ago

I do think this is a good idea in general, here's some good parts about it:

It establishes a clear separation between content that needs to be updated by humans vs. content that can be updated by a machine
It removes the need for a lot of hacky heuristics in https://github.com/Mic92/nix-update and https://github.com/ryantm/nixpkgs-update
Tooling shouldn't need to modify/generate Nix at all. The value model of Nix is just JSON with functions, but you can't reasonably generate useful functions, so you might as well not use Nix and generate JSON instead. All processing of that JSON can be done in a separate, human-written Nix file.
This moves more towards standardization, maybe allowing third-parties to also use the same approach and tooling, and in general having more stable guarantees

Also, this is pretty much just lockfiles, which is already heavily and successfully used by a lot of tools. Flakes in fact as well, which leads me to think that Flakes and Nixpkgs could use the same standard of locking sources.

jtojnar commented 1 year ago

It establishes a clear separation between content that needs to be updated by humans vs. content that can be updated by a machine

I would argue this distinction is a bit arbitrary. It is just as well feasible for machines to safely do other stuff – first thing that comes to mind is detecting patches that have already been applied and removing it from patches (I have plans for implementing that).

It removes the need for a lot of hacky heuristics in Mic92/nix-update and ryantm/nixpkgs-update

In my experience, the harder part are heuristics for determining the update policy. This is trivial in package managers with registries, just look at the latest available version matching some constraints in the human part. But for general software, this is not enough.

For example, we would want to express things like:

And as far as I can tell, the conclusion was pretty much that this is too hard so it is best left for a general purpose programming language (e.g. updateScript). At that point, I doubt discarding the heuristics for updating the code gives us much.

Tooling shouldn't need to modify/generate Nix at all. The value model of Nix is just JSON with functions, but you can't reasonably generate useful functions, so you might as well not use Nix and generate JSON instead. All processing of that JSON can be done in a separate, human-written Nix file.

As mentioned, we would need at least some form of templating so why just not use Nix’s string interpolation and let bindings. Sure the lockfile could be denormalized but we would still need the normal form somewhere that the update tool can read. So we are either back to heuristics, or trying to come up with a generic enough policy format.

This moves more towards standardization, maybe allowing third-parties to also use the same approach and tooling, and in general having more stable guarantees

Is allowing third-party tools actually worth it? Maybe time would be better spent on the update script protocol (which, I now realize, would probably be of interest to the team).

But even if the standardization is desired, this can at best be the last step. Again, update policy would need to be formalized before it can be useful.

aakropotkin commented 1 year ago

Hey I think y'all would absolutely be interested in a system I wrote that is literally this.

I had to serialize fetcher args for a node2nix style lib and have two implementations of a fetcher serializer with the nix builtins and several nixpkgs fetchers already covered.

Its abstracted so you can swap between builtins and Nixpkgs fetchers seamlessly.

It took a lot of effort to write all of them so using what I have as a foundation could save a lot of time.

One approach used modules and another is "from scratch" with regular nix expressions.

The only note I'll add is that for large JSON files you may have a performance loss compared to a Nix file because the evaluator cache does work on JSON.

Full URL parser, and abstractions for "flake refs": https://github.com/aakropotkin/rime#flake-ref-types

Wrappers for Nixpkgs and builtin fetchers ( uses previous repo, no modules ): https://github.com/aakropotkin/rime#flake-ref-types

Second approach using modules ( simple ): https://github.com/aakropotkin/floco/blob/main/modules/fetcher/interface.nix

Second approach using more complex modules that align with the Nix "inputs"/ libfetchers interface ( potentially useful but much more complicated, not my favorite ): https://github.com/aakropotkin/floco/blob/1464d0098e8e2b28851099df82edeb89dbb53745/lib/types.nix#L167

For the module system the "serialize" and "unserialize" routines handle relative paths correctly, and carry a string field that can be used to indicate which "backend" should be used. For example the following forms are equivalent ( except for the fetcher arg ):

{
  a = {
    type = "tarball";
    url = "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz";
    sha256 = "...";
  };

  b = {
    url = "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz";
    hash = "...";
    unpack = true;  # optional because of `.tgz' suffix
  };

  c = "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz?narHash=...";

  d = "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz?sha256=...";

  e = {
    url = "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz";
    fetcher = "nixpkgs#fetchZip";
    narHash = "...";
  };

  # `type = "tarball";' is inferred from `.tgz' suffix 
  f = {
    url = "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz";
    fetcher = "builtins.fetchTree";
    narHash = "...";
  };

  g = {
    url = "https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz";
    fetcher = "builtins.fetchTarball";
    narHash = "...";
  };

  h = "builtins.fetchTarball:https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz?hash=...";

  # You can set a "preferred backed" for `tarball', default is `builtins.fetchTarball'
  i = "tarball+https://registry.npmjs.org/lodash/-/lodash-4.17.21.tgz?sha256=...";
}

Also this works:

{
  j = "path:./foo";
  k = {
    path = "./foo";
    # This works but probably not recommended for `nixpkgs' ( nifty though )
    # You can also use an string being the name of a declared filter.
    filter = "name: type: ( baseNameOf name ) != \".git\"";
  };
}

NixOS / nixpkgs

Store derivation src data in a json file #295033