NixOS / nix

Nix, the purely functional package manager
https://nixos.org/
GNU Lesser General Public License v2.1

Remove the position table, add 32 bits to `Expr`? #8208

Open roberth opened 1 year ago

roberth commented 1 year ago

Is your feature request related to a problem? Please describe.

The position table consumes memory, and its per-position memory cost discourages us from generating positions for many expressions, which is also bad.

Describe the solution you'd like

I don't think we need a position table, if we make clever use of a file table instead. Instead of a table, we could use a coordinate scheme where we split the position bits into two numbers: one for the file and one for picking out the n-th AST node. However, this gets a bit close to the limits of our bit budget, so we'd end up dropping some positions.
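For concreteness, the "half split" coordinate scheme described above could be sketched as follows. This is an illustrative sketch only (the names are hypothetical, not from the Nix source): the upper 16 bits select a file, the lower 16 bits select the n-th AST node within it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of the "half split" coordinate scheme: 16 bits
// of file index, 16 bits of per-file AST node index, packed into one
// 32-bit PosIdx. This is the scheme rejected below for being too
// close to the bit budget, shown only to make the trade-off concrete.
using PosIdx = uint32_t;

constexpr PosIdx packPos(uint16_t fileIdx, uint16_t nodeIdx) {
    return (static_cast<PosIdx>(fileIdx) << 16) | nodeIdx;
}

constexpr uint16_t fileOf(PosIdx p) { return p >> 16; }
constexpr uint16_t nodeOf(PosIdx p) { return p & 0xffff; }
```

With only 16 bits per axis, a single generated file with more than 65536 AST nodes already overflows the node index, which is exactly the problem with the large Nixpkgs files listed below.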

A half split gives us 16 bits to index the files, and if you were to evaluate all of a single Nixpkgs, you're close to maxing out 15 bits. In reality, you wouldn't parse all of those, but you may evaluate multiple nixpkgs, especially with flakes, without inter-flake caching. Worse, all-packages, hackage-packages and node-packages will not have position numbers for most of their AST. A rough estimate with `wc -w` gives, at the high end:

```
  41933 pkgs/servers/web-apps/outline/yarn.nix
  47588 pkgs/development/mobile/androidenv/repo.json
  51079 maintainers/maintainer-list.nix
  51080 pkgs/servers/monitoring/karma/node-packages.nix
  57125 pkgs/desktops/gnome/extensions/extensions.json
  98313 pkgs/top-level/perl-packages.nix
 102093 pkgs/tools/typesetting/tex/texlive/tlpdb.nix
 121178 pkgs/top-level/all-packages.nix
 188951 pkgs/applications/editors/emacs/elisp-packages/recipes-archive-melpa.json
 262215 pkgs/development/lisp-modules/imported.nix
 269585 pkgs/development/lisp-modules-new-obsolete/imported.nix
 275547 pkgs/development/r-modules/cran-packages.nix
 323407 pkgs/development/node-packages/node-packages.nix
1156660 pkgs/development/haskell-modules/hackage-packages.nix
```

Instead, a more "efficient" scheme is to use consecutive numbers. Functionally, this amounts to concatenating all files and then indexing into the result (though of course it would not be implemented like that).

I propose the following:

```
# pseudocode

typedef int32_t PosIdx; // already exists, but newtyped

class EvalState additions {
    int32_t posCounter = 1;
    std::map<int32_t, Path> filePosRanges;
    std::map<Path, Expr *> fileParseCache; // already exists
}

class Expr additions {
    int32_t posIdx;
}

parse(file):
    // save the start of this file's range
    filePosRanges.insert(posCounter, file)
    for each expr, as they are constructed in order:
        expr.posIdx = posCounter++

getPos(n):
    // greatest range start <= n, i.e. upper_bound(n) stepped back once
    (fileStart, file) = the last entry of filePosRanges with key <= n
    fileExpr <- look up file in fileParseCache; return a null pos if not found
    return (find expr with posIdx == n in fileExpr)
```

I think the numbering will be in post-order.
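The range lookup in `getPos` can be sketched in C++ as follows, assuming `filePosRanges` maps each file's starting position to its path (illustrative names, not the actual Nix API). The subtle part is that finding the file whose range contains `n` means finding the greatest key less than or equal to `n`:

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

using Path = std::string;

// Find the file whose position range contains n: the entry with the
// greatest key <= n. With std::map that is upper_bound(n) stepped back
// once; lower_bound(n) alone would overshoot whenever n is not exactly
// a range start. Returns nullptr if n precedes every recorded range.
const Path * fileContaining(const std::map<uint32_t, Path> & ranges, uint32_t n) {
    auto it = ranges.upper_bound(n); // first key strictly greater than n
    if (it == ranges.begin()) return nullptr;
    --it; // greatest key <= n
    return &it->second;
}
```

The lookup is O(log number-of-files), after which the expression with the matching `posIdx` still has to be located within the parsed file.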

Possible complications with their solutions

Arbitrary positions: currently you can create a Pos with a truly arbitrary location - an arbitrary character inside an AST node. If we need that, we'd have to reserve a range for those. I think these would be rare.

Serialization of parsing: I don't think we parse concurrently, but if we want to, we could use a pool of file tables and assign a range of positions to each. A parser claims any of the file tables for itself and continues as described before.

Describe alternatives you've considered

A product instead of a sum, as introduced and rejected in the description of the solution above; no other ideas so far.

Additional context

Priorities

Add :+1: to issues you find important.

pennae commented 1 year ago

if space is an issue we'd propose something else entirely: reserve the MSB of PosIdx, and if it's set treat the remaining 31 bits as a position. we could split it as eg 12 bits of file index and 19 bits of offset within that file (which would necessitate re-reading a file to generate line/col info) or 8 bits column and 11 bits of line. this should be sufficient for the vast majority of positions, and doesn't require enlarging the (already quite bloated) Expr
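A rough sketch of this tagging scheme, under the 12-bit file index + 19-bit offset split suggested above (all names are illustrative, not from the Nix source):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the MSB-tagged scheme: if the top bit of PosIdx is set,
// the remaining 31 bits encode the position inline (12 bits of file
// index, 19 bits of byte offset within that file); otherwise the value
// is an ordinary index into the existing position table.
using PosIdx = uint32_t;

constexpr uint32_t INLINE_BIT = 1u << 31;

struct InlinePos { uint32_t fileIdx; uint32_t byteOffset; };

// Caller must ensure fileIdx < 2^12 and byteOffset < 2^19; positions
// that don't fit would fall back to the position table.
constexpr PosIdx encodeInline(uint32_t fileIdx, uint32_t byteOffset) {
    return INLINE_BIT | (fileIdx << 19) | byteOffset;
}

constexpr bool isInline(PosIdx p) { return p & INLINE_BIT; }

constexpr InlinePos decodeInline(PosIdx p) {
    return { (p >> 19) & 0xfff, p & ((1u << 19) - 1) };
}
```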

roberth commented 1 year ago

11 bits of line

I don't think 2k lines would cut it.

19 bits of [byte] offset

That's half a meg; sufficient for all hand written files, but not generated ones.

The positions in hackage-packages.nix have helped me at times, ctrl+clicking in vscode to see what was generated, so I would prefer to support those too.

How about half-way between your suggestion and mine: use the file offset, but also use the counter to "serialize" the files. That also solves the concurrency issue. By incrementing the counter with the file size, all global mutation can be done in one atomic operation per file.
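A minimal sketch of the "one atomic operation per file" idea (hypothetical names, not the actual implementation): each parser claims a contiguous range of positions sized to its file, so the per-expression numbering inside that range needs no further synchronization.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Global counter over the virtual concatenation of all parsed files.
std::atomic<uint32_t> posCounter{1};

// Claim a range of positions for a file about to be parsed. Returns
// the file's start position; the parser then assigns
// fileStart + byteOffset to each expression it constructs, with no
// further global mutation needed for that file.
uint32_t claimRange(uint32_t fileSize) {
    return posCounter.fetch_add(fileSize);
}
```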

if space is an issue

My goal is not having to worry about which expressions get a position and which ones don't because of memory. That will help with #8209. I haven't measured the positions table.

pennae commented 1 year ago

The positions in hackage-packages.nix have helped me at times, ctrl+clicking in vscode to see what was generated, so I would prefer to support those too.

we'd still keep the position table for those. but if the position table only services "large" files and most files fit within the 31 LSBs then the position table is no longer a major factor. you're right that 11 bits for lines may not be enough, but we could encode this more efficiently still. say, instead of 11+8 we could use 1+(10+8|4+14) if we allow for backtracking through all preceding positions in a file to calculate a current position (and storing only line offsets, assuming that after large line jumps we won't also see a large column offset). but that's probably not worth the effort, because:

How about half-way between your suggestion and mine: use the file offset, but also use the counter to "serialize" the files. That also solves the concurrency issue. By incrementing the counter with the file size, all global mutation can be done in one atomic operation per file.

that also makes sense. assuming that files don't change during eval is probably safe (and can be detected and diagnosed for), so we wouldn't even have to keep the file text around. that would be way better than positions, at the cost of rereading files for position resolution.
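the "rereading files for position resolution" cost amounts to a linear scan per displayed position, e.g. (illustrative sketch, 1-based line/column):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Recompute 1-based (line, column) from a byte offset by scanning the
// file's text for newlines. This cost is paid only when a position is
// actually displayed, not during evaluation.
std::pair<size_t, size_t> lineCol(const std::string & text, size_t offset) {
    size_t line = 1, col = 1;
    for (size_t i = 0; i < offset && i < text.size(); ++i) {
        if (text[i] == '\n') { ++line; col = 1; }
        else ++col;
    }
    return {line, col};
}
```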

(added note: this approach will be limited to either 4GB of parsed text or 2GB + whatever fits in the position table. it may make sense to build a special allocator for expressions and make positions dynamically sized to break past this, so folks can retry with --use-large-positions if they really need to. that and file-offset-based indexing could completely replace the position table with (probably) no downsides)

(ceterum censeo carthag... cough expression printing as nix does it eg for asserts should also be replaced with a contextful location trace!)