roberth opened 1 year ago
if space is an issue we'd propose something else entirely: reserve the MSB of PosIdx, and if it's set treat the remaining 31 bits as a position. we could split it as eg 12 bits of file index and 19 bits of offset within that file (which would necessitate re-reading a file to generate line/col info) or 8 bits column and 11 bits of line. this should be sufficient for the vast majority of positions, and doesn't require enlarging the (already quite bloated) Expr
> 11 bits of line
I don't think 2k lines would cut it.
> 19 bits of [byte] offset
That's half a meg; sufficient for all hand written files, but not generated ones.
The positions in hackage-packages.nix have helped me at times, ctrl+clicking in vscode to see what was generated, so I would prefer to support those too.
How about half-way between your suggestion and mine: use the file offset, but also use the counter to "serialize" the files. That also solves the concurrency issue. By incrementing the counter with the file size, all global mutation can be done in one atomic operation per file.
> if space is an issue
My goal is not having to worry about which expressions get a position and which ones don't because of memory. That will help with #8209. I haven't measured the positions table.
> The positions in hackage-packages.nix have helped me at times, ctrl+clicking in vscode to see what was generated, so I would prefer to support those too.
we'd still keep the position table for those. but if the position table only services "large" files and most files fit within the 31 LSBs, then the position table is no longer a major factor. you're right that 11 bits for lines may not be enough, but we could encode this more efficiently still: say, instead of 11+8 we could use 1+(10+8|4+14), if we allow for backtracking through all preceding positions in a file to calculate a current position (storing only line offsets, and assuming that after large line jumps we won't also see a large column offset). but that's probably not worth the effort, because:
> How about half-way between your suggestion and mine: use the file offset, but also use the counter to "serialize" the files. That also solves the concurrency issue. By incrementing the counter with the file size, all global mutation can be done in one atomic operation per file.
that also makes sense. assuming that files don't change during eval is probably safe (and can be detected and diagnosed for), so we wouldn't even have to keep the file text around. that would be way better than positions, at the cost of rereading files for position resolution.
(added note: this approach will be limited to either 4GB of parsed text, or 2GB plus whatever fits in the position table. it may make sense to build a special allocator for expressions and make positions dynamically sized to break past this, so folks can retry with `--use-large-positions` if they really need to. that and file-offset-based indexing could completely replace the position table with (probably) no downsides)
(ceterum censeo carthag... cough expression printing as nix does it eg for asserts should also be replaced with a contextful location trace!)
Is your feature request related to a problem? Please describe.
The position table consumes memory, and its per-position memory cost stops us from generating many positions, which is also bad.
Describe the solution you'd like
I don't think we need a position table, if we make clever use of a file table instead. Instead of a table, we could use a coordinate scheme where we split the position bits into two numbers: one for the file and one for picking out the n-th AST node. However, this gets a bit close to the limits of our bit budget, so we'd end up dropping some positions.
Instead, a more "efficient" scheme is to use consecutive numbers. Functionally, this is concatenating all files and then indexing that. (But of course not implemented like that)
I propose the following:
I think the numbering will be in post-order.
Possible complications with their solutions
- Arbitrary positions: currently you can create a Pos with a truly arbitrary location, i.e. an arbitrary character inside an AST node. If we need that, we'd have to reserve a range for those. I think these would be rare.
- Serialization of parsing: I don't think we parse concurrently, but if we want to, we could use a pool of file tables and assign a range of positions to each. A parser claims any of the file tables for itself and continues as described before.
Describe alternatives you've considered
A product instead of a sum, as introduced and rejected in the description of the solution; no other ideas so far.
Additional context
Priorities
Add :+1: to issues you find important.