NixOS / nix

Nix, the purely functional package manager
https://nixos.org/
GNU Lesser General Public License v2.1

Encoding store Paths on Windows and Unix #3197

Open Ericson2314 opened 4 years ago

Ericson2314 commented 4 years ago

As #2634 points out, we can share derivations and builds between Windows and Unix machines. That means we cannot just be like Rust, new Python, etc., and handle both types of path correctly and be done with it. We also need to figure out how to put a Windows path in a Unix store, and a Unix path in a Windows store.

#2634 handles problems with the path root (/... vs DOS-style C:\... vs UNC \\..\...), so this issue can be just about the encoding. As @conferno points out:

I must explain the constraints that make it difficult to use many imperative languages originating from the Unix world as stdenv.shell.

  1. Path is UTF-16 string and this is difficult to avoid, because

    1. we hit the 256-byte length limit (I won't go into details now; there are some ways to work around it)

    2. files with names that cannot be represented as UTF-8 could be on disk, after fetchzip or in the check phase. stdenv.shell must at least be able to cp -r and rm -rf directories that have such files inside.

    Thus, internal representation of paths has to be UTF-16, so neither bash+coreutils nor stock perl nor stock lua would work out of the box :(

To start tackling this issue, I would recommend WTF-8 (https://simonsapin.github.io/wtf-8/). Rust uses it too. It can encode any Windows path so that valid Unicode keeps its meaning in both directions, and it also round-trips. It cannot, however, represent non-UTF-8, non-WTF-8 Unix paths on Windows. We cannot fix that: Windows paths are sequences of 16-bit code units, all of which WTF-8 already accounts for, so there is no room left to represent anything else. Beyond representing foreign paths, this is a good canonical form to ensure that "normal" Windows paths have the same hash.
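To make the round trip concrete, here is a minimal sketch in Python (chosen for brevity; this is not Nix code, and the names `wtf8_encode` and `wtf8_decode` are hypothetical). It treats a Windows path as a list of 16-bit code units, and assumes the input to `wtf8_decode` is already well-formed WTF-8.

```python
def wtf8_encode(units):
    """Encode 16-bit code units (possibly ill-formed UTF-16) as WTF-8 bytes."""
    out = bytearray()
    i = 0
    while i < len(units):
        cp = units[i]
        i += 1
        # A high surrogate followed by a low surrogate forms one code point.
        if 0xD800 <= cp <= 0xDBFF and i < len(units) and 0xDC00 <= units[i] <= 0xDFFF:
            cp = 0x10000 + ((cp - 0xD800) << 10) + (units[i] - 0xDC00)
            i += 1
        # Generalized UTF-8: a lone surrogate falls into the 3-byte branch.
        if cp < 0x80:
            out.append(cp)
        elif cp < 0x800:
            out += bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
        elif cp < 0x10000:
            out += bytes([0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
        else:
            out += bytes([0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
                          0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)])
    return bytes(out)


def wtf8_decode(data):
    """Decode well-formed WTF-8 bytes back to 16-bit code units."""
    units = []
    # "surrogatepass" lets Python's UTF-8 decoder accept encoded lone surrogates.
    for ch in data.decode("utf-8", "surrogatepass"):
        cp = ord(ch)
        if cp >= 0x10000:
            cp -= 0x10000
            units += [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]
        else:
            units.append(cp)
    return units


# A name with an unpaired high surrogate between "a" and "b".
lone = [0x61, 0xD800, 0x62]
encoded = wtf8_encode(lone)          # b"a\xed\xa0\x80b"
restored = wtf8_decode(encoded)      # the same three code units
```

For well-formed UTF-16 input the output is ordinary UTF-8, which is what lets "normal" names hash identically on both platforms.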

We can also normalize path separators, since Windows accepts both.

Unix paths that are not well-formed WTF-8 I suggest we just ban. Do any exist already, say in cache.nixos.org?
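Such a ban could be enforced with a check along these lines. This is only a sketch in Python, not Nix code; `is_well_formed_wtf8` is a hypothetical name. It relies on Python's `surrogatepass` error handler to accept encoded lone surrogates, then rejects a high surrogate directly followed by a low one, since WTF-8 requires that pair to be encoded as a single 4-byte sequence.

```python
def is_well_formed_wtf8(name: bytes) -> bool:
    """Return True if a Unix file name (raw bytes) is well-formed WTF-8."""
    try:
        # Accepts valid UTF-8 plus 3-byte encodings of lone surrogates.
        text = name.decode("utf-8", "surrogatepass")
    except UnicodeDecodeError:
        return False
    # A separately encoded surrogate pair is not allowed in WTF-8.
    for a, b in zip(text, text[1:]):
        if "\ud800" <= a <= "\udbff" and "\udc00" <= b <= "\udfff":
            return False
    return True


ok_ascii = is_well_formed_wtf8(b"hello.txt")                  # True
ok_lone = is_well_formed_wtf8(b"\xed\xa0\x80")                # True: lone surrogate
bad_latin1 = is_well_formed_wtf8(b"caf\xe9")                  # False: not UTF-8
bad_pair = is_well_formed_wtf8(b"\xed\xa0\xbd\xed\xb8\x80")   # False: split pair
```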

Ericson2314 commented 4 years ago

We can also normalize path separators, since Windows accepts both

In UNC paths, only \ is accepted. And Windows-native programs like dir and del understand only \ in command-line arguments.

Is / an allowed character in filenames though? If not, we can just losslessly convert / to \ on the fly.
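As a sketch of that conversion (Python for brevity, hypothetical function names, relative paths only): this assumes neither / nor \ can occur inside a Windows file name component, which is what the Windows naming rules say, and that the canonical stored form uses /.

```python
def to_canonical(windows_relative: str) -> str:
    """Windows relative path -> canonical form with / separators."""
    return windows_relative.replace("\\", "/")


def to_windows(canonical_relative: str) -> str:
    """Canonical form -> Windows relative path with \\ separators."""
    return canonical_relative.replace("/", "\\")


canonical = to_canonical("share\\doc/readme.txt")   # "share/doc/readme.txt"
native = to_windows(canonical)                      # "share\\doc\\readme.txt"
```

The reverse direction is not covered here: a Unix file name may itself contain \, so a Unix path would need a separate rule before being shown on Windows.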

The problem is that Lua or bash+coreutils (or whatever is used instead of them) must do _deletePath staying in UTF-16 when doing recursion, and preserve the UTF-16 names between FileFile and RemoveDirectory.

Per https://www.lua.org/manual/5.3/manual.html#3.1, strings in Lua can contain arbitrary bytes (even including nulls), so WTF-8, or even the original UTF-16 with unpaired surrogates, will work fine, at the cost of confusing literals.

Of course there could be any lossless representation of UTF-16, but it is not part of the interface.

But the canonical form which is hashed is part of the interface. I would hope that "clean" ASCII / Unicode relative paths (assuming something like https://github.com/NixOS/nix/issues/2634, where we don't store the /nix/store in derivations) have the same hash. This will make transferring data/builds without self-references (also important because of the intensional store) between Windows and Unix much, much easier.

But if you meant that all store paths are valid Unicode (hashed from UTF-8), so UTF-16 is only a concern to the bash + coreutils replacement and not Nix itself, then yes, it is an unexposed implementation detail. :)

stale[bot] commented 3 years ago

I marked this as stale due to inactivity.

stale[bot] commented 2 years ago

I closed this issue due to inactivity.

Ericson2314 commented 2 years ago

Still interested.

Ericson2314 commented 8 months ago

#9205 should help with this.