Open Ericson2314 opened 4 years ago
We can also normalize path separators, since Windows accepts both
In UNC paths, only \ is accepted. And Windows-native programs like dir and del understand only \ in command-line arguments.
Is / an allowed character in filenames though? If not, we can just losslessly convert / to \ on the fly.
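As far as I know, / is a reserved character in Win32 file names, so a name component never contains it and the / to \ direction loses nothing. A minimal sketch of that conversion follows; the function names are made up for illustration and are not part of Nix.

```python
def to_windows_separators(path: str) -> str:
    """Replace every forward slash with a backslash.

    Assumption: no path component contains '/', which holds for Win32 names.
    """
    return path.replace("/", "\\")


def to_unix_separators(path: str) -> str:
    """Replace every backslash with a forward slash.

    Only safe for paths that originated on Windows: on Unix a backslash
    is an ordinary file-name character, so this direction can be lossy.
    """
    return path.replace("\\", "/")


if __name__ == "__main__":
    print(to_windows_separators("foo/bar/baz.txt"))
    # A Unix file literally named 'a\b' would be misread as two components:
    print(to_unix_separators("a\\b"))
```

The asymmetry is the point: converting on the fly is fine when the path is known to be a Windows path, but a Unix path containing a literal backslash needs separate handling.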
The problem is that Lua or bash + coreutils (or whatever is used instead of them) must do _deletePath while staying in UTF-16 when recursing, and must preserve the UTF-16 names between FileFile and RemoveDirectory.
So per https://www.lua.org/manual/5.3/manual.html#3.1, strings in Lua can contain arbitrary bytes (even including nulls), so WTF-8 or even UTF-16 + unpaired surrogates (the original), at the cost of confusing literals, will work fine.
Of course there could be any lossless representation of UTF-16, but it is not part of the interface.
But the canonical form which is hashed is part of the interface. I would hope "clean" ASCII / Unicode relative paths (assuming something like https://github.com/NixOS/nix/issues/2634 where we don't store the /nix/store in derivations) have the same hash. This will make transferring data/builds without self-references (also important because of the intensional store) between Windows and Unix much, much easier.
But if you meant all store paths are valid Unicode (hashed from UTF-8), so UTF-16 is only a concern to the bash + coreutils replacement and not Nix itself, then yes, it is an unexposed implementation detail. :)
I marked this as stale due to inactivity.
I closed this issue due to inactivity.
Still interested.
As #2634 points out, we can share derivations and builds between Windows and Unix machines. That means we cannot just be like Rust, new Python, etc., and handle both types of path correctly and be done with it. We also need to figure out how to put a Windows path in a Unix store, and a Unix path in a Windows store.
#2634 handles problems with the path root (/... vs DOS-style C:\... vs UNC \\..\...); this can be just about the encoding. As @conferno points out:
To start tackling this issue, I would recommend https://simonsapin.github.io/wtf-8/. Rust uses it too. It can encode any Windows path such that valid Unicode is meaning-preserved in both directions, and it also round-trips. It cannot, however, represent non-UTF-8, non-WTF-8 Unix paths on Windows. We cannot fix that: Windows names are sequences of fixed-width 16-bit code units, and WTF-8 already maps onto every such sequence, so there is no room left to represent anything else. Beyond representing foreign paths, this is a good canonical form to ensure that "normal" Windows paths have the same hash.
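To make the round-trip claim concrete, here is a rough sketch of the conversion using only Python's standard codecs with the surrogatepass error handler. It assumes the Windows name is given as a list of 16-bit code units; the function names are hypothetical and this is not how Nix or Rust implement it.

```python
import struct
from typing import List


def utf16_units_to_wtf8(units: List[int]) -> bytes:
    """Encode 16-bit code units (possibly with unpaired surrogates) as WTF-8."""
    raw = struct.pack(f"<{len(units)}H", *units)
    # Paired surrogates become one supplementary code point;
    # unpaired surrogates pass through as lone surrogate code points.
    text = raw.decode("utf-16-le", errors="surrogatepass")
    # Lone surrogates are written as 3-byte sequences, everything else as UTF-8.
    return text.encode("utf-8", errors="surrogatepass")


def wtf8_to_utf16_units(data: bytes) -> List[int]:
    """Decode WTF-8 back into the original 16-bit code units."""
    text = data.decode("utf-8", errors="surrogatepass")
    raw = text.encode("utf-16-le", errors="surrogatepass")
    return list(struct.unpack(f"<{len(raw) // 2}H", raw))


if __name__ == "__main__":
    name = [0x61, 0xD800, 0x62]  # 'a', an unpaired high surrogate, 'b'
    encoded = utf16_units_to_wtf8(name)
    print(encoded, wtf8_to_utf16_units(encoded) == name)
```

For names that are valid Unicode the output is byte-for-byte ordinary UTF-8, which is what would let "normal" paths hash the same on both platforms.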
We can also normalize path separators, since Windows accepts both.
Unix paths that are not well-formed WTF-8 I suggest we just ban. Do they exist already, say in cache.nix.org?
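A check along these lines could answer that question for a given set of names. This is only a sketch under my reading of the WTF-8 spec (generalized UTF-8 in which a surrogate pair must never appear as two separate 3-byte sequences); is_well_formed_wtf8 is a hypothetical helper, not an existing Nix function.

```python
def is_well_formed_wtf8(name: bytes) -> bool:
    """Return True if the raw bytes of a Unix file name are well-formed WTF-8."""
    try:
        # Accepts ordinary UTF-8 plus 3-byte encodings of lone surrogates;
        # rejects stray bytes, overlong forms, and values above U+10FFFF.
        text = name.decode("utf-8", errors="surrogatepass")
    except UnicodeDecodeError:
        return False
    # A high surrogate directly followed by a low surrogate should have been
    # encoded as a single 4-byte sequence, so reject that case.
    for first, second in zip(text, text[1:]):
        if "\ud800" <= first <= "\udbff" and "\udc00" <= second <= "\udfff":
            return False
    return True


if __name__ == "__main__":
    for candidate in (b"hello.txt", b"a\xed\xa0\x80b", b"\xff\xfe", b"\xed\xa0\xbd\xed\xb8\x80"):
        print(candidate, is_well_formed_wtf8(candidate))
```

Running something like this over the file names in existing store paths would show whether a ban breaks anything already published.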