Open piscisaureus opened 6 years ago
For APIs that take a string for a path wouldn't changing the messages to take byte vectors be redundant as aren't JavaScript strings are guaranteed to be UTF-8, and then to use them as Path
s in Rust you'd have to use String::from_utf8
/String::from_utf8_lossy
which would incur extra unneeded overhead? You couldn't use OsStrExt::from_bytes
as that's only available on Unix because Windows encodes their paths as wide strings (16-bit vectors).
JavaScript strings are guaranteed to be UTF-8
Not really actually, technically javascript uses a fixed length 16-bit encoding known as UCS2 (just like java and windows actually). Usually we (browsers, developers) pretend that it uses UTF16, otherwise there'd be no way to encode smileys. But try this in devtools: "😸😸😸".split('').join("-")
to use them as Paths in Rust you'd have to use String::from_utf8/String::from_utf8_lossy
Conversions between Path
and utf8-encoded string
is actually what I'm saying we should not do! We'll use Path
on the rust side and, at least at the lowest level, Uint8Array on the javascript side.
Windows encodes their paths as wide strings (16-bit vectors).
So does javascript :) But I do agree, we shouldn't present a buffer containing a 16-bit encoded string to the user. Instead, let's convert windows filenames to wtf-8 (https://simonsapin.github.io/wtf-8/). Now there we do have a little redundancy, because we'd do the conversion UCS2 (Path) to WTF-8 (UInt8Array) to UCS2 (JS string). But file access on windows is quite slow anyway, nobody would notice that...
You couldn't use OsStrExt::from_bytes as that's only available on Unix because.
I'm sure we can break into it somehow....
I'm working on this as part of what started as refactoring the is_remote function. We are given UTF-8 encoded strings as arguments so a conversion is inevitable. But from the moment the module resolution starts, they are converted to Rust Paths or Hyper URIs.
This won't win the war, but it's an important battle.
@piscisaureus Thank you for explaining. I don't think that would solve the problem presented in the Node issue as you can't encode a string literal as ISO-8859-1
on the JavaScript side. On the Rust side Path
can only take PathBuf
, String
, str
, OsStr
, and OsString
. String
requires correct UTF-8 or UTF-16, and the only way to create an OsString
from a vector is platform dependent, from_bytes(&[u8])
as I mentioned before on Unix and from_wide(&[u16])
on Windows. The WTF-8 standard is explicitly against being used for interchange, it still requires being emitted as Unicode.
I think the best solution would be to have a facsimile to Rust's OsString
for low level APIs.
table OsString {
unix: [byte];
windows: [ushort];
}
class OsString {
private buffer: UInt8Array | UInt16Array;
constructor(buffer: UInt8Array | UInt16Array) {
if (windows && buffer instanceof UInt8Array)) {
throw new Error("OsString requires UInt16Array on Windows");
} else if (unix && buffer instanceof UInt16Array)) {
throw new Error("OsString requires UInt8Array on Unix platforms");
}
this.buffer = buffer;
}
/* @internal */
into_msg(builder: fbs.Builder): fbs.OsString {
// Convert and return as flatbuffer message
}
}
as you can't encode a string as ISO-8859-1 on the JavaScript side
You can, if you hold it in an ArrayBuffer and do the conversion with TextEncoder/TextDecoder.
Holding it in a Uint8Array (low level JavaScript API) is what is being suggested here if I understand correctly.
@benjamingr Sorry that was explicitly about first two points and for keeping to use Strings/strs for that part of the API and that changing those parts would only add redundancy. Sorry if that was unclear.
- Don't use String/str for paths on the rust side (we should start doing this today).
- Use flatbuffer byte vectors instead of strings to encode paths in messages.
@Aaronepower I already started working on using Paths/URIs instead of Paths on the Rust side. I'm not sure about using byte vectors to encode paths instead of strings. We are going to have to convert paths between textual and canonical representation, this is just shuffling it around to JS's side.
@NewLunarFire You can assume that import paths are strings and unicode encoded (if not ascii) - we will not be supporting importing arbitrarily encoded filenames. This issue is about supporting file system ops for non-utf8 filenames (a more general problem).
Does that mean that String.prototype.normalize() is not going to work as expected?
Reference: #2572 and #2559
I think the solution to this problem will depend on the following questions:
I'd say the most reasonable answers are "yes" and "no", respectively. I propose the following approach:
stat
take either a Path or a string as input.readDir
return Paths as output.What do you think?
I'm not convinced it's useful to support non-unicode filenames. It's a huge complexity for what is a legacy problem at this point. It's very nice to assume paths are strings.
Related: in https://github.com/denoland/deno/pull/4004 we skip non-unicode filenames. I think this is reasonable.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This is a longer term goal.
This was an issue in Node: https://github.com/nodejs/node-v0.x-archive/issues/2387