denoland / deno

A modern runtime for JavaScript and TypeScript.
https://deno.com
MIT License
97.79k stars 5.38k forks source link

Don't assume paths are always valid UTF8 #627

Open piscisaureus opened 6 years ago

piscisaureus commented 6 years ago

This is a longer term goal.

This was an issue in Node: https://github.com/nodejs/node-v0.x-archive/issues/2387

XAMPPRocky commented 6 years ago

For APIs that take a string for a path wouldn't changing the messages to take byte vectors be redundant as aren't JavaScript strings are guaranteed to be UTF-8, and then to use them as Paths in Rust you'd have to use String::from_utf8/String::from_utf8_lossy which would incur extra unneeded overhead? You couldn't use OsStrExt::from_bytes as that's only available on Unix because Windows encodes their paths as wide strings (16-bit vectors).

piscisaureus commented 6 years ago

JavaScript strings are guaranteed to be UTF-8

Not really actually, technically javascript uses a fixed length 16-bit encoding known as UCS2 (just like java and windows actually). Usually we (browsers, developers) pretend that it uses UTF16, otherwise there'd be no way to encode smileys. But try this in devtools: "😸😸😸".split('').join("-")

to use them as Paths in Rust you'd have to use String::from_utf8/String::from_utf8_lossy

Conversions between Path and utf8-encoded string is actually what I'm saying we should not do! We'll use Path on the rust side and, at least at the lowest level, Uint8Array on the javascript side.

Windows encodes their paths as wide strings (16-bit vectors).

So does javascript :) But I do agree, we shouldn't present a buffer containing a 16-bit encoded string to the user. Instead, let's convert windows filenames to wtf-8 (https://simonsapin.github.io/wtf-8/). Now there we do have a little redundancy, because we'd do the conversion UCS2 (Path) to WTF-8 (UInt8Array) to UCS2 (JS string). But file access on windows is quite slow anyway, nobody would notice that...

You couldn't use OsStrExt::from_bytes as that's only available on Unix because.

I'm sure we can break into it somehow....

NewLunarFire commented 6 years ago

I'm working on this as part of what started as refactoring the is_remote function. We are given UTF-8 encoded strings as arguments so a conversion is inevitable. But from the moment the module resolution starts, they are converted to Rust Paths or Hyper URIs.

This won't win the war, but it's an important battle.

XAMPPRocky commented 6 years ago

@piscisaureus Thank you for explaining. I don't think that would solve the problem presented in the Node issue as you can't encode a string literal as ISO-8859-1 on the JavaScript side. On the Rust side Path can only take PathBuf, String, str, OsStr, and OsString. String requires correct UTF-8 or UTF-16, and the only way to create an OsString from a vector is platform dependent, from_bytes(&[u8]) as I mentioned before on Unix and from_wide(&[u16]) on Windows. The WTF-8 standard is explicitly against being used for interchange, it still requires being emitted as Unicode.

I think the best solution would be to have a facsimile to Rust's OsString for low level APIs.

table OsString {
  unix: [byte];
  windows: [ushort];
}
class OsString {
  private buffer: UInt8Array | UInt16Array;

  constructor(buffer: UInt8Array | UInt16Array) {
    if (windows && buffer instanceof UInt8Array)) {
       throw new Error("OsString requires UInt16Array on Windows");
    } else if (unix && buffer instanceof UInt16Array)) {
       throw new Error("OsString requires UInt8Array on Unix platforms");
    }

    this.buffer = buffer;
  }

  /* @internal */
  into_msg(builder: fbs.Builder): fbs.OsString {
     // Convert and return as flatbuffer message
  }
}
benjamingr commented 6 years ago

as you can't encode a string as ISO-8859-1 on the JavaScript side

You can, if you hold it in an ArrayBuffer and do the conversion with TextEncoder/TextDecoder.

Holding it in a Uint8Array (low level JavaScript API) is what is being suggested here if I understand correctly.

XAMPPRocky commented 6 years ago

@benjamingr Sorry that was explicitly about first two points and for keeping to use Strings/strs for that part of the API and that changing those parts would only add redundancy. Sorry if that was unclear.

  • Don't use String/str for paths on the rust side (we should start doing this today).
  • Use flatbuffer byte vectors instead of strings to encode paths in messages.
NewLunarFire commented 6 years ago

@Aaronepower I already started working on using Paths/URIs instead of Paths on the Rust side. I'm not sure about using byte vectors to encode paths instead of strings. We are going to have to convert paths between textual and canonical representation, this is just shuffling it around to JS's side.

ry commented 6 years ago

@NewLunarFire You can assume that import paths are strings and unicode encoded (if not ascii) - we will not be supporting importing arbitrarily encoded filenames. This issue is about supporting file system ops for non-utf8 filenames (a more general problem).

motss commented 5 years ago

Does that mean that String.prototype.normalize() is not going to work as expected?

bartlomieju commented 5 years ago

Reference: #2572 and #2559

seishun commented 4 years ago

I think the solution to this problem will depend on the following questions:

I'd say the most reasonable answers are "yes" and "no", respectively. I propose the following approach:

What do you think?

ry commented 4 years ago

I'm not convinced it's useful to support non-unicode filenames. It's a huge complexity for what is a legacy problem at this point. It's very nice to assume paths are strings.

Related: in https://github.com/denoland/deno/pull/4004 we skip non-unicode filenames. I think this is reasonable.

stale[bot] commented 3 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.