dbuenzli / uutf

Non-blocking streaming Unicode codec for OCaml
http://erratique.ch/software/uutf
ISC License

Support WTF-8? #11

Closed: mroch closed this issue 7 years ago

mroch commented 7 years ago

I'm thinking about using uutf in Flow's JavaScript parser (https://github.com/facebook/flow). JS strings are UTF-16, but also allow unpaired surrogates (see the spec). For example, var x = "\uDC00" is a valid string.

This stupid encoding has been dubbed "WTF-8": https://simonsapin.github.io/wtf-8/

Would you be amenable to a PR adding support for it to uutf? I imagine it would be identical to the existing UTF-8 code, except with the malformed checks removed. I could probably refactor to reuse most of the UTF-8 code.
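To make "UTF-8 with the surrogate check removed" concrete, here is a minimal sketch of an encoder that accepts any code point up to U+10FFFF, surrogates included. The name add_wtf_8 is hypothetical and not part of uutf; the sketch also assumes the caller has already combined well-formed surrogate pairs into supplementary code points, which WTF-8 requires.

```ocaml
(* Hypothetical helper, not part of uutf: append the generalized UTF-8
   encoding of code point [cp] to [buf]. Unlike a strict UTF-8 encoder,
   it does not reject surrogates (0xD800..0xDFFF). *)
let add_wtf_8 buf cp =
  let add b = Buffer.add_char buf (Char.chr b) in
  if cp < 0 || cp > 0x10FFFF then invalid_arg "add_wtf_8"
  else if cp < 0x80 then add cp
  else if cp < 0x800 then begin
    add (0xC0 lor (cp lsr 6));
    add (0x80 lor (cp land 0x3F))
  end
  else if cp < 0x10000 then begin
    (* Surrogates fall in this three-byte range. *)
    add (0xE0 lor (cp lsr 12));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end
  else begin
    add (0xF0 lor (cp lsr 18));
    add (0x80 lor ((cp lsr 12) land 0x3F));
    add (0x80 lor ((cp lsr 6) land 0x3F));
    add (0x80 lor (cp land 0x3F))
  end

(* Convenience wrapper for a single code point. *)
let wtf_8_of_code_point cp =
  let buf = Buffer.create 4 in
  add_wtf_8 buf cp;
  Buffer.contents buf
```

With this sketch, the lone surrogate from the "\uDC00" example encodes to the three bytes ED B0 80, which a strict UTF-8 decoder would reject.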

dbuenzli commented 7 years ago

I'm afraid this won't be possible with the current interface: Uutf's interface is based on the Uchar.t type from the standard library, whose values represent Unicode scalar values.
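The constraint can be checked directly against the standard library. The following sketch uses only Stdlib's Uchar module; is_surrogate is a helper defined here for illustration.

```ocaml
(* Uchar.t covers only Unicode scalar values:
   U+0000..U+D7FF and U+E000..U+10FFFF. *)
let is_surrogate cp = 0xD800 <= cp && cp <= 0xDFFF

let () =
  (* An ordinary scalar value is accepted. *)
  assert (Uchar.is_valid 0x0041);
  (* The lone surrogate from the "\uDC00" example is not. *)
  assert (is_surrogate 0xDC00);
  assert (not (Uchar.is_valid 0xDC00));
  (* Uchar.of_int raises Invalid_argument on a surrogate. *)
  match Uchar.of_int 0xDC00 with
  | _ -> assert false
  | exception Invalid_argument _ -> ()
```

Since a decoder returning Uchar.t has no value to hand back for U+DC00, a WTF-8 decoder would need a different result type, such as a plain int.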

I think the best that could be done would be to have a very clear specification of which Malformed values are returned on WTF-8 input, and maybe a decoding function wtf_8 : [`Malformed of string] -> int, but I'm not sure that's worth the effort and it may be clearer to implement your own decoder.
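A rough sketch of what such a function might look like follows. It is not part of Uutf, and it rests on an assumption that would need exactly the specification mentioned above: that the decoder reports the three bytes of an encoded surrogate together as a single `Malformed value. It returns an int option rather than int so that other malformed sequences can be told apart.

```ocaml
(* Hypothetical helper, not part of Uutf. Assumes the three bytes of an
   encoded surrogate arrive together in one `Malformed value. *)
let wtf_8 : [ `Malformed of string ] -> int option = function
  | `Malformed s when String.length s = 3 ->
      let b0 = Char.code s.[0] in
      let b1 = Char.code s.[1] in
      let b2 = Char.code s.[2] in
      (* Surrogates encode as ED, then A0..BF, then 80..BF. *)
      if b0 = 0xED && b1 land 0xE0 = 0xA0 && b2 land 0xC0 = 0x80
      then
        Some (((b0 land 0x0F) lsl 12)
              lor ((b1 land 0x3F) lsl 6)
              lor (b2 land 0x3F))
      else None
  | `Malformed _ -> None
```

For instance, wtf_8 (`Malformed "\xED\xB0\x80") evaluates to Some 0xDC00, while any other malformed byte sequence yields None.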

dbuenzli commented 7 years ago

That said, I'm not sure where you actually need that. I don't think JavaScript allows you to write files in WTF-8 (and if it does, it should not; WTF-8 is not supposed to be used in text files).

dbuenzli commented 7 years ago

Ping ?

mroch commented 7 years ago

I ended up writing this: https://github.com/facebook/flow/tree/master/src/third-party/wtf8

I'm planning to split it out into an opam module instead of leaving it buried inside Flow; I just need to find time to add build files and whatnot.

Thanks!