ccxvii / mujs

An embeddable Javascript interpreter in C.
http://mujs.com/
ISC License

The length of multi-code point Unicode characters is incorrect. #184

Closed ahaoboy closed 9 months ago

ahaoboy commented 9 months ago
console.log('šŸ˜€'.length) 
// mujs: 1
// qjs: 2
// node: 2

Is mujs considering adding support for multi-code point Unicode characters in strings? This seems to be a feature in ES6.

console.log('\u0061'.length) // 1

// syntax error in mujs (ES5 has no \u{...} escapes):
// console.log('\u{1f600}'.length)
avih commented 9 months ago

Before we get to anything else, the most important question is whether you bumped into a real world use case where it matters?

Please be honest and don't go fishing for use cases now.

This seems to be a feature in ES6.

It's irrelevant to mujs what ES6 supports, because mujs is ES5 only, without plans to make it ES6 or any other version in the future.

However, the behavior is the same in ES5 as well, so let's continue.

The TL;DR is that it's a known "issue", but it's considered to have very few practical implications, if any, and to require too much effort to solve compared to the value it adds, hence it's not currently handled.

I said (quoted) "issue" because technically it's not an issue of mujs. mujs does document what needs to happen so that this issue doesn't arise: all input strings, including source files, should be CESU-8 encoded, and output strings are CESU-8 as well, which the client is expected to convert to whatever encoding it wants, like UTF-8.

If clients (like mpv) followed this interface requirement/definition, then there would be no issue, and a string of one non-BMP codepoint would have a length of 2.

CESU-8 is an encoding similar to UTF-8. It's identical to UTF-8 for BMP codepoints, but non-BMP codepoints are encoded as UTF-16 surrogate pairs, where each surrogate half is encoded (UTF-8-style) as if it were a codepoint of its own.
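To make that concrete, here's a minimal standalone sketch (not mujs code; names are illustrative) of encoding one non-BMP codepoint as CESU-8. For U+1F600 it produces ED A0 BD ED B8 80, whereas plain UTF-8 would be F0 9F 98 80:

#include <stdio.h>

/* Encode one non-BMP codepoint (cp > 0xFFFF) as CESU-8: split it into a
   UTF-16 surrogate pair, then encode each surrogate half as a 3-byte
   UTF-8-style sequence. Writes 6 bytes to out. */
static int cesu8_encode_nonbmp(unsigned long cp, unsigned char *out)
{
    unsigned hi = 0xD800 + ((cp - 0x10000) >> 10);   /* high surrogate */
    unsigned lo = 0xDC00 + ((cp - 0x10000) & 0x3FF); /* low surrogate */
    out[0] = 0xE0 | (hi >> 12);
    out[1] = 0x80 | ((hi >> 6) & 0x3F);
    out[2] = 0x80 | (hi & 0x3F);
    out[3] = 0xE0 | (lo >> 12);
    out[4] = 0x80 | ((lo >> 6) & 0x3F);
    out[5] = 0x80 | (lo & 0x3F);
    return 6;
}

int main(void)
{
    unsigned char buf[6];
    int i, n = cesu8_encode_nonbmp(0x1F600, buf); /* U+1F600, i.e. the emoji above */
    for (i = 0; i < n; i++)
        printf("%02X ", buf[i]); /* prints: ED A0 BD ED B8 80 */
    putchar('\n');
    return 0;
}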

So this CESU-8 interface encoding requires every mujs library client to handle the issue itself, by converting all input strings and source files to CESU-8 and then converting all output strings back from CESU-8 to (typically) UTF-8, instead of it being fixed once on the mujs side.

This is why it's not fixed in mpv: I don't agree with this requirement and consider it an issue of the mujs library's interface design/requirements.

The mujs REPL (the mujs stand-alone binary) suffers from the same issue because, like mpv, it ignores the fact that input strings and source files are pretty much always UTF-8 while the mujs library expects CESU-8.

I'm guessing, but I haven't tested, that mupdf has the same issue.

However, I think this is an unreasonable effort to impose on every client which uses the mujs library, like mpv and the mujs REPL, which is why I continue with the long version, in the hope that mujs improves so that this issue doesn't happen to every client.

Some terminology first:

Is mujs considering adding support for multi-code point Unicode characters in strings?

As @astiob already explained to you very nicely, and as I did earlier, it's not "multi-code point".

It's one Unicode codepoint with a value above U+FFFF. Codepoints U+0000 to U+FFFF form the "Basic Multilingual Plane", or "BMP" for short, and codepoints above that are typically described as "non-BMP".

Again, as was explained before, JS strings are expected to correspond to UTF-16 "units", where the length is the number of UTF-16 units required to encode the string, and one JS "character" is one UTF-16 unit.

BMP codepoints are encoded as a single UTF-16 unit, and therefore their JS length matches the number of codepoints. But non-BMP codepoints are encoded as two UTF-16 units (a surrogate pair), hence the JS string length is expected to be 2 for a string of a single non-BMP codepoint. For example, U+1F600 is the surrogate pair 0xD83D 0xDE00, so 'šŸ˜€'.length should be 2.

The detailed version:

The mujs library stores JS strings as unmodified user-input C strings, because it's a lot simpler to use strcat and the other C library string functions than to convert inputs to UTF-16, work with UTF-16, and then convert back to some C-string encoding for API output strings.

If the input strings are indeed CESU-8, then a non-BMP codepoint (encoded as CESU-8) would indeed have a length of 2 in JS.

However, because mujs library clients (like mpv and the mujs REPL) may not convert to and from CESU-8 on their own, mujs makes an exception and also accepts UTF-8 strings, instead of rejecting them as invalid CESU-8. Specifically, non-BMP codepoints encoded as UTF-8 are allowed in input strings and source files, and are handled reasonably (as a side note, mujs will also accept invalid UTF-8, with a similar effect as described below).

By "reasonably" I mean that the string will be accepted and will not cause an error, and can generally be used normally by scripts (including concatenation with other strings, etc). Additionally, it will remain UTF-8 (or whatever other C-string it was) as output from mujs to the client - e.g. when a JS script does printf(str) and that string goes back to the client to be printed.

So this is why it generally works: UTF-8 in source-file strings remains unmodified in mujs storage and mujs output, and so clients get UTF-8 output from scripts which are saved as UTF-8 files.

However, and here finally is our issue: such a codepoint is counted as one "character" in a JS script, which is non-standard.
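For reference, the behavior is reproducible directly through the C API; a minimal sketch (the \xF0\x9F\x98\x80 escapes are the plain UTF-8 bytes of U+1F600, and with stock mujs this prints 1 where the spec expects 2):

#include <stdio.h>
#include <mujs.h>

int main(void)
{
    js_State *J = js_newstate(NULL, NULL, 0);
    /* the string literal reaches mujs as plain UTF-8, not CESU-8 */
    js_dostring(J, "var n = '\xF0\x9F\x98\x80'.length;");
    js_getglobal(J, "n");
    printf("length: %d\n", js_tointeger(J, -1)); /* 1 with stock mujs */
    js_pop(J, 1);
    js_freestate(J);
    return 0;
}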

So how should it be handled?

One solution is to keep it as is: every mujs library client converts to and from CESU-8 on its own. An ugly solution, and a lot of duplicated effort which no client is likely to make, resulting in the current state: it's effectively always broken, but technically not mujs's fault.

Or maybe it is mujs's fault, because (from the linked CESU-8 Wikipedia page):

CESU-8 is not an official part of the Unicode Standard, because Unicode Technical Reports are informative documents only.[2] It should be used exclusively for internal processing and never for external data exchange.

The other approaches below change the mujs API to be UTF-8 instead of CESU-8, because everyone uses UTF-8 anyway in C strings and script files, and no one uses CESU-8 of their own free will.

One approach is the "Right Thing": convert all input strings from UTF-8 to UTF-16 for inputs and source files, and back to UTF-8 for output strings. This would have a big impact both on the API "edges" (converting every API string between UTF-8 and UTF-16) and also on mujs internals and a lot of string operations implementation. However, it will also simplify some things, like indexing and counting. I don't know how much effort it would be in practice.

Another approach is to not convert input/output strings, and instead change how mujs counts non-BMP UTF-8 codepoints in strings. This would solve the length issue, but it doesn't scale easily, because there are other issues, like splitting a string between the (now virtual) surrogate halves, or fetching one (virtual) surrogate half's value by index; and, more importantly, the result will not be correct UTF-8 if it's naively concatenated back into a JS string which the client then reads.

However, with some effort (while counting, while splitting or indexing, and while concatenating, to make the result valid UTF-8 again where possible), this might be doable. The question is: who makes this effort?
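The counting part alone is cheap; a minimal sketch (illustrative code, not mujs internals) of computing the JS length of a UTF-8-stored string in UTF-16 units:

/* Count the JS string length, in UTF-16 units, of a UTF-8 C string:
   every lead byte counts as 1, except 4-byte sequences (lead byte
   0xF0 or above, i.e. non-BMP), which count as a surrogate pair: 2. */
static int js_length_utf8(const char *s)
{
    int n = 0;
    for (; *s; s++) {
        unsigned char c = (unsigned char)*s;
        if ((c & 0xC0) != 0x80)       /* skip UTF-8 continuation bytes */
            n += (c >= 0xF0) ? 2 : 1;
    }
    return n;
}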

Another approach is that mujs itself converts from UTF-8 to CESU-8 for every input string and source file, and from CESU-8 to UTF-8 for every output string.

This is both simpler than it sounds and harder than it sounds.

It's simpler because many UTF-8 strings are already valid CESU-8: every string which consists only of BMP codepoints. So typically the conversion only checks that there are no non-BMP UTF-8 codepoints, and that's it. The same holds the other way around: if there are only BMP codepoints, then there's no need to convert output strings back from CESU-8, because they're already valid UTF-8.
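That check is a trivial single pass; a sketch (illustrative name, assuming valid UTF-8):

/* Return nonzero if a UTF-8 string contains any non-BMP codepoint,
   i.e. any 4-byte sequence (lead byte 0xF0 or above), which is the
   only case where UTF-8 and CESU-8 differ. */
static int utf8_needs_cesu8(const char *s)
{
    for (; *s; s++)
        if ((unsigned char)*s >= 0xF0)
            return 1;
    return 0;
}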

However, this still needs to be tested on every input string and every output string. Also, a converted UTF-8 output string with non-BMP codepoints likely needs a new memory allocation, and someone needs to manage those allocations and free them eventually, which might not be trivial with some of the output API.

However, I implemented this last approach many years ago, as an external single-header file which, if a library client includes it after mujs.h, magically solves all those issues: converting inputs and source files from UTF-8 to CESU-8, converting outputs the other way around, and managing the newly allocated memory where needed.

I'm attaching my implementation here: mujsutf8.zip

It's also currently available online here as part of my mpv experiments fork

It's not incorporated into mainline mpv because, as I mentioned, I don't think this is a problem clients should have to deal with (by including an additional custom third-party header).

It does have some minor rough edges (e.g. it doesn't use the custom allocator if one was given to js_newstate, and callbacks, if used, need to be wrapped manually) because it's a purely client-side implementation, so it can't operate inside mujs.

But if this became part of mujs itself, then these rough edges could be eliminated. I tried to push it into mujs back then (2019), but there appeared to be no interest in taking it.

I also don't know if this approach is better than the other approaches mentioned earlier (converting to/from UTF-16, or not converting but modifying how counting/concat/indexing is performed with virtual surrogate halves).

However, I have been using it in my own mpv build for about 4 years now, without making any changes to mpv other than including this header after #include <mujs.h>. I have not noticed any issues with this wrapper.

astiob commented 9 months ago

My 2 cents as an outsider, since I've been mentioned:

I concur that creating an API that uses CESU-8 is silly. It is rare; it is easy to misuse by supplying/assuming UTF-8 instead; and the encoding is explicitly a "compatibility encoding" to begin with, that is, it is provided as a patch for backwards-compatible APIs that assumed UCS-2 instead of UTF-16, and it should not be used in new APIs. Many languages/frameworks lack support for it entirely, so whoever wants to use mujs is forced to roll their own implementation.

But if mujs does explicitly declare its API to use CESU-8, then it is even more silly to refuse to do so in mainline mpv, especially when it's already implemented. As it is, mpv is violating its API contract with mujs. The contract may be silly, but violating it is absolutely an mpv bug. If you don't want to honour a contract, don't sign it: don't use mujs.

Even if mujs decides to switch to UTF-8 at some point in the future, it will require new APIs for output or a very clearly communicated API & ABI break, because UTF-8 output will be incompatible with the existing APIs that promise CESU-8 - unless someone can conclusively prove that no consumer ever has actually bothered to implement CESU-8.

(By the way, @avih, I find that your response sends mixed messages as to whether you actually want this to be fixed. It starts out reading as though you don't, and as though you doubt the existence of any issue at all. But then it proceeds to criticize the current API and describe solutions and workarounds. Before I opened GitHub, I suspected this was written by two different people, an mpv developer and a mujs developer, and that some of the leading text was meant to be a quote but the quote markup was missing by accident.)

ahaoboy commented 9 months ago

Before we get to anything else, the most important question is whether you bumped into a real world use case where it matters?

Thank you for your detailed explanation. I apologize for raising a question that goes beyond the scope of mujs design. To be honest, there aren't specific cases at the moment; I simply wish for consistency in these fundamental APIs across different JavaScript engines.

avih commented 9 months ago

It starts out reading as though you don't and as though you doubt the existence of any issue at all

The contract may be silly, but violating it is absolutely an mpv bug.

OP clearly comes from mpv, after it was pointed out in an mpv issue that the length should be 2 but is actually 1. If this issue had been opened in mpv, then I would have acknowledged it and closed it as WONTFIX due to the reasons I mentioned earlier, as silly as you may consider that.

But this was opened at the mujs project and not in mpv. Presumably because the JS in mpv behaves like this, and because mpv uses the mujs library, it was assumed to be a mujs library issue - and not because the mujs REPL binary also behaves the same.

Under this definition, and assuming we accept that CESU-8 is valid for a library API, this is not a mujs library issue, because it's incorrect usage by mpv. We're in agreement on this.

And so, this is how my reply started, but obviously this is not black and white.

I suspected this was written by two different people, an mpv developer and a mujs developer

I am indeed both. I wrote the JS support in mpv, using mujs, when mujs was very young, and as such, I reported many initial issues and even fixed and made some improvements myself in mujs, as well as discussed and suggested various ideas, some of which materialized as mujs improvements.

I was involved with the interface definition as well; IIRC I initially reported this very issue (on IRC, where most of the communication took place), and was there when CESU-8 was documented, sort of retroactively, so that users would have some spec they can work with if they care about such issues - but not because it was a design choice. I was also there when mujs was made to still accept UTF-8 as well, for the cases where clients don't actually do this CESU-8 conversion.

So yes, I'm a bit of both, though less so on the mujs side.

I do want this fixed, and long ago I put a non-trivial amount of time into drafting one approach, even if not necessarily the best one. I don't really care which approach is used, but I also know that while there are certainly use cases where this can bite, they're rare, both in general and specifically in the context of mpv.

However, I still disagree that the API should be CESU-8, for the reasons I stated: duplicated NIH conversion work at the API which no one actually does - not even the mujs REPL or (I'm guessing) mupdf, which was the main reason for mujs.

So that post was an explanation of what's going on and where the problem lies, as well as criticism of the cause which makes it both hard and silly to adhere to, a hope to get it improved, and then some suggestions and solutions. How's that not a good post? :)

unless someone can conclusively prove that no consumer ever has actually bothered to implement CESU-8.

Whether or not changing the API (back) to UTF-8 would affect existing clients certainly needs to be considered, but my bet (because it's impossible to know for a fact) is that no one added a conversion once it was documented as CESU-8, or at any other time, so I'm guessing it's unlikely to be an issue - if that ever happens.

If it becomes a real issue, then this can become an opt-in thing, though IMO that would hurt more than it would help - but let's not solve it before we're there.

ccxvii commented 9 months ago

So back in the 90's, the early adopters of Unicode (Windows, Java, and JavaScript) made a terrible mistake. They assumed that 16 bits would be more than enough to encode all characters in the world. And thus they hardcoded 16-bit characters in their APIs and languages and called it UCS-2. However, it wasn't so, and soon they had to figure out a way around this. Thus was invented the abomination that is UTF-16 and surrogate pairs. Everybody who had just breathed a sigh of relief that they wouldn't need to worry about multi-byte encodings anymore now had to deal with multi-byte encodings again! It was all a lie!

MuJS is basically agnostic to the content of strings -- it just treats them as sequences of UTF-8 encoded integers.

If you do the naive thing and just pass around UTF-8 strings (with codes outside the BMP), things will just behave as you would expect. A character is a character, and you don't need to worry about any of that surrogate pair nonsense.

However, as you can see from the example in this report, this isn't strictly according to spec. The spec requires everyone to bend over backwards to deal with surrogate pairs and the multi-codepoint nonsense that everyone would rather just forget exists.

If your code for some unfathomable reason actually needs this surrogate pair behavior for non-BMP characters, you have the option to use CESU-8 when passing strings to and from MuJS. In this way, all strings will be using surrogate pairs and the Javascript code will look and behave per spec.

We don't do this transformation step automatically in the C API because most users don't care about the details and just want to pass UTF-8 strings back and forth between C and JS. Of those who would care, most might actually want the sane behavior and not have to worry about surrogate pairs in their JS code. And why should these people have to pay the transformation costs of encoding and decoding UTF-8 to UTF-16/CESU-8 internally for no useful reason?

This is why, if you really, really want and need the surrogate pair behavior, you should transform the strings to and from CESU-8 when passing them into MuJS. It's opt-in.

PS.

The ES5 spec doesn't have the String codePointAt functions that try to "fix" the problems the surrogate pairs cause. These functions were introduced much later, and are not entirely satisfactory. Try this in a modern Javascript engine and see how the abstraction leaks:

("\{1f600}").codePointAt(0)
("\{1f600}").codePointAt(1)
avih commented 9 months ago

If your code for some unfathomable reason actually needs this surrogate pair behavior for non-BMP characters, you have the option to use CESU-8 when passing strings to and from MuJS

It's not so unfathomable.

Any scenario which takes arbitrary scripts - like the mujs REPL, mupdf, and mpv - needs the conversion unconditionally, because we have to assume that arbitrary scripts assume the spec, and the spec says that a non-BMP codepoint in a JS string (which typically comes from the client and enters the VM via one of the mujs API calls) is two chars.

It's true that most scripts don't care whether non-BMP occupies one or two elements in a string, but for those which do, the behavior is broken.

And why should these people have to pay the transformation costs of encoding and decoding UTF-8 to UTF-16/CESU-8 internally

One reason is that it's the spec. If you think some people would prefer not to pay this cost because they don't care about this part of the spec, or know that their use cases don't depend on this part of the spec, then maybe it should be an opt-out feature to disable the internal CESU-8/UTF-8 conversion.

The second thing is that it doesn't necessarily involve a conversion, even when the string does include non-BMP chars. The second approach I mentioned, where strings remain UTF-8 inside mujs but counting/indexing/cutting treats non-BMP codepoints as two virtual surrogate halves, might have a very low maximum cost. Though that's a hunch, as I haven't implemented it.

Speaking of my solution approach with that single header (converting between CESU-8 and UTF-8 at the API edges): for the most part it only verifies that there are no non-BMP codepoints in the string - and typically there are none - and that's it.

In the rare cases where a conversion is actually needed, most of the time it doesn't even require a new memory allocation, as an automatic buffer on the stack is typically enough (in that conversion header the default buffer is 256 bytes).

Only when we both need a conversion AND the automatic buffer is not enough does it need an allocation. And even then, the cost is typically still very low compared to the general operation of mujs.

So typically the performance impact is simply negligible.

avih commented 9 months ago

Here's a sample API wrapper, for js_getglobal, which ensures that name is CESU-8 when it calls the wrapped mujs API (u8j_cesu8_len(s) returns 0 when s doesn't need conversion to CESU-8):

U8J_API void u_js_getglobal(js_State *J, const char *name) {
    int cesu8_len = u8j_cesu8_len(name);
    if (cesu8_len == 0) {
        /* fast path: name is already valid CESU-8, call through directly */
        js_getglobal(J, name);
    } else {
        if (cesu8_len < U8J_IMPL_STACK_BUF_SIZ) {
            /* small conversion: use an automatic buffer, no allocation */
            char buf[U8J_IMPL_STACK_BUF_SIZ];
            u8j_write_cesu8(name, buf, cesu8_len);
            name = buf;
            js_getglobal(J, name);

        } else {
            /* large conversion: allocate, and guard with js_try so the
               buffer is freed even if the wrapped call throws */
            char * volatile mem = u8j__malloc(J, cesu8_len + 1);
            u8j_write_cesu8(name, mem, cesu8_len);
            name = mem;
            if (js_try(J)) {
                u8j__free(J, mem);
                js_throw(J);
            }
            js_getglobal(J, name);
            js_endtry(J);
            u8j__free(J, mem);
        }
    }
}

That's the expansion of a macro which is used wholesale to wrap dozens of APIs with the same pattern. Specifically, it's the expansion of

U8J__IN_NON(getglobal, (js_State *J, const char *name), name, (J, name))

where IN means it handles an input string, NON means the API has no return value (there's also the same macro with INT), the second argument is the prototype arguments for the wrapper, the third is the variable name which may need the conversion, and the last is the arguments for calling the wrapped function.
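For illustration, the macro presumably looks roughly like this (a reconstruction from the expansion above, not necessarily the exact definition in the header):

#define U8J__IN_NON(name, proto, str, args)                          \
U8J_API void u_js_##name proto {                                     \
    int cesu8_len = u8j_cesu8_len(str);                              \
    if (cesu8_len == 0) {                                            \
        js_##name args;                                              \
    } else {                                                         \
        if (cesu8_len < U8J_IMPL_STACK_BUF_SIZ) {                    \
            char buf[U8J_IMPL_STACK_BUF_SIZ];                        \
            u8j_write_cesu8(str, buf, cesu8_len);                    \
            str = buf;                                               \
            js_##name args;                                          \
        } else {                                                     \
            char * volatile mem = u8j__malloc(J, cesu8_len + 1);     \
            u8j_write_cesu8(str, mem, cesu8_len);                    \
            str = mem;                                               \
            if (js_try(J)) {                                         \
                u8j__free(J, mem);                                   \
                js_throw(J);                                         \
            }                                                        \
            js_##name args;                                          \
            js_endtry(J);                                            \
            u8j__free(J, mem);                                       \
        }                                                            \
    }                                                                \
}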

So typically the only cost is at the API call by the client, where u8j_cesu8_len(str) is invoked, assuming no conversion is required (not internally while running a script which ends up invoking js_getglobal).

ccxvii commented 9 months ago

Commit 5762138384aae4d5e034dbbd0f514ac2598c4ccf should fix this and expose UTF-8 characters >= U+10000 as UTF-16 surrogate pairs.

avih commented 9 months ago

Commit 5762138 should fix this and expose UTF-8 characters >= U+10000 as UTF-16 surrogate pairs.

Yes, but it looks to me like it only fixes half the problem.

Once such surrogate halves are joined, I don't think this commit converts them back to UTF-8 in the string which the client also sees (e.g. with print(str)).

In other words, if a script creates a non-BMP codepoint from two surrogate halves, it will not end up as UTF-8 for the client.

Do correct me if I'm wrong.

The solution should be that any concatenation of strings (js_concat, Ap_join, and possibly more) normalizes the result to UTF-8.

This can be done in place, because the result is guaranteed to be allocated and owned by mujs, and UTF-8 is guaranteed to never be longer than CESU-8.

It's also O(N) at most, in an operation (concat) that is already O(N).

It can even be optimized by checking that each newly appended segment doesn't begin with a second surrogate half, to avoid the need for a full pass over the result string, though IMO that's not necessary.
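A minimal sketch of such an in-place normalization pass (illustrative code, not from the commit), assuming well-formed input; lone surrogate halves without a partner are simply left as-is:

/* Join CESU-8 surrogate pairs in s back into UTF-8, in place. A pair is
   6 bytes (ED A0..AF xx followed by ED B0..BF xx) and becomes 4 bytes of
   UTF-8, so the result never grows. */
static void cesu8_to_utf8_inplace(char *s)
{
    unsigned char *r = (unsigned char *)s, *w = r;
    while (*r) {
        if (r[0] == 0xED && (r[1] & 0xF0) == 0xA0 && r[2] &&
            r[3] == 0xED && (r[4] & 0xF0) == 0xB0 && r[5]) {
            unsigned hi = 0xD800 | ((r[1] & 0x0F) << 6) | (r[2] & 0x3F);
            unsigned lo = 0xDC00 | ((r[4] & 0x0F) << 6) | (r[5] & 0x3F);
            unsigned long cp = 0x10000 + (((hi - 0xD800) << 10) | (lo - 0xDC00));
            *w++ = 0xF0 | (cp >> 18);
            *w++ = 0x80 | ((cp >> 12) & 0x3F);
            *w++ = 0x80 | ((cp >> 6) & 0x3F);
            *w++ = 0x80 | (cp & 0x3F);
            r += 6;
        } else {
            *w++ = *r++; /* everything else is already valid UTF-8 */
        }
    }
    *w = 0;
}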