JSON with Unicode values can cause parsing errors

danieldaquino commented 7 months ago

Summary

During development of Damus, we noticed an issue with flatc that cause certain flatbuffer values to become NULL when parsing a JSON with Unicode values.

I apologize in advance if this report is missing information, flatc is still somewhat new to me.

Preconditions

flatc version: TBD (@jb55 might have more info on this) Schema: A flatbuffer schema that allows a string value, with certain specific key name size of 4 characters or smaller

Reproduction steps

Minimal failing scenario code

I am working on a failing scenario code for this. Unfortunately I did not have time to make that repro case compile and fully work yet, but hopefully this draft provides some insight as to the nature of the issue: https://github.com/danieldaquino/flatc-unicode-issue-repro

Rough steps

Create a flatbuffer schema that allows a string value, with a key name of 4 characters or shorter (e.g. "name")
Create a JSON where the value of such key is non-english unicode text (e.g. "ひらがな" in UTF-8 encoding)
Try to parse this JSON into a flatbuffer
Check the value of "name" in the flatbuffer object. Should be "ひらがな", but if the issue is present it will show as NULL

Detailed root cause information

Note: I will present snippets of Damus code (not the minimal failing case) to more accurately present what we found.

Suppose the program is at this line (marked by an arrow) of the generated JSON parsing code ((...)_parse_json_table):

if (w == 0x646973706c61795f) { /* descend "display_" */
                        buf += 8;
---->               w = flatcc_json_parser_symbol_part(buf, end);
                        if ((w & 0xffffffff00000000) == 0x6e616d6500000000) { /* "name" */

static inline uint64_t flatcc_json_parser_symbol_part(const char *buf, const char *end)
{
    size_t n = (size_t)(end - buf);

#if FLATCC_ALLOW_UNALIGNED_ACCESS
    if (n >= 8) {
        return be64toh(*(uint64_t *)buf);
    }
#endif
---->    return flatcc_json_parser_symbol_part_ext(buf, end);
}

static inline uint64_t flatcc_json_parser_symbol_part_ext(const char *buf, const char *end)
{
    uint64_t w = 0;
    size_t n = (size_t)(end - buf);

    if (n > 8) {
        n = 8;
    }
    /* This can bloat inlining for a rarely executed case. */
#if 1
    switch (n) {
    case 8:
---> w |= ((uint64_t)buf[7]) << (0 * 8);
        goto lbl_n_7;
    case 7:
lbl_n_7:
        w |= ((uint64_t)buf[6]) << (1 * 8);
        goto lbl_n_6;
    case 6:
lbl_n_6:
        w |= ((uint64_t)buf[5]) << (2 * 8);
        goto lbl_n_5;
    case 5:
lbl_n_5:
        w |= ((uint64_t)buf[4]) << (3 * 8);
        goto lbl_n_4;
    case 4:
lbl_n_4:
        w |= ((uint64_t)buf[3]) << (4 * 8);
        goto lbl_n_3;
    case 3:
lbl_n_3:
        w |= ((uint64_t)ubuf[2]) << (5 * 8);
        goto lbl_n_2;
    case 2:
lbl_n_2:
        w |= ((uint64_t)ubuf[1]) << (6 * 8);
        goto lbl_n_1;
    case 1:
lbl_n_1:
        w |= ((uint64_t)ubuf[0]) << (7 * 8);
        break;
    case 0:
        break;
    }
#else
    /* But this is hardly much of an improvement. */
    {
        size_t i;
        for (i = 0; i < n; ++i) {
            w <<= 8;
            if (i < n) {
                w = buf[i];
            }
        }
    }
#endif
    return w;
}

Supposed that the text being parsed at the cursor is: name":"ひらがな(...)

In that case, buf[7] will be the first byte of ひ, which will be 0xE3 (I believe, or some other value where the Most significant bit is 1 as it is a Unicode character).

This value, being char, is interpreted as a signed value, which evaluates to -29 decimal.

When the program casts this 8-bit value into a 64-bit value, it ends up padding it with 0xFs (i.e. 0xFFF...E3)

Due to the way flatcc_json_parser_symbol_part_ext is currently forming the word in w, w will end up as 0xFFF...E3 no matter what the other bytes are.

Now, when the program reaches this line:

---> if ((w & 0xffffffff00000000) == 0x6e616d6500000000) { /* "name" */
                 buf = flatcc_json_parser_match_symbol(ctx, (mark = buf), end, 4);
                            if (mark != buf) {
                                buf = flatcc_json_parser_build_string(ctx, buf, end, &ref);

w will be 0xffffffffffffffe3, therefore it will not match 0x6e616d6500000000 ("name"), and thus the value will be discarded as if the JSON did not have the "name" value at all.

Potential fix

On our program, we fixed this by casting the char bytes into unsigned 8-bit values on flatcc_json_parser_symbol_part_ext.

That causes the values to be correctly padded with 0x0s when transformed into 64-bit values

Here is the diff that worked in our case: https://github.com/damus-io/damus/pull/1673/commits/3f436cd60ceca2b52a8189f8953e67e993394f61

Other notes

@jb55 and @kunigaku might also have more information and insights on this. @jb55 mentioned he might have a fix that can be applied to flatc. Kudos to @kunigaku for finding the root cause and proposing the fix.

Hopefully this is enough info, but please let me know if you have more questions 🙏

mikkelfj commented 7 months ago

Thanks for reporting this and suggesting a fix.

Could you please review the temporary branch https://github.com/dvidelabs/flatcc/tree/json-unicode It adds uint8_t casts to avoid sign extension as suggested. I have not tested, but it does pass regular tests. Also, if possible, please review code to see if there might be other similar cases.

danieldaquino commented 7 months ago

Thank you @mikkelfj! It looks like it will solve the problem.

@jb55, do you have any other thoughts on @mikkelfj's fix? (88e2bc13f5fc6b40405091e6eff449fe0679b1cf)

jb55 commented 7 months ago

On Mon, Nov 20, 2023 at 10:47:12AM -0800, Daniel D’Aquino wrote:

Thank you @mikkelfj! It looks like it will solve the problem.

@jb55, do you have any other thoughts on @mikkelfj's fix? (88e2bc13f5fc6b40405091e6eff449fe0679b1cf)

looks fine, I haven't had time to test it yet.

mikkelfj commented 5 months ago

I'm going to close this assuming it works. If not feel free to reopen.

dvidelabs / flatcc