Closed danieldaquino closed 5 months ago
Thanks for reporting this and suggesting a fix.
Could you please review the temporary branch https://github.com/dvidelabs/flatcc/tree/json-unicode? It adds uint8_t casts to avoid sign extension, as suggested. I have not tested it against this case, but it does pass the regular tests. Also, if possible, please review the code to see if there might be other similar cases.
Thank you @mikkelfj! It looks like it will solve the problem.
@jb55, do you have any other thoughts on @mikkelfj's fix? (88e2bc13f5fc6b40405091e6eff449fe0679b1cf)
On Mon, Nov 20, 2023 at 10:47:12AM -0800, Daniel D’Aquino wrote:
> Thank you @mikkelfj! It looks like it will solve the problem.
> @jb55, do you have any other thoughts on @mikkelfj's fix? (88e2bc13f5fc6b40405091e6eff449fe0679b1cf)
looks fine, I haven't had time to test it yet.
I'm going to close this assuming it works. If not, feel free to reopen.
Summary
During development of Damus, we noticed an issue with flatc that causes certain flatbuffer values to become NULL when parsing JSON with Unicode values.
I apologize in advance if this report is missing information, flatc is still somewhat new to me.
Preconditions
flatc version: TBD (@jb55 might have more info on this)
Schema: A flatbuffer schema that allows a string value, with a key name of 4 characters or fewer
Reproduction steps
Minimal failing scenario code
I am working on a minimal failing scenario for this. Unfortunately I have not yet had time to make that repro case compile and fully work, but hopefully this draft provides some insight into the nature of the issue: https://github.com/danieldaquino/flatc-unicode-issue-repro
Rough steps
NULL
Detailed root cause information
Note: I will present snippets of Damus code (not the minimal failing case) to more accurately present what we found.
Suppose the program is at this line (marked by an arrow) of the generated JSON parsing code ((...)_parse_json_table):
Suppose that the text being parsed at the cursor is:
name":"ひらがな(...)
In that case, buf[7] will be the first byte of ひ, which will be 0xE3 (I believe, or some other value where the most significant bit is 1, as it is a Unicode character).
This value, being char, is interpreted as a signed value, which evaluates to -29 decimal.
When the program casts this 8-bit value into a 64-bit value, it ends up padding it with 0xFs (i.e. 0xFFF...E3).
Due to the way flatcc_json_parser_symbol_part_ext is currently forming the word in w, w will end up as 0xFFF...E3 no matter what the other bytes are.
Now, when the program reaches this line:
w will be 0xffffffffffffffe3, therefore it will not match 0x6e616d6500000000 ("name"), and thus the value will be discarded as if the JSON did not have the "name" value at all.
Potential fix
In our program, we fixed this by casting the char bytes into unsigned 8-bit values in flatcc_json_parser_symbol_part_ext.
That causes the values to be correctly padded with 0x0s when transformed into 64-bit values.
Here is the diff that worked in our case: https://github.com/damus-io/damus/pull/1673/commits/3f436cd60ceca2b52a8189f8953e67e993394f61
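The sign extension described above, and the effect of the unsigned cast, can be reproduced in isolation. The following is only a minimal sketch and not the actual flatcc code: pack_word_signed, pack_word_unsigned and sample_key are hypothetical names, and the loop is an assumed simplification of how the bytes are packed into w. It uses signed char explicitly because the signedness of plain char is implementation-defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sample input: the bytes of  name":"  followed by 0xE3,
 * the first UTF-8 byte of the character that follows. As a signed
 * 8-bit value, 0xE3 is -29. */
static const signed char sample_key[8] = {
    'n', 'a', 'm', 'e', '"', ':', '"', -29 /* 0xE3 */
};

/* Assumed simplification of the buggy packing: each byte is widened to
 * 64 bits while still signed, so any byte >= 0x80 is sign-extended and
 * sets every bit above it once it is OR-ed into w. */
static uint64_t pack_word_signed(const signed char *buf, size_t n)
{
    uint64_t w = 0;
    size_t i;

    assert(buf != NULL);
    if (n > 8) {
        n = 8;
    }
    for (i = 0; i < n; ++i) {
        /* buf[0] ends up in the most significant byte of w. */
        w |= ((uint64_t)buf[i]) << ((7 - i) * 8);
    }
    return w;
}

/* Same packing, but each byte goes through uint8_t first, so it is
 * zero-extended and only affects its own 8 bits of w. */
static uint64_t pack_word_unsigned(const signed char *buf, size_t n)
{
    uint64_t w = 0;
    size_t i;

    assert(buf != NULL);
    if (n > 8) {
        n = 8;
    }
    for (i = 0; i < n; ++i) {
        w |= ((uint64_t)(uint8_t)buf[i]) << ((7 - i) * 8);
    }
    return w;
}
```

With sample_key as input, pack_word_signed(sample_key, 8) yields 0xffffffffffffffe3, matching the value of w reported above. pack_word_unsigned(sample_key, 8) yields 0x6e616d65223a22e3, whose upper four bytes equal 0x6e616d6500000000 ("name") once the lower four bytes are masked off.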
Other notes
@jb55 and @kunigaku might also have more information and insights on this. @jb55 mentioned he might have a fix that can be applied to flatc. Kudos to @kunigaku for finding the root cause and proposing the fix.
Hopefully this is enough info, but please let me know if you have more questions 🙏