dvidelabs / flatcc

FlatBuffers Compiler and Library in C for C
Apache License 2.0
631 stars 180 forks source link

The comment contains UTF8 characters. #267

Closed xianglin1998 closed 8 months ago

xianglin1998 commented 9 months ago

If the comment contains UTF8 characters, an error 'unexpected control character in comment' will be reported.

image

Debug is ultimately traced back to the lex_emit_comment_ctrl macro definition, commenting out the code that was judged to be empty can temporarily work.

image

I tried Google's flat compiler, and they can support comments on UTF8 encoding very well. Enabling comments to support UTF8 encoding can enable flatcc to support more flatbuffers IDL files. Is there an opportunity to embrace UTF8 coding comments with as few changes as possible?

Thank you to the contributors of Flatcc.

mikkelfj commented 8 months ago

Thanks for this report. I has been a long time since I wrote this code so I need to understand what is happening more precisely.

I will definitely not rule out a bug in the parser. But it could also be that you have invalid control characters in the comment. UTF-8 does not normally contain characters codes 0..31 decimal unless they are ASCII control characters like tab or newline.

Can you provide an example with exactly the comment triggering the problem?

My guess is that the parser consumes pseudo UTF-8: ASCII UTF: anything above 32 decimal or space (tab or CR/LF). Non-ASCII UTF-8: anothing above 127 followed by more characters above 127. This matches a superset of UTF-8 that is easy to parse.

Control characters 0-31 should only appear as ASCII characters and not in some other UTF-8 character sequence.

The parser chooses to reject control characters in comments because they are not printable (in any language).

xianglin1998 commented 8 months ago

Such as: // 这是一段注释

xianglin1998 commented 8 months ago

You can copy this comment: // 导入

Convert '导' to hex is 0xe5...

The isblank can't accept 0xe5, because that are not \t and space char, so, it'll break parse.

xianglin1998 commented 8 months ago

The comment line is Chinese, you can try to convert Chinese to hex. to trace back this problem.

mikkelfj commented 8 months ago

I think it is 0xe8 not 0xe5 from the sample you gave above. Either way, this should not trigger a control character error, so I will look into this.

Being danish, I can also reproduce with:

// æøå while // abc does not trigger any error (as expected).

xianglin1998 commented 8 months ago

I convert '导入' to hex is 0xE5AFBCE585A5

mikkelfj commented 8 months ago

Strange, something changed in copy paste. It really isn't important here, but FYI:

cat sample.fbs
// 这是一段注释

flatcc-issue-267 flatcc-issue-267*​ ❯ hexdump -C sample.fbs
00000000 2f 2f 20 e8 bf 99 e6 98 af e4 b8 80 e6 ae b5 e6 |// .............| 00000010 b3 a8 e9 87 8a 0a |......| 00000016

mikkelfj commented 8 months ago

Ah, it was just two different text examples. No worries.

mikkelfj commented 8 months ago

Please test on latest master branch. There were two issues: comment_ctrl means either a comment or a control character, so the correct check should be

#define lex_emit_comment_ctrl(pos)                                          \
    if (lex_isblank(*pos) || !lex_isctrl(*pos)) {                           \
        push_comment((fb_parser_t*)context, pos, pos + 1);                  \
    } else {                                                                \
        push_token((fb_parser_t*)context, LEX_TOK_CTRL,                     \
                pos, pos + 1);                                              \
    }

but lex_isctrl also had a missing unsigned cast, so luther.c needed to fix lex_isctrl too (and some other similer macros):

#define lex_isctrl(c) (((unsigned)c) < 0x20 || (c) == 0x7f)
xianglin1998 commented 8 months ago

Perfect, i'll test it now.

xianglin1998 commented 8 months ago

After testing, the latest patch has worked very well. Thank you for your contribution.

xianglin1998 commented 8 months ago

This issue has been fixed, let me close it. Haha.