MikePopoloski / slang

SystemVerilog compiler and language services
MIT License

unpredictable behavior when code contains Japanese comments (apparently encoded in Shift-JIS) #1054

Closed: udif closed this issue 1 month ago

udif commented 1 month ago

Describe the bug
Apparently slang is confused by Japanese comments encoded in Shift-JIS.

To Reproduce
Unfortunately I don't yet have a minimal example based on non-proprietary code.

Additional context
I ran slang against some proprietary code from Japan and it failed with errors such as port 'XYZ' has no I/O member declaration. Note that all identifiers are in English; only the comments are in Japanese. In addition, the error occurs only when the file is encoded as Shift-JIS. If I reencode the same file in UTF-8, there are no errors:

iconv -f SHIFT-JIS -t UTF-8 jpn_shiftJIS_err.v >jpn_utf8_noerr.v

When I reencode the UTF-8 file back into Shift-JIS, the error returns:

iconv -t SHIFT-JIS -f UTF-8 jpn_utf8_noerr.v >jpn_back_to_shiftJIS.v

I suspect that slang fails to detect the end of a line when the newline falls in a position that, because of the Shift-JIS encoding, is interpreted as the middle of a UTF-8 codepoint. Needless to say, the code passes Xcelium (zero warnings) and, according to the team who wrote it, also passes SpyGlass.

MikePopoloski commented 1 month ago

This is not surprising to me; slang only supports UTF-8 inputs. Giving it text in any other encoding is likely to break something. Extending slang to support other encodings would be a fair amount of work.

udif commented 1 month ago

I took a look at lexer.cpp, more specifically scanLineComment(), scanBlockComment(), scanUTF8Char(), and utf8Decode(). Given that:

  1. All UTF-8 filtering is done inside slang (no external libraries).
  2. UTF-8 is not really mentioned in IEEE 1800 (at least, searching the PDF for "UTF" turned up zero hits).
  3. Every byte of a multibyte UTF-8 codepoint has 1 as its most significant bit.
  4. (Specifically for my test case) In legal Shift-JIS, a 0x0a byte can only appear as a standalone newline; the 2nd byte of a multibyte character can never be 0x0a.

I think what this implies is that if a 0x0a byte is detected in the middle of a candidate multibyte UTF-8 codepoint, it can safely be identified as a newline, with no risk of misinterpreting a legal UTF-8 codepoint. I will try running it under a debugger to see what the exact cause of the failure is, and I'm willing to try writing a PR to fix it, perhaps behind a new command-line flag if you prefer. Something along the lines of the sketch below is what I have in mind.
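
(An untested sketch, just to illustrate what I mean; the function name and the sample bytes are made up, and this is not slang's actual code. The key point is that 0x0a can never be a continuation byte of a well-formed UTF-8 sequence, so stopping at it is always safe.)

#include <cstdint>
#include <cstdio>

// Given a lead byte that claims an N-byte UTF-8 sequence, consume at most N
// bytes, but stop early at any byte that is not a valid continuation byte,
// in particular at 0x0a, so an embedded newline is never swallowed.
static size_t consumeCandidateUtf8(const uint8_t* p, size_t avail) {
    size_t expected = 1;
    if ((p[0] & 0xe0) == 0xc0) expected = 2;
    else if ((p[0] & 0xf0) == 0xe0) expected = 3;
    else if ((p[0] & 0xf8) == 0xf0) expected = 4;

    size_t i = 1;
    while (i < expected && i < avail && (p[i] & 0xc0) == 0x80)
        i++;              // valid continuation byte (0b10xxxxxx)
    return i;             // stops before 0x0a, which is 0b00001010
}

int main() {
    // Example pattern only: a two-byte non-UTF-8 character whose lead byte
    // (0xe0) looks like a 3-byte UTF-8 lead, followed by a newline.
    const uint8_t text[] = { 0xe0, 0x9f, 0x0a, 'e', 'n', 'd' };
    size_t n = consumeCandidateUtf8(text, sizeof(text));
    std::printf("consumed %zu bytes; next byte is 0x%02x (newline preserved)\n",
                n, text[n]);
    return 0;
}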

MikePopoloski commented 1 month ago

Hmm, I'm skeptical of hacking this in specifically just for newlines and just for Shift-JIS. I think there's a reasonably straightforward change, which is to add an option to run iconv (and the Windows equivalent) internally when buffers are loaded into memory. That doesn't deal with slang's outputs, but assuming you don't care that all outputs are in UTF-8 form, maybe it's sufficient.

You're right that the LRM doesn't say much about encoding. It basically says all SystemVerilog text is ASCII, and slang supports full UTF-8 since it's a pure extension of ASCII. I'm not against having a --input-charset option.
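
Roughly what I have in mind, as an untested sketch (assuming a POSIX system with a glibc-style <iconv.h>; the function name is made up and this isn't how slang loads buffers today), would be to transcode the buffer right after it's read from disk:

#include <iconv.h>
#include <cerrno>
#include <stdexcept>
#include <string>

// Illustrative only: convert a loaded source buffer from 'fromCharset'
// (e.g. "SHIFT-JIS") to UTF-8 before handing it to the lexer.
std::string convertToUtf8(const std::string& input, const char* fromCharset) {
    iconv_t cd = iconv_open("UTF-8", fromCharset);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("unsupported charset");

    std::string output;
    char buf[4096];
    char* inPtr = const_cast<char*>(input.data());
    size_t inLeft = input.size();

    while (inLeft > 0) {
        char* outPtr = buf;
        size_t outLeft = sizeof(buf);
        size_t rc = iconv(cd, &inPtr, &inLeft, &outPtr, &outLeft);
        output.append(buf, sizeof(buf) - outLeft);   // keep whatever converted
        if (rc == (size_t)-1 && errno != E2BIG) {    // E2BIG: output full, loop
            iconv_close(cd);
            throw std::runtime_error("invalid byte sequence in input");
        }
    }
    iconv_close(cd);
    return output;
}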

udif commented 1 month ago

(Sorry, I didn't mean for it to, but this ended up a bit longer than I planned.)

I see your point about using iconv, but my claim is that checking for newlines inside invalid UTF-8 sequences is a minimal change, is safe (it does not make things worse), and is different (but sometimes better) when it comes to displaying error messages.

Safety

We can divide codepages into groups:

  1. 7-bit ASCII. This is safe both before and after the change.
  2. 8-bit extended ASCII (old DOS-era codepages for international languages). Interpreting this as UTF-8 could skip newlines if 0b11xxxxxx bytes are used. However, detecting newlines would safely terminate single-line comments.
  3. Fixed-length multibyte encodings such as UTF-16. These are irrelevant since they cannot represent Verilog code anyhow; we are only interested in codepages that allow ASCII for code plus an international language for comments.
  4. Variable-length multibyte codepages that guarantee no control characters (bytes below 0x20) appear in multibyte sequences (as far as I can tell, Shift-JIS falls into this category). Interpreting this as UTF-8 can fail: a 0b1110xxxx byte would be read as the start of a 3-byte UTF-8 sequence, while the actual codepage treats it as the start of a 2-byte character followed by a newline, which would then be missed.
  5. Variable-length multibyte codepages that may contain embedded control characters below 0x20. (I'm not actually aware of such an encoding; I searched common Chinese codepages and came up with GB 18030, which also allows bytes below 0x7f only as the 1st byte.) If such a codepage exists, UTF-8 detection would fail even if we check for newlines within UTF-8 codepoints, but then we are no worse off than we were in the first place.

To summarize, checking for end-of-line in the middle of invalid UTF-8 characters is always better than, or at least no worse than, the current situation (you cannot have any byte below 0x80 within a multibyte UTF-8 character).

Error messages

As for displaying error messages: if this is the user's own code, their terminal is probably already configured for that encoding, and converting from it to UTF-8 might make the situation worse for them. If this is just imported IP code in a foreign language, I probably wouldn't benefit from the comment anyhow.

Goals achieved

  1. Be on par with commercial tools (you always get that "we don't see any problems with that code!" from your coworkers).
  2. Minimal changes.
  3. Minimal runtime impact: the check runs only when an illegal UTF-8 sequence is detected.
  4. Minimal configuration changes (with iconv you have to figure out which codepage is actually in use, and you may also want an option to limit iconv to the subset of files that need it).

If you still prefer the iconv approach (I think it's a bit of overkill), I'll go with it, though.

Current debugging state: I minimized my test case to 4 lines (2 input ports on an empty module), but I still need to check that the comments are generic enough to be posted. I'll have to iconv it to UTF-8 and use Google Translate to check the result.

I also started debugging it, stopping inside scanLineComment when a newline is received or when an illegal UTF-8 sequence is detected. I noticed that at some point it simply skips to the end of the file, but I haven't found the exact issue yet.

udif commented 1 month ago

One more thing: my suggestion above doesn't require you to know in advance which codepage your file uses. With Far East languages there are so many different codepages that you can't easily tell which one it is, especially if you don't speak the language. So not only do you have to know the codepage in advance and tell the tool about it; if you have multiple files in different languages, you also need to be able to indicate which file uses which encoding.

Here is a minimized example I'm comfortable sharing:

cat - | xxd -r >t41h.v
00000000: 6d6f 6475 6c65 2049 3b0a 2f2f 8393 834f  module I;.//...O
00000010: 0a2f 2f82 a982 e782 cc83 4c83 6283 4e90  .//.......L.b.N.
00000020: 4d8d 8628 935d 9197 8a4a 8e6e 290a 656e  M..(.]...J.n).en
00000030: 646d 6f64 756c 650a                      dmodule.

The issue with this file is not the Unicode warnings, but the failure to detect the endmodule keyword.

MikePopoloski commented 1 month ago

I see what you're saying; just modify the utf8 failure path to count some of the bytes as newlines. I think that's fine if you want to give it a shot.

udif commented 1 month ago

I gave this some time today.
It seems that the example I gave has a different failure mechanism. The UTF-8 warnings are counted as lexer errors, and after a certain threshold is met (options.maxErrors on Lexer.cpp:237), parsing is aborted. I already have a patch for my original failure mechanism, but it seems that is not what triggers this specific failure.
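
To illustrate the mechanism (not slang's actual code; the names and the limit value here are made up), the cap looks roughly like this, and every Shift-JIS byte that fails UTF-8 validation inside a comment counts toward it, which is why the lexer never gets to endmodule:

#include <cstdint>

// Illustrative only; slang's real check sits around options.maxErrors in
// Lexer.cpp. The point is just that diagnostics are capped.
struct LexerOptionsSketch { uint32_t maxErrors = 64; };   // made-up limit

struct LexerSketch {
    LexerOptionsSketch options;
    uint32_t errorCount = 0;

    // Returns false once the error cap is exceeded; the caller then stops
    // lexing real tokens for the rest of the buffer.
    bool onInvalidUtf8() { return ++errorCount <= options.maxErrors; }
};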

udif commented 1 month ago

While the PR above passes the unit test I added and handles the test file above, it does not solve my real-life issue ☹️ Running the full-length file with -Wno-invalid-source-encoding and the PR applied, I still get errors. I will use the PR to trim down the file again and find the new failure mode. Regardless, I still think this PR can be applied.

udif commented 1 month ago

OK, the previous PR was simply incorrect. In the case of a malformed 3-byte UTF-8 sequence with a newline as the 2nd byte, the token length was not correctly reduced to 1. I added an additional test, which passes with the modified fix, and so does my original file where I first encountered this.
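
Roughly, the corrected handling looks like this (an illustrative sketch only, not the actual patch; the real fix lives in the lexer's comment scanning and uses slang's own helpers):

#include <cstddef>
#include <cstdint>

// Illustrative only: scan the text of a // comment and stop at a newline even
// when that newline byte appears inside a malformed multibyte UTF-8 sequence.
// Returns the length of the comment token (everything before the newline).
static size_t scanLineCommentText(const uint8_t* text, size_t length) {
    size_t i = 0;
    while (i < length) {
        uint8_t b = text[i];
        if (b == '\n')
            return i;                    // normal end of comment
        if (b < 0x80) {
            i++;                         // plain ASCII byte
            continue;
        }
        // Non-ASCII: how many bytes does the UTF-8 lead byte claim?
        size_t expected = 1;
        if ((b & 0xe0) == 0xc0) expected = 2;
        else if ((b & 0xf0) == 0xe0) expected = 3;
        else if ((b & 0xf8) == 0xf0) expected = 4;

        // Consume the claimed bytes, but never step over an embedded newline.
        // This is the case the first patch got wrong: a malformed 3-byte
        // sequence with '\n' as its 2nd byte must end the comment right there,
        // so only 1 byte (the bad lead) is consumed instead of 3.
        size_t j = 1;
        while (j < expected && i + j < length && text[i + j] != '\n')
            j++;
        i += j;
    }
    return i;
}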

MikePopoloski commented 1 month ago

Is this issue fully resolved now by the PR that was landed?

udif commented 1 month ago

Yes!