Open Quuxplusone opened 6 years ago
Bugzilla Link | PR38870 |
Status | NEW |
Importance | P enhancement |
Reported by | James Y Knight (jyknight@google.com) |
Reported on | 2018-09-07 11:46:12 -0700 |
Last modified on | 2018-09-07 14:56:29 -0700 |
Version | unspecified |
Hardware | PC Linux |
CC | llvm-bugs@lists.llvm.org, richard-llvm@metafoo.co.uk |
I think the first two categories make a lot of sense to warn about in the
compiler. (I suspect we don't need the full table, though, since we only need
to warn on characters that are both accepted as identifier characters by some
supported language and not classified as ID_Continue by Unicode; I'd hope
that's a substantially shorter list.)
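As a rough illustration of that check, the sketch below flags a character only when some language accepts it in identifiers and Unicode does not classify it as an identifier-continue character. Python's standard library does not expose ID_Continue directly, so `is_xid_continue` relies on `str.isidentifier`, which follows the closely related XID_Continue property. `toy_accepts` is a made-up stand-in for a language's real table of allowed characters, not the C or C++ list.

```python
def is_xid_continue(ch: str) -> bool:
    """Approximate ID_Continue using Python's identifier rules (XID_Continue)."""
    # Prefixing a known-good start character means only `ch` is tested
    # as a continuation character.
    return ("a" + ch).isidentifier()


def needs_warning(ch: str, accepted_by_language) -> bool:
    """True when a language accepts `ch` but Unicode does not recommend it."""
    return accepted_by_language(ch) and not is_xid_continue(ch)


def toy_accepts(ch: str) -> bool:
    # Hypothetical language rule: letters, digits, underscore, plus
    # ZERO WIDTH SPACE (U+200B) to model an overly permissive standard.
    return ch == "_" or ch.isalnum() or ch == "\u200b"
```

With these definitions, `needs_warning("\u200b", toy_accepts)` is true, while `needs_warning("é", toy_accepts)` is false.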
The third one seems like a stylistic warning, though, and I think it belongs in
clang-tidy or another style-checker instead (I could equally imagine someone
wanting a check that all identifiers declared outside of system headers are in
some other character set, and it doesn't seem appropriate to give Latin1
special treatment).
I'd prefer to have a specific warning for the special case of invisible
characters, just as we have a specific warning for the special case of
homoglyphs that look like operators in some fonts, mostly so that we can give a
more precise diagnostic. I've added that in r341700; the broader issue of
warning on non-ID_Continue characters remains.
The -Wflag=value style is not really something we do in Clang warning flags
(except for GCC compatibility); we'd want a different name for the second flag.
It would also make sense to warn on identifiers that are not in some specific normalization form (perhaps NFC?). The C and C++ rules require us to treat normalized and non-normalized versions of the same identifier as distinct(!), so we can't normalize as part of forming the identifier if we want to conform to those rules.
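To make the normalization concern concrete, here is a small sketch using Python's `unicodedata` module. It shows two spellings of the same visible name that compare unequal as code point sequences, and a hypothetical helper, `non_nfc_identifiers`, that reports names not already in NFC. The helper name and the sample identifiers are illustrative assumptions, not anything taken from Clang.

```python
import unicodedata


def non_nfc_identifiers(identifiers):
    """Return the identifiers whose spelling is not already in NFC."""
    return [
        name for name in identifiers
        if unicodedata.normalize("NFC", name) != name
    ]


# The same visible name, spelled two ways.
composed = "caf\u00e9"     # 'é' as a single code point (U+00E9)
decomposed = "cafe\u0301"  # 'e' followed by COMBINING ACUTE ACCENT (U+0301)

# A compiler that does not normalize sees two distinct identifiers.
distinct = composed != decomposed
flagged = non_nfc_identifiers([composed, decomposed])
```

Here `distinct` is true and `flagged` contains only the decomposed spelling, which is the one such a warning would point at.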
What method did you use to generate the list in r341700? I think there are substantially more likely-invisible characters than are listed there (for example, the entire "TAG" block, E0000..E007F).
For that matter, there are many more symbols that look like ASCII symbols and are not listed in the current homoglyph list.
It doesn't seem clear to me that maintaining a full set of homoglyph/invisible characters by hand would really be worthwhile, versus taking the Unicode recommendations, which almost entirely supersede it.
The only entry in the current homoglyph/invisibles list which won't be excluded by an ID_Continue check is "LATIN LETTER RETROFLEX CLICK" (because it is a "letter" in the Latin script).
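The retroflex click case can be checked with a short sketch. It uses `unicodedata` for the name and general category, and `str.isidentifier` as a stand-in for ID_Continue (Python actually follows XID_Continue, which is an assumption worth keeping in mind). The `describe` helper is hypothetical.

```python
import unicodedata


def describe(ch: str):
    """Return (name, general category, usable as an identifier continuation)."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        ("a" + ch).isidentifier(),  # approximates ID_Continue via XID_Continue
    )


click = describe("\u01c3")           # looks like '!' in many fonts
exclamation = describe("!")
greek_question = describe("\u037e")  # looks like ';' in many fonts
```

The click is category `Lo` (a letter), so it passes the identifier check, while `!` and the Greek question mark are both punctuation (`Po`) and are rejected.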
(In reply to James Y Knight from comment #3)
> What method did you use to generate the list in r341700? I think there are
> substantially more likely-invisible characters than are listed there (for
> example, the entire "TAG" block, E0000..E007F).
It's all characters in category Cf, excluding characters used for language-
specific purposes (such as ARABIC LETTER MARK and various bidirectional
markers) and the TAG block (whose contents are not actually invisible
characters in general, and act more like combining characters for forming flag
emoji, for example).
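A list along those lines could be regenerated roughly as follows. This is only a sketch of the described method, not the script used for r341700: the `LANGUAGE_SPECIFIC` set is a partial, illustrative guess at the excluded characters, and the result depends on the Unicode version bundled with the Python interpreter.

```python
import sys
import unicodedata

# The "TAG" block, excluded as described above.
TAG_BLOCK = range(0xE0000, 0xE0080)

# Assumption: a partial, illustrative set of language-specific format
# characters; the real exclusion list is not given in this thread.
LANGUAGE_SPECIFIC = {
    0x061C,  # ARABIC LETTER MARK
    0x200E,  # LEFT-TO-RIGHT MARK
    0x200F,  # RIGHT-TO-LEFT MARK
}


def candidate_invisible_characters():
    """Collect code points in general category Cf, minus the exclusions."""
    result = []
    for cp in range(sys.maxunicode + 1):
        if cp in TAG_BLOCK or cp in LANGUAGE_SPECIFIC:
            continue
        if unicodedata.category(chr(cp)) == "Cf":
            result.append(cp)
    return result


invisible = candidate_invisible_characters()
```

ZERO WIDTH SPACE (U+200B) ends up in `invisible`, while the TAG characters and ARABIC LETTER MARK do not.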
> For that matter, there are many more symbols that look like ASCII symbols
> and are not listed in the current homoglyph list.
Yes, this list intentionally omits characters that may be intentionally used in
identifiers across various current languages.
> It doesn't seem clear to me that maintaining a full set of
> homoglyph/invisible characters by hand would really be worthwhile, versus
> taking the Unicode recommendations, which almost entirely supersede it.
The point of the list is to give high-quality diagnostics for common situations
-- specifically, Greek question marks and (now) non-breaking spaces. A generic
list of characters doesn't let us give as good a diagnostic experience, and it's
much more important to give an excellent experience for these more-common cases
than to give some warning for rare cases.
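The trade-off can be sketched as a small lookup table with a generic fallback. The message wording and the `LOOKALIKES` table below are hypothetical and are not Clang's actual diagnostics; they only show why a hand-picked entry can say more than a bare "unexpected character" warning.

```python
import unicodedata

# Hypothetical hand-maintained table: confusable character -> what it resembles.
LOOKALIKES = {
    "\u037e": ";",  # GREEK QUESTION MARK
    "\u00a0": " ",  # NO-BREAK SPACE
}


def diagnose(ch: str) -> str:
    """Return a precise message for known cases, a generic one otherwise."""
    code = f"U+{ord(ch):04X}"
    lookalike = LOOKALIKES.get(ch)
    if lookalike is not None:
        # Specific case: name the character and say what it was probably
        # meant to be.
        return (
            f"{code} ({unicodedata.name(ch)}) looks like {lookalike!r} "
            "but is a different character"
        )
    # Generic case: all we can say is that the character is unexpected.
    return f"{code} is not expected in an identifier"
```

For the Greek question mark, the specific message names the character and the `';'` it was probably meant to be; for anything outside the table, only the code point can be reported.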