Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

Add warnings for unusual unicode identifiers #37843

Open Quuxplusone opened 6 years ago

Quuxplusone commented 6 years ago
Bugzilla Link PR38870
Status NEW
Importance P enhancement
Reported by James Y Knight (jyknight@google.com)
Reported on 2018-09-07 11:46:12 -0700
Last modified on 2018-09-07 14:56:29 -0700
Version unspecified
Hardware PC Linux
CC llvm-bugs@lists.llvm.org, richard-llvm@metafoo.co.uk
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also
The use of unicode in source code can cause great confusion to readers. This is
true no matter what, but it is rather exacerbated by the fact that the
C++11/C11 standards have a syntax for identifiers which does not follow the
unicode consortium's recommendation for identifiers
(http://www.unicode.org/reports/tr31/)

Notably, the C/C++ syntax allows a bunch of codepoints which are invisible
format characters (in the unicode character class "Cf"). One instance which I
just ran into in actual source code (causing great consternation!) is U+200b
ZERO WIDTH SPACE. In multiple cases in our codebase, developers have managed to
type (or copy/paste) that character into an identifier declaration, and then
used editor completion to copy that mistake into the uses of the name, too.

I believe clang should have warnings for this -- both an default-on warning for
the characters that really shouldn't be present at all, and some optional
warnings for further restricting the character-set used.

I propose 3 new warnings. None of these should warn on the use of \u escapes in
an identifier, *only* for unicode characters literally in the source code.

-Wunicode-identifier-unusual:
  Warn on the use of a non-escaped character in an identifier which is not a usual identifier character (per Unicode Consortium UAX#31's ID_Continue). This will warn on the usage of invisible format characters, amongst others. Enabled by default.

-Wunicode-identifier-unusual=2:
  Warn on the use of a non-escaped character in an identifier which are not valid in UAX#31's ID_continue table, and on those which are listed in the additional candidates for exclusion table.

-Wunicode-identifier:
  Warn on the use of any unescaped non-ascii character in an identifier.

An ID_Continue table has 707 ranges, and a table of excluded ranges for the
additional candidates to exclude for the 2nd level warning is somewhere around
230 ranges.

Thus, this would probably add about 7K of extra static data to clang, which
doesn't seem an unreasonable amount.

While checking the codebase for this sort of issue, some instances in string
constants were also found -- and although one such instance was a bug, I don't
believe strange characters in string constants is as obviously wrong, so I
don't propose to do anything about that.
Quuxplusone commented 6 years ago
I think the first two categories make a lot of sense to warn about in the
compiler. (I suspect we don't need the full table, though, since we only need
to warn on characters that are both accepted as identifier characters by some
supported language and not classified as ID_Continue by Unicode; I'd hope
that's a substantially shorter list.)

The third one seems like a stylistic warning, though, and I think it belongs in
clang-tidy or another style-checker instead (I could equally imagine someone
wanting a check that all identifiers declared outside of system headers are in
some other character set, and it doesn't seem appropriate to give Latin1
special treatment).

I'd prefer to have a specific warning for the special case of invisible
characters, just as we have a specific warning for the special case of
homoglyphs that look like operators in some fonts, mostly so that we can give a
more precise diagnostic. I've added that in r341700; the broader issue of
warning on non-ID_Continue characters remains.

The -Wflag=value style is not really something we do in Clang warning flags
(except for GCC compatibility); we'd want a different name for the second flag.
Quuxplusone commented 6 years ago

It would also make sense to warn on identifiers that are not in some specific normalization form (perhaps NFC?); the C and C++ rules require us to treat normalized and non-normalized versions of the same identifier as distinct(!) so we can't normalize as part of forming the identifier if we want to conform to those rules.

Quuxplusone commented 6 years ago

What method did you use to generate the list in r341700? I think there's substantially more likely-invisible characters than are listed there (for ex, the entire "TAG" block, E0000..E007F).

For that matter, there's a whole lot more symbols which look like ascii symbols, which are not listed in the current homoglyph list.

It doesn't seem clear to me that maintaining a full set of homoglyph/invisible characters by hand would really be worthwhile, versus taking the Unicode recommendations, which almost entirely supersedes them.

The only entry in the current homoglyph/invisibles list which won't be excluded by an ID_Continue check is "LATIN LETTER RETROFLEX CLICK" (because it is a "letter" in the Latin script).

Quuxplusone commented 6 years ago
(In reply to James Y Knight from comment #3)
> What method did you use to generate the list in r341700? I think there's
> substantially more likely-invisible characters than are listed there (for
> ex, the entire "TAG" block, E0000..E007F).

It's all characters in category Cf, excluding characters used for language-
specific purposes (such as ARABIC LETTER MARK and various bidirectional
markers) and the TAG block (whose contents are not actually invisible
characters in general, and act more like combining characters for forming flag
emoji, for example).

> For that matter, there's a whole lot more symbols which look like ascii
> symbols, which are not listed in the current homoglyph list.

Yes, this list intentionally omits characters that may be intentionally used in
identifiers across various current languages.

> It doesn't seem clear to me that maintaining a full set of
> homoglyph/invisible characters by hand would really be worthwhile, versus
> taking the Unicode recommendations, which almost entirely supersedes them.

The point of the list is to give high-quality diagnostics for common situations
-- specifically, greek question marks and (now) non-breaking spaces. A list of
characters doesn't let us give as good a diagnostic experience, and it's much
more important to give an excellent experience for these more-common cases than
to give some warning for rare cases.