carbon-language / carbon-lang

Carbon Language's main repository: documents, design, implementation, and related tools. (NOTE: Carbon Language is experimental; see README)
http://docs.carbon-lang.dev/

Mitigating underhanded code #19

Open zygoloid opened 4 years ago

zygoloid commented 4 years ago

We should decide to what extent and in what ways we want to mitigate the risk of underhanded Carbon code -- code that intentionally behaves in a manner different from its appearance.

See https://github.com/carbon-language/carbon-lang/pull/17/files#r428333473 for a case where this question arose.

mconst commented 4 years ago

Thanks for splitting this off into a separate issue!

In https://github.com/carbon-language/carbon-lang/pull/17#discussion_r428333473, you wrote:

We need to decide to what extent we want to worry about the second class of problems. I could certainly imagine changes we could make to the specification to avoid these problems -- restricting source files to ASCII eliminates a lot of them -- but those changes come with associated costs, some of which would directly harm our inclusivity goal (for example: you cannot write identifiers -- or string literals! -- in your native language). For a project that expects to see only English-language identifiers and comments, restricting source files to ASCII might be a totally reasonable option. My inclination is that that should be a project-wide decision, though, not a language-wide one.

Yeah. Most programming languages today just let users deal with this on their own, with the idea that a project that wants to ban non-ASCII characters can add its own commit hook or similar.

I don't think this is an effective solution in practice. Although language designers think of it as delegating the choice to individual projects, what actually happens is that projects don't do anything and thus remain insecure. Even worse, projects that use non-English languages don't even have the option to be secure! This doesn't feel very inclusive.

I don't have an answer here, but I'd love to brainstorm some possibilities. Here's one potential direction:

Assuming we have some sort of package manager, we could require each package to declare a list of (natural) languages they'll allow in identifiers, strings, and comments. We'd probably only support a small set of languages at first; for each of those, we'd have some sort of rules for valid character sequences that avoid confusion with language syntax or with other valid text in the set of allowed languages. (We could use looser rules for comments, which only care about confusability with language syntax; in strings we probably also care about at least confusability with ASCII, so we don't end up with programs that look like they're printing a particular HTTP header or shell command but are actually printing something else. For identifiers we need full confusability detection.) The rules would be enforced by the compiler.
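The per-package allowlist idea could be enforced with a check along these lines. This is only a sketch: the language names and character ranges below are hypothetical placeholders, and real per-language rules would come from Unicode UAX #31 plus confusability data, not hand-written tables.

```python
# Hypothetical per-language tables mapping a declared natural language to
# the Unicode code-point ranges its identifiers may use. These ranges are
# illustrative only; real rules would be derived from UAX #31.
ALLOWED_RANGES = {
    "English": [(0x0030, 0x0039), (0x0041, 0x005A), (0x0061, 0x007A)],
    "Greek":   [(0x0391, 0x03A9), (0x03B1, 0x03C9)],
}

def check_identifier(ident, package_languages):
    """Return True if every character of `ident` falls in a range
    permitted by at least one of the package's declared languages
    (underscore always allowed)."""
    ranges = [r for lang in package_languages for r in ALLOWED_RANGES[lang]]
    for ch in ident:
        if ch == "_":
            continue
        if not any(lo <= ord(ch) <= hi for lo, hi in ranges):
            return False
    return True
```

A package declaring only "English" would then reject identifiers containing, say, accented Latin letters, while a package declaring "Greek" could use Greek-letter identifiers freely.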

Not all combinations of languages would be supported, so we don't need to analyze the full matrix of language pairs. For right-to-left languages, in addition to restricting which characters can be used, we'd require the user to add an LTR mark after any string or identifier where it would affect the display ordering. (This might be annoying, but it has some big benefits -- not only does it prevent security issues, it fixes all the display problems that frequently come up even in legitimate code using right-to-left text. Let me know if you want some examples of these; they're pretty terrible.)

Note that our syntax choices affect how permissive we can be. For example, if we have C-style /* comments, we need to ban any text inside them that could simulate a */ sequence, or hide the real */ sequence, or cause text to be reordered across the */. But if we only have // comments, we can be much more generous.
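The concern about /* comments can be made concrete with a conservative filter: reject any comment body containing a right-to-left or explicit bidi-control character, since those are what allow text to visually cross or disguise the closing */ delimiter. A rough sketch; the exact set of bidi classes treated as dangerous is an assumption, not a worked-out rule:

```python
import unicodedata

# Bidi classes that can visually reorder or hide surrounding text:
# strong RTL letters (R, AL) plus the explicit embedding, override,
# and isolate controls. Illustrative, not normative.
DANGEROUS_BIDI = {"R", "AL", "RLE", "RLO", "LRE", "LRO",
                  "RLI", "LRI", "FSI", "PDI", "PDF"}

def block_comment_body_ok(body):
    """Conservatively accept a /* ... */ comment body only if it has no
    character that could make text appear to cross the closing */."""
    return all(unicodedata.bidirectional(ch) not in DANGEROUS_BIDI
               for ch in body)
```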

Does this seem like a plausible direction? And more generally, is this a problem worth thinking about? I'd love to hear people's thoughts.

zygoloid commented 4 years ago

This sounds like a promising idea. We should find people with internationalization expertise to discuss this with -- perhaps some of the Unicode folks -- to make sure that what we're considering is reasonable. Certainly as a mechanism to avoid homograph issues, limiting the set of available characters seems like a plausible step to take, and we could augment that with a manual list of confusables if our language support extends to a language with such characters.

I'm not sure whether LTR marks are enough to force the context enclosing an RTL string literal to be treated as LTR text. As an alternative, we could run the Unicode bidi algorithm on the source text and verify that it infers left-to-right directionality for all text outside comments and string literals -- that way we don't need to care how the user ensures that the non-text source code is interpreted as left-to-right. However, it's not clear to me whether the directionality portion of this issue should properly be viewed as an encoding issue or as a display issue. A smart renderer could tokenize the source code and ensure that tokens are always displayed left-to-right (or always right-to-left, if that is the viewer's preference), with comments and string literals treated as independent bodies of text. Would we be working around a deficiency in context-unaware text editors by requiring explicit LTR marks (or isolates or whatever else), or is this simply the right way to do it?
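The "verify directionality of everything outside comments and string literals" alternative can be approximated without implementing full UAX #9: strip out the literals, then check that no strongly right-to-left character remains. A sketch, assuming simple non-escaped "..." literals and ignoring comments entirely:

```python
import re
import unicodedata

# Naive literal matcher for the sketch: no escapes, no multiline strings.
STRING_RE = re.compile(r'"[^"]*"')

def source_outside_strings_is_ltr(line):
    """Replace "..." literals with empty literals, then verify that no
    remaining character has a strong right-to-left bidi class (R or AL).
    A real implementation would use the lexer and the full bidi algorithm."""
    stripped = STRING_RE.sub('""', line)
    return not any(unicodedata.bidirectional(ch) in ("R", "AL")
                   for ch in stripped)
```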

In any case, to the extent that we can do so, I'd like for us to delegate the thinking and approach for this problem to another group. I don't think Carbon should be defining its own language profiles -- that really seems like something that some other group should be providing for us and that we should only be consuming. Does anything like this already exist?

chandlerc commented 4 years ago

My two cents:

1) Break this down into contexts where different things need to apply.
2) Use reasonably tiered rules for these contexts to make the scope of damage contained and tractable.

So let's consider contexts from what I think would be least restrictive toward those that are probably most restrictive:

For multiline strings and comments, I think that anything must be OK. And we should design the delimiters with this reality in mind. Notably, in both cases, I think that the start and end should be both vertically and horizontally identifiable. I believe the direction is already going in this way, so I'm not worried here. Put differently, the context doesn't start until a new line, and it ends before the line of the end delimiter. This seems like a strong mitigation for making sure the delimiters can be recognized no matter the contents.

For single-line, I think that we need to block all vertical breaks and/or whitespace. I think Unicode already gives us strong categorizations for this, and we always have the multiline context to allow it. The result should be that again, the delimiters can be engineered to be distinguishable from the content.
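Unicode does give a small, well-defined set of line-terminating characters, so the single-line rule is easy to state mechanically. A sketch, treating the exact break set as an assumption about what Carbon would adopt:

```python
# Line-terminating characters: ASCII LF/CR/VT/FF, NEL, and the Unicode
# line and paragraph separators. A single-line literal or comment body
# would reject all of them so the closing delimiter stays on one line.
VERTICAL_BREAKS = {"\n", "\r", "\x0b", "\x0c", "\x85", "\u2028", "\u2029"}

def single_line_body_ok(body):
    """True if the body contains no vertical break, guaranteeing the
    closing delimiter sits on the same visual line as the opener."""
    return not any(ch in VERTICAL_BREAKS for ch in body)
```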

For identifiers, I think Unicode already gives us strong sets of characters that can be meaningfully delimited by both whitespace and the punctuation set we are using.

I think the interesting questions arise for Carbon's fixed syntax. Here, we don't really have a compelling i18n story, and I'm not aware of any mainstream languages that do... They all pick a set of linguistic rules and adhere to them. While I'm open to adopting innovations in this space if & when they arise, I'm not personally motivated to try to do that innovation with Carbon.

Instead, I would suggest making this context extremely limited. I think we should allow exactly the set of whitespace that we design the fixed syntax to interact with, and disallow everything else. I don't think this will be a burden because the context is only that of the fixed syntax of the language. I honestly think we might want to go so far as to preclude tabs, and just use spaces and newline characters. The ambiguity and lack of consistency seems to dramatically outweigh the benefits of allowing this flexibility (to me). (The only alternative that seems appealing is to insist on tabs for indentation rather than spaces.) I think the result is that we pick a near-lowest-common-denominator character set (including whitespace!) for the fixed syntax.
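The near-lowest-common-denominator whitespace rule sketched above is straightforward to enforce. Assuming, per the comment, that only space and newline survive (so tabs and all exotic Unicode spaces are errors), a checker over fixed-syntax text might look like:

```python
import unicodedata

# The only whitespace the fixed syntax would accept under this proposal.
ALLOWED_WHITESPACE = {" ", "\n"}

def find_disallowed_whitespace(code):
    """Return the indices of every whitespace character outside the
    allowed set; tabs, NBSP, and other Unicode spaces are all flagged."""
    return [i for i, ch in enumerate(code)
            if (ch.isspace() or unicodedata.category(ch) == "Zs")
            and ch not in ALLOWED_WHITESPACE]
```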

Concerns like RTL seem well handled with this kind of approach because the RTL would be within one of these contexts, not outside of it. Admittedly, it isn't beautifully internationalized through this approach, but I don't think anything short of innovating toward internationalized syntax will accomplish that goal. And it should still allow arbitrary alphabet usage within the user-defined contexts, with minimal restrictions necessary to delimit those contexts.

WDYT?

zygoloid commented 4 years ago

I don't think that approach prevents us from having source code that appears to do one thing but does another -- the Unicode bidirectional algorithm knows nothing about our different syntactic contexts. If it sees

var String: x = "مرحبا" + "بالعالم";

then it might look like the string starts with "بالعالم" and ends with "مرحبا", but that is not the case: the string starts with "مرحبا" and ends with "بالعالم". The reason is that all of the text between the first " and the last ", including the +, is displayed right-to-left. (The ", +, and space characters all have weak or neutral directionality and do not force a switch to left-to-right.)

In order to fix this, we can ask users to insert explicit left-to-right marks after a right-to-left string. That would then display properly:

var String: x = "مرحبا"‎ + "‎بالعالم";

... but would require that we treat such marks as whitespace, which is an option suggested by Unicode TR31. (Effectively, this approach would amount to "we will verify that the code does what it looks like it does, but making it look right is the user's problem not ours".)

If we want to force the matter, we could do so by, for example, including a character with strong left-to-right directionality as part of at least the closing delimiter of any context that contains right-to-left text. For example, we could permit an optional (no-op) trailing character with strong LTR ordering (say r for "RTL") on string literals. Then the above example could be written as:

var String: x = "مرحبا"r + "بالعالم"r;

with no need for any invisible LTR marks. And similarly we could require identifiers to end with a character with strong LTR directionality.
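A compiler could decide which literals need such a terminator by looking at the last strongly-directional character of the body. A sketch of that check (the `r` suffix itself is only the proposal above, not existing syntax):

```python
import unicodedata

def needs_ltr_terminator(literal_body):
    """True if the last strongly-directional character in the literal is
    right-to-left, in which case (under this proposal) the literal would
    require a trailing LTR mark or an explicit `r` suffix to render
    correctly in surrounding left-to-right code."""
    for ch in reversed(literal_body):
        cls = unicodedata.bidirectional(ch)
        if cls == "L":
            return False
        if cls in ("R", "AL"):
            return True
    return False
```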

Perhaps we could combine this with my previous suggestion:

This still allows some amount of underhandedness, for example:

var auto: x = ‏"==0".size()‎;

... might look like it's comparing 0 against "".size(), but the source code is actually this:

var auto: x = <RTL mark>"==0".size();

... so x is 3, not True. I think this too can be handled by requiring that the opening delimiter of a context in which we permit RTL text be classified as left-to-right. But someone would need to study the Unicode bidi algorithm to be sure whether that's sufficient. Disallowing LTR and RTL marks outside of literals and comments does seem like a safer choice.
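The safer choice at the end amounts to rejecting the explicit bidi control characters wholesale in fixed-syntax text. Since these controls form a small fixed set, the check is trivial; a sketch:

```python
# The Unicode explicit directional formatting characters: LRM, RLM, ALM,
# the embedding/override pairs with their terminator, and the isolates.
BIDI_CONTROLS = {"\u200e", "\u200f", "\u061c",
                 "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
                 "\u2066", "\u2067", "\u2068", "\u2069"}

def has_bidi_controls(text):
    """True if any explicit bidi control appears in `text`; under the
    proposal this would be a hard error outside literals and comments."""
    return any(ch in BIDI_CONTROLS for ch in text)
```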

But this is far from my area of expertise; we should find someone who knows this stuff deeply and ask.

chandlerc commented 4 years ago

Agreed about finding an expert to help here, or at least running our ideas by some experts.

But I really like forbidding LTR and RTL marks outside of specific contexts where they can be well handled. In essence, I continue to want fixed-syntax whitespace (i.e., outside of any comment or literal) to be an extremely limited character set.

I also really like requiring an explicit LTR demarcation (so not a whitespace LTR mark) at the end of any context which contains non-LTR sequences.

This goes to the principle I'm suggesting of creating unambiguous delimiters for contexts where we enable more things. So there might be an LTR-string-literal context that uses just ", but a more general string-literal context that allows RTL but uses a delimiting style (much like your suggestion) that forces the delimiters themselves to be unambiguous and definitive when correctly rendered.

github-actions[bot] commented 3 years ago

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please comment or remove the inactive label. The long term label can also be added for issues which are expected to take time. This issue is labeled inactive because the last activity was over 90 days ago.