straight-shoota commented 3 years ago

A recently released paper titled Trojan Source: Invisible Vulnerabilities demonstrates an attack against source code. It uses Unicode bi-directional overrides to disguise the meaning of code to a human reader. This can lead to seemingly harmless code introducing malicious behaviour.

Crystal demonstration

The following code demonstrates a stretched-string attack in Crystal:

access_level = "user"
if access_level != "user‮ ⁦# Check if admin⁩ ⁦"
  puts "You are an admin!"
end

https://carc.in/#/r/c6ka

The following code demonstrates a commenting-out attack in Crystal:

access_level = "user"
if access_level != "none‮⁦" # Check if admin⁩⁦" && access_level != "user
  puts "You are an admin!"
end

https://carc.in/#/r/c6kh

They looks mostly unsuspicious. You wouldn't expect either to print anything. But both programs actually print You are an admin! despite access_level = "user".

The second lines of each program's source code contain a number of Unicode control characters for bi-directional overrides. This is what the parser reads:

# stretched-string attack
if access_level != "user\u202E \u2066# Check if admin\u2069 \u2066"
# commenting-out attack
if access_level != "none\u202E\u2066"# Check if admin\u2069\u2066" && access_level != "user

The only indicator that something might be off is the syntax highlighting, which should be pretty resistant to being fooled. Github has already introduced a feature that shows a warning when bi-directional overrides are detected in a file: https://github.blog/changelog/2021-10-31-warning-about-bidirectional-unicode-text/

Mitigation

This vulnerability can be defended easily by disallowing bi-directional control characters in source code. In many locations, such control characters are already a syntax error. But they are currently valid in comments and string literals. Those are the typical spots for most languages.

However, Crystal's parser currently even accepts Unicode control characters in identifiers, including bi-directional override characters. Restricting the allowed character set in general is another problem and tracked in #11216.

I propose to change the language specification and lexer rules such that valid Crystal source code must not contain any bi-directional control characters, regardles of location.

A more fine-grained approach would be possible as well, but this should be unnecessary considering there are little to no legitimate use cases for bidirectional control characters in Crystal source code (but for some specific exceptions mentioned in the following section).

Workarounds

Bi-directional override characters are legitimate contents for string literals. Instead of encoding them directly in the source code, a proper workaround is to use escape sequences for that.

Bi-directional overrides can technically be legitimate in comments if you want to mix languages with different directions in the comment text. That does not seem like a very important use case, though.

Still, as a further enhancement, bi-directional overrides could potentially be allowed in comments and possibly other locations such as string literals as long as they are fully enclosed inside the comment or literal.

The general vulnerability is tracked as CVE-2021-42574.

naqvis commented 3 years ago

I propose to change the language specification and lexer rules such that valid Crystal source code must not contain any bi-directional control characters, regardles of location.

I will argue against this proposal as this is too aggressive. We might take insights from how other languages are handling such cases. Below is an excerpt from how Rust team is doing that:

Mitigations

We will be releasing Rust 1.56.1 today, 2021-11-01, with two new deny-by-default lints detecting the affected codepoints, respectively in string literals and in comments. The lints will prevent source code files containing those codepoints from being compiled, protecting you from the attack.

If your code has legitimate uses for the codepoints we recommend replacing them with the related escape sequence. The error messages will suggest the right escapes to use.

asterite commented 3 years ago

Isn't that what Rust is doing, except that they suggest the escape codes to use?

straight-shoota commented 3 years ago

I don't see much difference, actually. AFAIK there should be no possibilty of such a character being valid anywhere in Crystal source code outside of string literals or comments. As mentioned, in those contexts we could employ a more sophisticated analysis and allow bi-directional control overwrites if they're fully enclosed in the respective literal or comment.

But that is an additional feature and requires more discussion on how to implement.

The primary focus is to apply a defence against this kind of source code modification. Making bi-directional control characters invalid anywhere in a Crystal source file ensures that. With a more fine-grained approach we would need to make sure to identify and cover all vulnerable tokens. The lexer is not very restrictive in many places where you would expect it to be (see #11216). So this would be a real challenge and probably end up with a pretty similar result.

drhuffman12 commented 3 years ago

FWIW, see also python approach for unicode handling for key words: https://stackoverflow.com/questions/41757886/how-can-i-recognise-non-printable-unicode-characters-in-python

fmease commented 3 years ago

I don't see much difference, actually.

The difference between the proposed hard ban of these control characters in Crystal and Rust's implementation is the fact that the latter is a lint which throws an error by default (deny-by-default) but it can be overwritten by the user to not raise any error or to raise only a warning. This can be done at the level of a whole program but also at the level of individual modules, single functions or merely specific expressions. Just FYI.

straight-shoota commented 3 years ago

Yes, I understand Rust uses a different implementation approach. But I see little practical difference. We could consider adding an option ignore or warn instead of failing. In Crystal, however, we do not have a lint system which could offer fine-grained control for where it applies. I figure a global option would be less useful.

But is there even a reasonably valid use case for having bi-directional control characters in your Crystal source code? What would make you disable the bi-direction control character validation?

Fryguy commented 3 years ago

A even better question is whether or not there's any Crystal code out there with bi-directional control characters. Seems extremely unlikely that there is actual, legitimate, usage of this.

beta-ziliani commented 3 years ago

I'd say we take the most aggressive approach, and if we find the community is having legitimate uses of these characters, then we make an opt-out option. (We can do the opt-out right away, but this is more work to maintain.)

beta-ziliani commented 3 years ago

After some back and forth thinking, we decided to give us a time to discuss this properly, and not to rush it in the coming 1.2.2 version.

In a way, we can see three vectors for this type of attack in a language: string literals, identifiers, and comments. With Crystal having only CR-terminating comments, it's not clear that there can be an actual attack using comments, so let's focus on the other two vectors.

When it comes to identifiers, this is already discussed in #11216, and there are more Unicode characters than the bidi control ones that are problematic. I don't think there is much of a debate about that we need to properly handle these.

As for string literals, an option could be to let the format tool to turn bidi control chars into their escape sequence. That will allow us to not be restrictive, and just impose it as a formatting standard.

HertzDevil commented 1 year ago

I created a PR that warns on those unescaped characters, now that parser warnings are available. Then the follow-up would be escaping them whenever possible, which includes more things than string literals. For demonstration purposes, assume that 😂 is U+202E. Then the following:

"😂"
%(😂)
:😂
:"😂"
`😂`

def foo(😂 x)
end

foo(😂: 1)

FOO😂 = 1
%w(😂)

would be formatted into:

"\u202E"
%(\u202E)
:"\u202E" # note the addition of quotation marks
:"\u202E"
`\u202E`

def foo("\u202E" x)
end

foo("\u202E": 1)

FOO😂 = 1 # unchanged because escape sequences are unavailable here
%w(😂)    # ditto

12740 is another example where warnings + formatter are used to phase out deprecated syntax in the language, so we should follow suit here.

straight-shoota commented 1 year ago

Format + warning sounds good to me.

straight-shoota commented 1 year ago

To make it clear for everyone: We're not talking about replacing emojis. 😂 is just a stand-in for a bi-directional control character (U+202E for example).

straight-shoota commented 1 year ago

@HertzDevil What about comments?

# 😂

Should that become this?

# \u202E

Comment's don't actively support unicode escape sequences, but then it's just comments 🤷

beta-ziliani commented 1 year ago

but isn't the problem in this example about bidi chars in comments? That said, it'd be nice if we can still allow certain things like:

p "hello world" # مرحبا بالعالم

HertzDevil commented 1 year ago

I think that if all the problematic characters to the left of # are escaped, then the rendering of the non-comment part is never affected and an attack isn't possible, but I don't have a proof for this yet. If that is the case we don't need to escape anything within comments.

We cannot unconditionally escape all those characters in Crystal code blocks within comments, as that might produce invalid code. And then there are #<loc:...> pragmas which syntactically behave like /* ... */ comments in other languages if one disregards its effects on stack traces; the first argument is a string that ignores escape sequences and can even span across multiple lines.

In any case, even if we don't escape the comments automatically, the compiler and formatter shall warn on them every time anyway. By the way, here is an example of an attack that involves no string literals at all, only comments, made possible because these control characters are allowed in identifiers:

user_id‮⁦ = 1000
if user_id‮⁦ # Check if admin⁩⁦ != 1000
  puts "You are an admin!"
end

The escaped form is:

user_id\u202E\u2066 = 1000
if user_id\u202E\u2066 # Check if admin\u2069\u2066 != 1000
  puts "You are an admin!"
end

straight-shoota commented 1 year ago

For identifiers, I think the compiler should eventually reject bidi control characters (ref #11216). But warning should be good as a start.

HertzDevil commented 1 year ago

There is now a draft Unicode Technical Standard for this: https://www.unicode.org/reports/tr55/#Spoofing-bidi

naqvis commented 1 year ago

There is now a draft Unicode Technical Standard for this: https://www.unicode.org/reports/tr55/#Spoofing-bidi

Nice to see 😃

The solution is not to forbid the directional formatting characters; indeed the Ada example above does not use these. This document instead makes two recommendations.

First, source code editors should display source code according to its lexical structure, as described in Section 4.1, Bidirectional Ordering.

Second, computer languages should allow for the insertion of directional formatting characters as described in Section 3.2, Whitespace and Syntax, and implementers should provide tools that automatically remove spurious directional formatting characters, and insert the correct ones, as described in Section 5.2, Conversion to Plain Text.

crystal-lang / crystal

Parser vulnerable to Trojan Source attack #11392

Crystal demonstration

Mitigation

Workarounds

Mitigations

12740 is another example where warnings + formatter are used to phase out deprecated syntax in the language, so we should follow suit here.