Open straight-shoota opened 3 years ago
I propose to change the language specification and lexer rules such that valid Crystal source code must not contain any bi-directional control characters, regardles of location.
I will argue against this proposal as this is too aggressive. We might take insights from how other languages are handling such cases. Below is an excerpt from how Rust team is doing that:
Mitigations
We will be releasing Rust 1.56.1 today, 2021-11-01, with two new deny-by-default lints detecting the affected codepoints, respectively in string literals and in comments. The lints will prevent source code files containing those codepoints from being compiled, protecting you from the attack.
If your code has legitimate uses for the codepoints we recommend replacing them with the related escape sequence. The error messages will suggest the right escapes to use.
Isn't that what Rust is doing, except that they suggest the escape codes to use?
I don't see much difference, actually. AFAIK there should be no possibilty of such a character being valid anywhere in Crystal source code outside of string literals or comments. As mentioned, in those contexts we could employ a more sophisticated analysis and allow bi-directional control overwrites if they're fully enclosed in the respective literal or comment.
But that is an additional feature and requires more discussion on how to implement.
The primary focus is to apply a defence against this kind of source code modification. Making bi-directional control characters invalid anywhere in a Crystal source file ensures that. With a more fine-grained approach we would need to make sure to identify and cover all vulnerable tokens. The lexer is not very restrictive in many places where you would expect it to be (see #11216). So this would be a real challenge and probably end up with a pretty similar result.
FWIW, see also python approach for unicode handling for key words: https://stackoverflow.com/questions/41757886/how-can-i-recognise-non-printable-unicode-characters-in-python
I don't see much difference, actually.
The difference between the proposed hard ban of these control characters in Crystal and Rust's implementation is the fact that the latter is a lint which throws an error by default (deny-by-default) but it can be overwritten by the user to not raise any error or to raise only a warning. This can be done at the level of a whole program but also at the level of individual modules, single functions or merely specific expressions. Just FYI.
Yes, I understand Rust uses a different implementation approach. But I see little practical difference. We could consider adding an option ignore or warn instead of failing. In Crystal, however, we do not have a lint system which could offer fine-grained control for where it applies. I figure a global option would be less useful.
But is there even a reasonably valid use case for having bi-directional control characters in your Crystal source code? What would make you disable the bi-direction control character validation?
A even better question is whether or not there's any Crystal code out there with bi-directional control characters. Seems extremely unlikely that there is actual, legitimate, usage of this.
I'd say we take the most aggressive approach, and if we find the community is having legitimate uses of these characters, then we make an opt-out option. (We can do the opt-out right away, but this is more work to maintain.)
After some back and forth thinking, we decided to give us a time to discuss this properly, and not to rush it in the coming 1.2.2 version.
In a way, we can see three vectors for this type of attack in a language: string literals, identifiers, and comments. With Crystal having only CR-terminating comments, it's not clear that there can be an actual attack using comments, so let's focus on the other two vectors.
When it comes to identifiers, this is already discussed in #11216, and there are more Unicode characters than the bidi control ones that are problematic. I don't think there is much of a debate about that we need to properly handle these.
As for string literals, an option could be to let the format
tool to turn bidi control chars into their escape sequence. That will allow us to not be restrictive, and just impose it as a formatting standard.
I created a PR that warns on those unescaped characters, now that parser warnings are available. Then the follow-up would be escaping them whenever possible, which includes more things than string literals. For demonstration purposes, assume that 😂
is U+202E. Then the following:
"😂"
%(😂)
:😂
:"😂"
`😂`
def foo(😂 x)
end
foo(😂: 1)
FOO😂 = 1
%w(😂)
would be formatted into:
"\u202E"
%(\u202E)
:"\u202E" # note the addition of quotation marks
:"\u202E"
`\u202E`
def foo("\u202E" x)
end
foo("\u202E": 1)
FOO😂 = 1 # unchanged because escape sequences are unavailable here
%w(😂) # ditto
Format + warning sounds good to me.
To make it clear for everyone: We're not talking about replacing emojis. 😂
is just a stand-in for a bi-directional control character (U+202E
for example).
@HertzDevil What about comments?
# 😂
Should that become this?
# \u202E
Comment's don't actively support unicode escape sequences, but then it's just comments 🤷
but isn't the problem in this example about bidi chars in comments? That said, it'd be nice if we can still allow certain things like:
p "hello world" # مرحبا بالعالم
I think that if all the problematic characters to the left of #
are escaped, then the rendering of the non-comment part is never affected and an attack isn't possible, but I don't have a proof for this yet. If that is the case we don't need to escape anything within comments.
We cannot unconditionally escape all those characters in Crystal code blocks within comments, as that might produce invalid code. And then there are #<loc:...>
pragmas which syntactically behave like /* ... */
comments in other languages if one disregards its effects on stack traces; the first argument is a string that ignores escape sequences and can even span across multiple lines.
In any case, even if we don't escape the comments automatically, the compiler and formatter shall warn on them every time anyway. By the way, here is an example of an attack that involves no string literals at all, only comments, made possible because these control characters are allowed in identifiers:
user_id = 1000
if user_id # Check if admin != 1000
puts "You are an admin!"
end
The escaped form is:
user_id\u202E\u2066 = 1000
if user_id\u202E\u2066 # Check if admin\u2069\u2066 != 1000
puts "You are an admin!"
end
For identifiers, I think the compiler should eventually reject bidi control characters (ref #11216). But warning should be good as a start.
There is now a draft Unicode Technical Standard for this: https://www.unicode.org/reports/tr55/#Spoofing-bidi
There is now a draft Unicode Technical Standard for this: https://www.unicode.org/reports/tr55/#Spoofing-bidi
Nice to see 😃
The solution is not to forbid the directional formatting characters; indeed the Ada example above does not use these. This document instead makes two recommendations.
First, source code editors should display source code according to its lexical structure, as described in Section 4.1, Bidirectional Ordering.
Second, computer languages should allow for the insertion of directional formatting characters as described in Section 3.2, Whitespace and Syntax, and implementers should provide tools that automatically remove spurious directional formatting characters, and insert the correct ones, as described in Section 5.2, Conversion to Plain Text.
A recently released paper titled Trojan Source: Invisible Vulnerabilities demonstrates an attack against source code. It uses Unicode bi-directional overrides to disguise the meaning of code to a human reader. This can lead to seemingly harmless code introducing malicious behaviour.
Crystal demonstration
The following code demonstrates a stretched-string attack in Crystal:
https://carc.in/#/r/c6ka
The following code demonstrates a commenting-out attack in Crystal:
https://carc.in/#/r/c6kh
They looks mostly unsuspicious. You wouldn't expect either to print anything. But both programs actually print
You are an admin!
despiteaccess_level = "user"
.The second lines of each program's source code contain a number of Unicode control characters for bi-directional overrides. This is what the parser reads:
The only indicator that something might be off is the syntax highlighting, which should be pretty resistant to being fooled. Github has already introduced a feature that shows a warning when bi-directional overrides are detected in a file: https://github.blog/changelog/2021-10-31-warning-about-bidirectional-unicode-text/
Mitigation
This vulnerability can be defended easily by disallowing bi-directional control characters in source code. In many locations, such control characters are already a syntax error. But they are currently valid in comments and string literals. Those are the typical spots for most languages.
However, Crystal's parser currently even accepts Unicode control characters in identifiers, including bi-directional override characters. Restricting the allowed character set in general is another problem and tracked in #11216.
I propose to change the language specification and lexer rules such that valid Crystal source code must not contain any bi-directional control characters, regardles of location.
A more fine-grained approach would be possible as well, but this should be unnecessary considering there are little to no legitimate use cases for bidirectional control characters in Crystal source code (but for some specific exceptions mentioned in the following section).
Workarounds
Bi-directional override characters are legitimate contents for string literals. Instead of encoding them directly in the source code, a proper workaround is to use escape sequences for that.
Bi-directional overrides can technically be legitimate in comments if you want to mix languages with different directions in the comment text. That does not seem like a very important use case, though.
Still, as a further enhancement, bi-directional overrides could potentially be allowed in comments and possibly other locations such as string literals as long as they are fully enclosed inside the comment or literal.
The general vulnerability is tracked as CVE-2021-42574.