Address Trojan Source risk in agoric-sdk

dckc commented 2 years ago

What is the Problem Being Solved?

As discussed in Trojan Source Attacks aka CVE-2021-42574 and CVE-2021-42694, there's a risk that our code is modified maliciously in a way that's difficult / impossible to see. (For associated risks with packages we depend on, see #4025)

Description of the Design

It seems cost-effective to add some scan to our ci and/or release process. This could take the form of disallowing the Unicode characters presenting a risk wholesale, or in a targeted way to address attack vectors.

Additional context

This issue covers only impact to our own source code. For impact to dependencies see https://github.com/Agoric/agoric-sdk/issues/4025. For impact to code written on top of our stack, see https://github.com/endojs/endo/issues/923

Node.js project review of CVE-2021-42574 and CVE-2021-42694

cc @warner @Chris-Hibbert @kriskowal @mhofman @michaelfig

Chris-Hibbert commented 2 years ago

I've been weighing in on this discussion mostly to suggest that we ought to be providing support to our customers for hardened JavaScript and Jessie. If we can find ways to beef up our eslint or prettier configurations to call out BIDI characters, that will be an enormous benefit to our developer community.

zarutian commented 2 years ago

one eslint rule maybe worth considering is to enforce that any BIDI codepoints encountered in string literals or comments MUST come in pairs inside that string literal or comment.

Another is to point out homomorphic glyphs when they are not in a run of glyphs from the same page.

Otherwise this „Trojan Source“ thing sounds like security alarmist nonsense that evokes English-only speakers to come up with boneheaded non-solutions such as „use ASCII only“.

Plus, why is code still being represented as text instead of other forms?

mhofman commented 2 years ago

I'll sum up my opinion on this risk:

BiDi characters seem to require being used inside comments. I'm not sure how frequent BiDi characters may be in source files for authors in RTL languages. GitHub now has a warning if a file includes any of these, and that might be sufficient. The next step up is to detect them inside comments and fail CI. I'm not sure if we can extend tools like ESlint to parse and validate comments.
Homoglyphs or zero-width-joiners are only a problem for shadowing identifiers. GitHub does not warn about this at all. However this seems like a perfect job for an extended ESLint rule, in particular the no-redeclare and no-shadow rules. I believe the JavaScript community should extend those rules to check for identifiers with homoglyphs and zero-width-joiners, and we should enable such a rule by default.

mhofman commented 2 years ago

Commented along the same lines on https://github.com/eslint/eslint/issues/15240 which seem to be the only issue on ESLint regarding Trojan Source.

mhofman commented 2 years ago

Zero Width Space already seem to have a rule: https://eslint.org/docs/rules/no-irregular-whitespace

dckc commented 2 years ago

GitHub now has a warning if a file includes any of these, and that might be sufficient.

At what point in my workflow would I discover this warning? Only if I inserted such code? Or if I were reviewing such code?

It doesn't help with dependencies, right?

mhofman commented 2 years ago

IMO source code of dependencies are entirely out of scope and part of the larger "supply chain risk".

So yes, the GitHub warning comes into play when reviewing code changes. We may want an ESLint rule that disallow BiDi characters in our own source, but I'd be worried it's too strong of a rule for Jessie.

dckc commented 2 years ago

This issue is explicitly about supply-chain risks. It's possible that we decide to take no further action, but it's not out of scope.

p.s. Jessie rules are the subject of https://github.com/endojs/endo/issues/923

mhofman commented 2 years ago

our code or one of the packages we depend on is modified maliciously

This issue as written seem about both our code, as well as our dependencies. I'm advocating we split them as I believe the mitigations are completely different.

dckc commented 2 years ago

I don't feel strongly about splitting the issue further; feel free to open another issue. I'm not likely to do it myself.

jessysaurusrex commented 2 years ago

We could do a simple one-time manual audit to surface any malicious character usage in our dependencies (grep ftw!), but the long term, scalable solution here is to set up automatic linting rules to detect these unicode characters (per @mhofman's suggestions).

Long term, if we expect humans who create smart contracts to manually validate that code in any way, it would definitely be worth it to understand how that process and tooling would work so we have a clearer idea of where the opportunities to provide warnings or surface risk are. Ultimately, we will have users executing untrusted code in the form of smart contracts and investing in additional risk mitigations that surface issues like this will provide a more comprehensive solution.

It is unlikely that deny listing would be a stable long-term solution to this issue and others like it because unicode is quite large and complex. This is far from the first attack in this class of human perception bugs with unicode. It is likely that we will continue to see more issues similar to this arise in the future, and we will have an ever-growing set of linting rules to mitigate them.

mhofman commented 2 years ago

scalable solution here is to set up automatic linting rules to detect these unicode characters

I don't believe that such scanning is possible or worth it for dependencies. It would be for Agoric maintained code however.

I opened #4025 to track risk from dependencies separately.

jessysaurusrex commented 2 years ago

I don't believe that such scanning is possible or worth it for dependencies. It would be for Agoric maintained code however.

It is possible to do this for third-party code, and in the future, we may choose to take a risk-based approach to our dependencies by vendoring (e.g. pinning and scanning) those that would have a critical or high-risk impact to our code.

mhofman commented 2 years ago

It is possible to do this for third-party code, and in the future, we may choose to take a risk-based approach to our dependencies by vendoring (e.g. pinning and scanning) those that would have a critical or high-risk impact to our code.

I've captured that in the #4025 issue.

Commented along the same lines on eslint/eslint#15240 which seem to be the only issue on ESLint regarding Trojan Source.

Looks like the thinking on the ESLint repo seem to be that the existing rules (no-shadow / no-redeclare) cannot be extended to support detection of similar identifiers, but that an additional rule would be useful. However that rule would probably not be in core ESLint.

mhofman commented 2 years ago

The VSCode release notes have a good example of BiDi characters used in code (unlike comments), that change the behavior of the program:

https://code.visualstudio.com/updates/v1_62#_unicode-directional-formatting-characters

//              from, to, ammount
transferBalance(5678,‮6776,4321‬,"USD");

Rendered control characters

Chris-Hibbert commented 2 years ago

Does JavaScript define those characters as whitespace? (i.e. allowed between arbitrary tokens) eek!

That seems like something prettier could prohibit.

mhofman commented 2 years ago

Yeah maybe we can repurpose something like https://eslint.org/docs/rules/no-tabs to disallow any of the BiDi characters. I do expect someone in the community to come up with such a rule. I'm less certain about a rule with homoglyphs identifiers.

After reading a little bit of Unicode TR39 and the different restriction levels, we could also have a rule that implements these restriction levels for strings and identifiers. It would be good for the community to settle on what the Identifer Profile for JavaScript identifiers should look like. I still believe a restriction level approach is over-broad for identifiers and a more targeted "no-homoglyph-shadow" rule would be more appropriate, but restriction levels could be useful for strings.

Agoric / agoric-sdk