[STAL-2195] Initial implementation of intra-method taint analysis in Java

What problem are you trying to solve?

Some significant security vulnerabilities (e.g. SQL injection) can be caught with taint analysis.

Currently, writing accurate rules to detect these vulnerabilities is difficult because we can currently only very crudely approximate the flow of variables through a program. To address this, we will build "native support" for (presently: intra-method) taint analysis for Java.

What is your solution?

[!NOTE] This is a large PR. All commits are intentionally organized and purely additive to each other. It's likely easier to review commits individually, in sequential order.

Preface

Our goal is to implement simple taint analysis in a way that builds on our current CST-based technique.

From a theoretical perspective, a CST-based analysis hits an accuracy ceiling pretty quickly because CST nodes only describe syntactic structure. More traditional approaches first construct an AST, lower that AST to an intermediate form like SSA, construct a control flow graph CFG from basic blocks, and use that to simulate abstract program states (essential for symbol/name resolution, type resolution, constant propagation, among other techniques).

Doing this would require re-architecting and re-implementing some core assumptions of the analyzer, and so that's an out-of-scope consideration.

Methodology

[!NOTE] Terminology: Source: A place in a source code where data originates (e.g. the headers from an HTTP request). Sink: A place in a source code where that data can end up (and particularly that this data could have harmful side effects -- e.g. a SQL statement).

Rule author defines tree-sitter query only for a "sink". Sink captures are sent to the JavaScript runtime and a standard analysis begins.
The sink node's containing method is located.
The MethodFlow class recursively traverses that containing method, iteratively constructing a directed graph that represents a backwards flow analysis. This graph has CST nodes for vertices, and assignment/dependence relationships for edges.
The digraph is then traversed and all paths from the sink to any "source" (i.e a vertex with an out-degree of 0) are collected.
We expose these paths to the rule author as a TaintFlow, and so the rule can perform its own logic. This unit test shows how these APIs are used and what they expose.

Limitations

Some limitations are artificial and were chosen for simplicity. These can be addressed in future iterations:

Mutually exclusive control-flow branches are treated as if they were sequential. Addressing this requires a refactor in how we construct the initial graph.
Variable scoping is not supported (and thus shadowing will be confused for outer scope usage/definition).
Some Java language constructs have not been implemented, as they are dependent on other limitations we've chosen (e.g. field_access)
We hardcoded some heuristics to propagate taint that aren't always true (one, two). We should eventually implement hooks either we or the user can specify custom "propagator" patterns that are aware of, e.g. specific standard library methods.

Out of Scope

Features/abstractions this PR does not implement:

Support for multiple physical locations (both in a Rust RuleResult and in the SARIF output.
Taint sanitizers
Fine-grained definition of taint propagators (detailed in "Limitations")
Inter-method graph building

Alternatives considered

What the reviewer should know

Because our methodology uses a lot of known simplifications that we eventually might want to revisit, these are organized into separate modules so the behavior is both a) easily seen as intentional and b) easy to see what needs to be fixed.
The digraph created is a mostly unfiltered interpretation of the relationship between CST nodes -- it will likely contain superfluous syntax nodes that rules will almost always never care about (e.g. parenthesized_expression).
Renaming from TreeSitterNode.type to TreeSitterNode.cstType was overdue and so was squeezed in here.
Despite the expected bad performance of using string comparisons for cstType, we still do it for implementation simplicity.
- We should eventually refactor this to use the numeric "kind id" that the tree-sitter grammar assigns (which would replace a map lookup and string comparison with just an integer comparison).
- This was omitted because, to do robustly, the grammar ids used in JavaScript need to be guaranteed to be in sync with the compiled grammar. (This is logic our TsLanguageContext implements, but the abstraction is not 1:1 for this use case, so it would require a small refactor -- deemed out of scope for this PR).

DataDog / datadog-static-analyzer