DataDog / datadog-static-analyzer

Datadog Static Analyzer
https://docs.datadoghq.com/static_analysis/
Apache License 2.0

[STAL-2195] Initial implementation of intra-method taint analysis in Java #493

Closed jasonforal closed 1 month ago

jasonforal commented 2 months ago

What problem are you trying to solve?

Some significant security vulnerabilities (e.g. SQL injection) can be caught with taint analysis.

Writing accurate rules to detect these vulnerabilities is currently difficult because we can only crudely approximate the flow of variables through a program. To address this, we will build "native support" for taint analysis in Java (for now, intra-method only).

What is your solution?

[!NOTE] This is a large PR. All commits are intentionally organized and purely additive to each other. It's likely easier to review commits individually, in sequential order.

Preface

Our goal is to implement simple taint analysis in a way that builds on our current CST-based technique.

From a theoretical perspective, a CST-based analysis hits an accuracy ceiling pretty quickly because CST nodes only describe syntactic structure. More traditional approaches first construct an AST, lower that AST to an intermediate form like SSA, construct a control flow graph (CFG) from basic blocks, and use that to simulate abstract program states (essential for symbol/name resolution, type resolution, and constant propagation, among other techniques).

Doing this would require re-architecting the analyzer and revisiting some of its core assumptions, so it is out of scope for this PR.

Methodology

[!NOTE] Terminology:
Source: A place in the source code where data originates (e.g. the headers from an HTTP request).
Sink: A place in the source code where that data can end up, particularly one where it could have harmful side effects (e.g. a SQL statement).

  1. The rule author defines a tree-sitter query only for a "sink". Sink captures are sent to the JavaScript runtime and a standard analysis begins.
  2. The sink node's containing method is located.
  3. The MethodFlow class recursively traverses that containing method, iteratively constructing a directed graph that represents a backwards flow analysis. This graph has CST nodes for vertices, and assignment/dependence relationships for edges.
  4. The digraph is then traversed and all paths from the sink to any "source" (i.e. a vertex with an out-degree of 0) are collected.
  5. We expose these paths to the rule author as a TaintFlow, so that the rule can apply its own logic. This unit test shows how these APIs are used and what they expose.
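
To make steps 3 and 4 concrete, here is a minimal sketch in Python. It is not the analyzer's implementation: the real MethodFlow works on tree-sitter CST nodes inside the JavaScript runtime, whereas this sketch uses plain strings as vertices and a hand-written edge list. The function names `build_flow_graph` and `find_taint_flows`, and the Java snippet in the comments, are hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical Java method body being analyzed (illustration only):
#
#   String id = request.getParameter("id");
#   String suffix = " LIMIT 1";
#   String query = "SELECT * FROM users WHERE id = " + id + suffix;
#   statement.executeQuery(query);   // sink argument: `query`
#
# Each pair is (dependent, dependency): "query" is computed from "id", etc.
EDGES = [
    ("query", "id"),
    ("query", "suffix"),
    ("id", 'request.getParameter("id")'),
    ("suffix", '" LIMIT 1"'),
]


def build_flow_graph(dependencies):
    """Build a backwards-flow digraph where an edge u -> v means 'u depends on v'."""
    graph = defaultdict(list)
    for dependent, dependency in dependencies:
        graph[dependent].append(dependency)
        # Make sure vertices with no outgoing edges are present too.
        graph.setdefault(dependency, [])
    return dict(graph)


def find_taint_flows(graph, sink):
    """Collect every simple path from the sink to a vertex with out-degree 0."""
    flows = []

    def walk(node, path):
        successors = graph.get(node, [])
        if not successors:
            # Out-degree 0: nothing feeds this vertex, so treat it as a source.
            flows.append(path)
            return
        for successor in successors:
            # Skip vertices already on the path so self-referential
            # assignments (e.g. `x = x + y`) cannot cause infinite recursion.
            if successor not in path:
                walk(successor, path + [successor])

    walk(sink, [sink])
    return flows


graph = build_flow_graph(EDGES)
flows = find_taint_flows(graph, "query")
for flow in flows:
    print(" <- ".join(flow))
```

In this sketch, each returned path plays the role of one flow exposed to the rule. A rule could then inspect the last vertex of each path and report a violation only when that source is user-controlled (here, the `request.getParameter` call but not the string literal).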

Limitations

Some limitations are artificial and were chosen for simplicity. These can be addressed in future iterations:

Out of Scope

Features/abstractions this PR does not implement:

Alternatives considered

What the reviewer should know