DataDog / datadog-static-analyzer

Datadog Static Analyzer
https://docs.datadoghq.com/static_analysis/
Apache License 2.0
100 stars 12 forks source link

[STAL-1960] Implement ddsa NamedCapture #391

Closed jasonforal closed 4 months ago

jasonforal commented 4 months ago

What problem are you trying to solve?

When a tree-sitter query is run, it generates "match"es, which are a collection of "captures" (which are nodes tagged by a string name). Because some rules may want to implement a majority of their logic within JavaScript (as opposed to via the tree-sitter query itself), we need to be able to handle a large number of matches and captures performantly.

What is your solution?

Unlike all of our other JavaScript abstractions (which are classes with constructors), a "NamedCapture" is a plain object with only primitive values (no functions, getters, etc.). This means that we can (ergonomically) create these objects performantly from a v8 ObjectTemplate.

For this reason, we don't define the class in JavaScript. Rather, we just use capture.js to write JSDoc type annotations.

(Note: While it's true that we could also use an ObjectTemplate for our JavaScript classes, it would make the code much less maintainable, because it is much easier to change the JavaScript representation in JavaScript instead of via the v8 API--especially when it comes to functions and non-trivial objects).

Single vs Multi Capture

The most common capture will be a SingleCapture. This is generated by a query like

(variable_declarator
    name: (identifier) @varName
    value: (number) @numVal
)

In this case, @varName would be a SingleCapture, and @numVal would also be a SingleCapture.

However, some queries can be written like this:

(variable_declarator
    name: (identifier) @varName
    value: (array
         (number) @numVal
         (number) @numVal
         (number) @numVal
    )

@varName would still be a SingleCapture. However, @numVal would be a MultiCapture because the capture name is re-used across multiple query capture nodes.

SingleCapture contains a single v8 smi as the nodeId. MultiCapture contains multiple v8 smis as the nodeIds.

Why a Uint32Array?

When there is a MultiCapture, we represent this as a Uint32Array, not a standard array ([]). The reason for this is performance. While both would end up storing uint32s, there is no way to pre-allocate a JavaScript array without having it be treated as "holey", which triggers v8 de-optimizations. For example, if we used the v8 equivalent of const captureIds = new Array(2):

d8> %DebugPrint(new Array(2))
DebugPrint: 0x1ab00047f7d: [JSArray]
 - map: 0x01ab0028ccdd <Map[16](HOLEY_SMI_ELEMENTS)> [FastProperties]

...it would still be a holey array because it is sparse upon initialization.

We could use an empty generic array, and then just do the equivalent of [].push(1, 2, 3, /* ... */), but then we risk triggering additional allocations and memcpy if the array needs to be resized.

Instead, by using a Uint32Array, we can preallocate exactly the amount of space needed, using just one allocation, and have the exact same indexing/iterable interface that a generic array provides.

Why a string name, not an integer id?

We use a string name for captures.

Elsewhere, ddsa uses integers for performing. So why not here? Because we are creating these v8 objects entirely from the v8 API, we can guarantee the use of a kInternalized string, which does not allocate. This simplifies our implementation, because even though we know all of a rule's capture names ahead of time, we don't have to deal with parsing them and creating a map from id to string.

Alternatives considered

What the reviewer should know