When a tree-sitter query is run, it generates "match"es, which are a collection of "captures" (which are nodes tagged by a string name). Because some rules may want to implement a majority of their logic within JavaScript (as opposed to via the tree-sitter query itself), we need to be able to handle a large number of matches and captures performantly.
What is your solution?
Unlike all of our other JavaScript abstractions (which are classes with constructors), a "NamedCapture" is a plain object with only primitive values (no functions, getters, etc.). This means that we can (ergonomically) create these objects performantly from a v8 ObjectTemplate.
For this reason, we don't define the class in JavaScript. Rather, we just use capture.js to write JSDoc type annotations.
(Note: While it's true that we could also use an ObjectTemplate for our JavaScript classes, it would make the code much less maintainable, because it is much easier to change the JavaScript representation in JavaScript instead of via the v8 API--especially when it comes to functions and non-trivial objects).
Single vs Multi Capture
The most common capture will be a SingleCapture. This is generated by a query like
@varName would still be a SingleCapture. However, @numVal would be a MultiCapture because the capture name is re-used across multiple query capture nodes.
SingleCapture contains a single v8 smi as the nodeId.
MultiCapture contains multiple v8 smis as the nodeIds.
Why a Uint32Array?
When there is a MultiCapture, we represent this as a Uint32Array, not a standard array ([]). The reason for this is performance. While both would end up storing uint32s, there is no way to pre-allocate a JavaScript array without having it be treated as "holey", which triggers v8 de-optimizations. For example, if we used the v8 equivalent of const captureIds = new Array(2):
...it would still be a holey array because it is sparse upon initialization.
We could use an empty generic array, and then just do the equivalent of [].push(1, 2, 3, /* ... */), but then we risk triggering additional allocations and memcpy if the array needs to be resized.
Instead, by using a Uint32Array, we can preallocate exactly the amount of space needed, using just one allocation, and have the exact same indexing/iterable interface that a generic array provides.
Elsewhere, ddsa uses integers for performing. So why not here? Because we are creating these v8 objects entirely from the v8 API, we can guarantee the use of a kInternalized string, which does not allocate. This simplifies our implementation, because even though we know all of a rule's capture names ahead of time, we don't have to deal with parsing them and creating a map from id to string.
Alternatives considered
What the reviewer should know
This is purely the (isolated) JavaScript implementation of a "capture"
This doesn't implement the "match", as described in the description intro. That will be in a future PR.
The translation between the tree-sitter query Rust bindings to this JavaScript object will be implemented in a future PR that switches the use of stella over to ddsa.
What problem are you trying to solve?
When a tree-sitter query is run, it generates "match"es, which are a collection of "captures" (which are nodes tagged by a string name). Because some rules may want to implement a majority of their logic within JavaScript (as opposed to via the tree-sitter query itself), we need to be able to handle a large number of matches and captures performantly.
What is your solution?
Unlike all of our other JavaScript abstractions (which are classes with constructors), a "NamedCapture" is a plain object with only primitive values (no functions, getters, etc.). This means that we can (ergonomically) create these objects performantly from a v8 ObjectTemplate.
For this reason, we don't define the class in JavaScript. Rather, we just use capture.js to write JSDoc type annotations.
(Note: While it's true that we could also use an ObjectTemplate for our JavaScript classes, it would make the code much less maintainable, because it is much easier to change the JavaScript representation in JavaScript instead of via the v8 API--especially when it comes to functions and non-trivial objects).
Single vs Multi Capture
The most common capture will be a
SingleCapture
. This is generated by a query likeIn this case,
@varName
would be aSingleCapture
, and@numVal
would also be aSingleCapture
.However, some queries can be written like this:
@varName
would still be aSingleCapture
. However,@numVal
would be aMultiCapture
because the capture name is re-used across multiple query capture nodes.SingleCapture
contains a single v8 smi as thenodeId
.MultiCapture
contains multiple v8 smis as thenodeIds
.Why a Uint32Array?
When there is a
MultiCapture
, we represent this as aUint32Array
, not a standard array ([]
). The reason for this is performance. While both would end up storinguint32
s, there is no way to pre-allocate a JavaScript array without having it be treated as "holey", which triggers v8 de-optimizations. For example, if we used the v8 equivalent ofconst captureIds = new Array(2)
:...it would still be a holey array because it is sparse upon initialization.
We could use an empty generic array, and then just do the equivalent of
[].push(1, 2, 3, /* ... */)
, but then we risk triggering additional allocations and memcpy if the array needs to be resized.Instead, by using a
Uint32Array
, we can preallocate exactly the amount of space needed, using just one allocation, and have the exact same indexing/iterable interface that a generic array provides.Why a string name, not an integer id?
We use a string name for captures.
Elsewhere, ddsa uses integers for performing. So why not here? Because we are creating these v8 objects entirely from the v8 API, we can guarantee the use of a kInternalized string, which does not allocate. This simplifies our implementation, because even though we know all of a rule's capture names ahead of time, we don't have to deal with parsing them and creating a map from id to string.
Alternatives considered
What the reviewer should know