Tree data structures - Githubissues

c42f commented 1 year ago

I wanted to collect some thoughts on tree data structures here. We've been discussing this over at https://github.com/julia-vscode/JuliaWorkspaces.jl/issues/7 but I'd like to have a link to that discussion here. And also a persistent central place to discuss the JuliaSyntax trees.

Trees we have

Green trees

Currently, we've got a GreenNode inspired by Roslyn and rust-analyzer. This acts as a raw syntax tree with subtrees which are "reusable" in the sense that they can be spliced into another parent green tree. The main benefit of this reusability would be if we implemented incremental parsing.

As discussed in https://github.com/julia-vscode/JuliaWorkspaces.jl/issues/7#issuecomment-1483743325, GreenNode being immutable means that the GC pressure of materializing a full green tree is quite low because the leaves are stored inline in the child array of their parent nodes.

Some numbers for base/abstractarray.jl which is a moderately large (~100 kb / 3000 line) Julia file:

There are 26534 tokens and 32731 nodes but only 6299 allocations in parsing this file to a green tree. In some sense, we are already winning by a factor of 5x here in terms of GC pressure by having GreenNode be immutable.
81% are leaf nodes which are never separately allocated
8% of nodes have only leaves as children. 73 % of these can be reused leading to only 4466 allocations. (But the lookup to do the reuse is slower than just allocating in this case and the win here doesn't seem amazing.)
For perspective, parsing without tree construction via JuliaSyntax.parse!() costs only 90 allocations.

using JuliaSyntax, BenchmarkTools
f = read(joinpath(Sys.BINDIR, Base.DATAROOTDIR, "julia", "base", "abstractarray.jl"), String)
# Parse to GreenNode
@benchmark JuliaSyntax.parseall(JuliaSyntax.GreenNode, f)
# Raw parse to ParseStream array data structure
@benchmark JuliaSyntax.parse!(JuliaSyntax.ParseStream(f))

Convenience interface for syntax trivia

GreenNode isn't very convenient to use. This is a big hole in the API right now, as there's no other way to access the syntax trivia. See also https://github.com/JuliaLang/JuliaSyntax.jl/issues/2

We could

Create an iterator interface which makes GreenNode more convenient to use
Or, maybe, enhance SyntaxNode with an easier way to access trivia children

Abstract Syntax - `SyntaxNode`

We've currently got SyntaxNode for this - it's somewhat like Expr in the sense that

It contains abstract syntax - syntax trivia aren't accessible in a simple way
It contains eagerly materialized Julia values in the leaves of the tree.
It is mutable

In contrast to Expr, it has the same strict child source ordering and heads as GreenNode.

SyntaxNode also has a parent field, but I'm not sure this is worthwhile! A good iterator/cursor interface might be more efficient and convenient?

Typed Syntax - `TypedSyntaxNode`

The TypedSyntax package https://github.com/JuliaDebug/Cthulhu.jl/tree/master/TypedSyntax contains an analog to SyntaxNode but with a bunch of type annotations and other useful type-related data in the TreeData. See

https://github.com/JuliaDebug/Cthulhu.jl/blob/master/TypedSyntax/src/node.jl

Generalizing tree attributes - trees we want

What's the right way for analysis passes to attach annotations to a tree? One way to do this is by putting them in the tree data structure itself - this is what the parameterized TreeNode{TreeData} achieves (used by SyntaxNode and TypedSyntaxNode). But this makes it hard to compose different types of passes together without having TreeData become fat and unwieldy, or having different passes know about each other.

A way around this is to store only an id per tree node and to store attributes in separate arrays or dictionaries indexed by this id. For sparse attributes an entity component system (ECS) setup Overseer.jl is excellent: attributes are the components, compiler passes are the systems, and tree nodes are the entities. @benchung has been trying this approach as noted in https://github.com/julia-vscode/JuliaWorkspaces.jl/issues/7#issuecomment-1483719069

For non-sparse attributes a set of parallel arrays is simpler though: think of a DataFrame or TypedTable where the rows are indexed by node ID and the columns are attributes. If we looked into the Julia graphs ecosystem I'd guess they store their attributes this way.

Tree API: traversal and child access

Currently, one traverses a tree with a "big pile of if statements" based on kind(node) and children(node) - essentially the Expr model of working with trees.

However, this is pretty cumbersome

Users need to know how Julia source maps to node kinds
Users need to know what the ordering of children means. User code contains things like node[3] - but what does that 3 mean? It's different in each different situation.

This situation is a mess!

Users need to remember a lot of fiddly detail. Expression manipulation code is hard to read with all the integer literals.
There's no separation between representation and API: any change to the tree data layout is a breaking change.

I think the right way to fix this is pattern matching: have a @match API which expands to the big pile of if statements based on kind which the user would otherwise have to write by hand. This the approach that rust-analyzer uses (admittedly Rust has language-native pattern matching but we can do it with a macro). It's also roughly the approach that Ben is taking in SemanticAST.jl, though that version still requires knowing the meaning of a given child ordering.

inkydragon commented 1 year ago

have a @match API

Available packages

https://github.com/thautwarm/MLStyle.jl, last update 2023-02 doc: https://thautwarm.github.io/MLStyle.jl/latest/ deps: none
https://github.com/kmsquire/Match.jl, last update 2022-11 doc: https://kmsquire.github.io/Match.jl/latest/ deps: none
https://github.com/FluxML/MacroTools.jl, last update 2022-10 doc: https://fluxml.ai/MacroTools.jl/stable/pattern-matching/ deps: Markdown, Random
https://github.com/RelationalAI-oss/Rematch.jl, last update 2022-07 doc: https://github.com/RelationalAI-oss/Rematch.jl#usage deps: MacroTools.jl

Related issues

Related Discussions

https://discourse.julialang.org/t/4-major-problems-of-pattern-matching-in-julia/34392

Perhaps we could develop a new pattern matching package/module on top of the information provided by JuliaSyntax.jl. This might be a good start to adding pattern matching syntax to julia.

It's also roughly the approach that Ben is taking in SemanticAST.jl

And SemanticAST.jl use MLStyle.jl

c42f commented 1 year ago

MLStyle is very comprehensive and I know SemanticAST is using that. MacroTools is great for simple matching, but I personally never liked having the dependency and tend to write the "big pile of if statements" by hand...

Personally I'd be satisfied with a really good special purpose tool for matching expressions. But if that eventually grew into something more general that's cool too.

BenChung commented 1 year ago

The single biggest issue I've had with MLStyle is that the documentation on how to customize it isn't great. It's there, just not particularly apparent. I still don't entirely get how decons works. If there was some pre-rolled SyntaxNode matcher that would address the problem. Other than that the experience has been pretty good.

I'm not sure what a more generic pattern matching system would look like, particularly if it's agnostic to order. Systems that let the user write a surface level syntax pattern and then compile it to a matcher are cool but beget issues understanding when they may or may not match some expression. As an example, I recently ran into a big annoyance with

foo(::T)::T where T<:Int

which parses as (:: (call foo (:: T)) (where T (<: T nothing Int)) (IIRC). This then means that if you match this against the naive term name_(args__)::retty_ the return type you'll get back is wrong because it includes the where when it should really just be the type variable.

As a result of this without something more like SemanticAST that tries to produce a single AST node for each concept in the language I'm not sure if there's a great way to abstract over what the precise Expr tree looks like. The user needs to spend a frustrating amount of time reasoning about "what is the head that shows up here" exactly with Exprs and directly abstracting over that in a pattern matcher is I suspect not possible due to contextually-dependent definitions of what a head might mean. One alternative might be to supplant the Expr/SyntaxNode structure entirely with GreenNodes, but I'm not sure if that actually makes the problem easier?

This was the problem that ended up scuppering my plans for a more Expr-targeted AST matcher that had automatic completeness checking: I couldn't define a coherent grammar for Expr/SemanticAST because it's natively all over the place and context-dependent.

c42f commented 1 year ago

Maybe foo(::T)::T where T<:Int was problematic because the precedence of :: and where is inconsistent in function definitions vs outside them?

julia> JuliaSyntax.parse(JuliaSyntax.SyntaxNode, "function f()::T where T end")
line:col│ tree                                   │ file_name
   1:1  │[function]
   1:10 │  [where]
   1:10 │    [::-i]
   1:10 │      [call]
   1:10 │        f
   1:15 │      T
   1:23 │    T
   1:24 │  [block]

julia> JuliaSyntax.parse(JuliaSyntax.SyntaxNode, "f()::T where T")
line:col│ tree                                   │ file_name
   1:1  │[::-i]
   1:1  │  [call]
   1:1  │    f
   1:6  │  [where]
   1:6  │    T
   1:14 │    T

I assumed it was intended because it has a special case in the reference parser, but it's an oddity for sure.

BenChung commented 1 year ago

That's absolutely the reason for that specific case; my experience is that with function names in particular there's just an endless list of weird oddities you need to find out about through frustration that don't always have obvious mappings into Expr-land.

In my opinion, what macro developers (who I presume are the target audience for an externally-facing tool) generally want is either:

Julia semantic matching ("I need a function definition, I don't care exactly what it looks like"). Wanting to do this is a big reason why people use MacroTools in my opinion.
Low level matching ("I want to write what is essentially my own language")

Right now Julia's Exprs are sort of somewhere in between those two. They are fairly closely aligned with the source syntax and require that the input be... roughly... valid Julia, but they also have all sorts of weird and wonderful forms for the same semantic thing making it annoying to reason about "my input should be valid Julia."

c42f commented 1 year ago

In my opinion, what macro developers (who I presume are the target audience for an externally-facing tool) generally want is either [...]

Good summary! I'd been assuming that the "Julia semantic matching" option would be something like giving macros an API for the syntax desugaring part of lowering. Which is what SemanticAST does I guess :-)

I think there's two audiences:

Tooling authors (language server etc), who can use any new JuliaSyntax-related libraries immediately
Macro writers, who could only use the new libraries in a new macro system (yikes!) Unless there's some quite tricky Expr compat layer.

Right now Julia's Exprs are sort of somewhere in between those two

Yes. I've been trying to push them a little toward more alignment with the source syntax (see #88). But I assume they'll always require that the input be roughly valid Julia, otherwise there's not a lot of meaning in having a parser at all.

(In principle macros can work on tokenized not parsed input like they do in Rust, but that's is a huge change I don't see happening. String macros are the escape hatch to handle the case of "my DSL looks nothing like Julia".)

BenChung commented 1 year ago

One idea is that a pattern matcher could go backwards from SemanticAST-like semantic concepts (like "a function definition") backwards to the full set of constructs that could have been used to create it. It could even round-trip, so for example we could take a match against a function declaration, extract the semantic concept ("a function declaration") and then generalize it to all the forms that could have been used to create that.

The idea is you could write a pattern like

FunctionDefinition(foo, [_..., Call(:bar, [args...], _), _...], _)

and then get all function definitions named foo that call bar in one of the positional argument positions without having to worry about if it has kwargs or type arguments or whatever. Then, that could be parsed from

function foo(_..., bar(args...), _...) _ end

and then converted back into the original matching form. This way you can write patterns in terms of the semantic meaning and not the precise syntactic forms.

This approach then makes the patterns independent of the Expr order because the ordering is now semantically defined rather than defined as part of Expr/SyntaxNode itself. The main disadvantage is that there might be some challenges escaping back into the underlying SyntaxNode structure?

o314 commented 1 year ago

The list of matchers should also include

c42f commented 1 year ago

Sure! (Note, IIUC, match.scm isn't used for much if anything within the Julia codebase.)

c42f commented 1 year ago

Just a small note - I've noticed that match.scm is used extensively in Base in macroexpand.scm. (Though it's been argued elsewhere that code is not well-placed, and should be mostly removed and the remainder folded into the lowering resolve-scopes pass)

JuliaLang / JuliaSyntax.jl

Tree data structures #230

Trees we have

Green trees

Convenience interface for syntax trivia

Abstract Syntax - `SyntaxNode`

Typed Syntax - `TypedSyntaxNode`

Generalizing tree attributes - trees we want

Tree API: traversal and child access

JuliaLang / JuliaSyntax.jl

Tree data structures #230

Trees we have

Green trees

Convenience interface for syntax trivia

Abstract Syntax - SyntaxNode

Typed Syntax - TypedSyntaxNode

Generalizing tree attributes - trees we want

Tree API: traversal and child access

Abstract Syntax - `SyntaxNode`

Typed Syntax - `TypedSyntaxNode`