Add more field names? - Githubissues

ethindp commented 2 months ago

Would it be possible for you to add more field names? I'm considering using the node-types.json to generate static code definitions of the node structure, but a lot of the rules don't have fields so the resulting data structure would just be a list of other nodes which isn't all that helpful lol. (I'm trying to build a symbol table if it helps.)

briot commented 2 months ago

Hello, I am certainly very open to the idea. Would you be able to suggest a pull request, especially if you have a tool that would kind of help you detect the missing ones. Ideally the names should match what the AARM uses, though that might not always be possible I think

ethindp commented 2 months ago

@briot Do you mean like a list of nodes that don't have fields? That can be retrieved by looking at the node-types.json file. I'd be happy to contribute (along with fixing some things like the pattern for identifiers to match the Ada standard, as well as adding parallel block statements, at least, I think I managed to add them) but I'm uncertain what to name. Do I add field() around the items in choice()? Or just in the sequences? (I ask because I'm still getting used to the tree-sitter API, so I don't know when I should and shouldn't name something.)

briot commented 2 months ago

We'll have to learn together, as I am also not that familiar with tree-sitter in practice. All I wanted was syntax highlighting in vim, which I have. I'd love it if the grammar could be used in more contexts though, so naming nodes definitely looks like it would be nice. Maybe we can look at specific cases to get a feel for the problem, and decide based upon that

ethindp commented 2 months ago

I can already see a situation where this would be useful. Well, two, actually: type declarations/definitions/constraints and expressions. Using the index works, particularly for expressions: expressions, factors, terms, and primaries usually have a left and right-hand side. However, take for example this full_type_declaration:

type Queue is limited interface;
type Node is abstract new Ada.Finalization.Limited_Controlled and Queue with record
   Previous : not null access Node'Class := Node'Unchecked_Access;
   Next     : not null access Node'Class := Node'Unchecked_Access;
end record;

Ignoring the components of the record_definition, this will cause two full_type_declarations to be generated. But here is where my questions arise (and I may just look at different ASTs to figure this out, or maybe write a small AST walker to "tree"-ify the AST so I can understand it easier). The most significant question is: in something like a full_type_declaration, which has an (optional) known_discriminant_part, does that node still get generated as an empty node in the tree, i.e., the tree will be soomething like:

type (reserved word)
Queue (identifier)
null (known_discriminant_part)
is (reserved word)
limited interface (type_definition, but this is hidden)

Or will it appear as:

type (reserved word)
Queue (identifier)
is (reserved word)
limited interface (type_definition, but this is hidden)

If the former, that makes things a lot easier, and we may not need to name anything else at all (though doing so would definitely be useful). But if it's the latter, that complicates a lot, because then I'll need to figure out how to distinguish between when the known_discriminant_part is present or not. I feel like Tree-Sitter might do the former (because this would be sensible behavior) but I'm just not sure. And S-expressions were never my strong suit, so reading the S-expression parse tree makes it difficult for me to mentally model.

ethindp commented 2 months ago

@briot Okay, so I just wrote a quick function to walk the tree and output markdown (not the prettiest but...). The indentation helps me visualize the tree in my head. I'd be happy to paste the code here if you like so you can use it yourself. :) From what I can see, the tree is a bit broken. Take for example the following Ada code (from RosettaCode):

type Queue is limited interface;
procedure Enqueue (Lounge : in out Queue; Item : in out Element) is abstract;
procedure Dequeue (Lounge : in out Queue; Item : in out Element) is abstract;

This file is not actually a valid Ada program: it begins with a full_type_declaration, which from my reading of the grammar cannot appear outside a library_unit_declaration. A compilation_unit can be a library_unit_body or a library_unit_renaming_declaration, with either the library_unit_declaration or library_unit_renaming_declaration preceded by private, but you can't just start a compilation_unit with a type_declaration if I'm understanding things properly.

I'm not sure what to do here. It looks like Tree-Sitter is mis-parsing the file: if you look at the parse tree, it looks something like:

compilation
- compilation_unit
  - full_type_declaration
    - ...

Thought I would let you know about this, I'll look into the grammar to try to figure out why it's doing this (because it shouldn't be).

ethindp commented 2 months ago

Okay, so I managed to solve the aforementioned problem. Somewhere in the grammar, change compilation_unit to the following and add the new rules:

        compilation_unit: $ => seq(
            repeat($._context_item),
            choice(
                $.library_item,
                $.subunit
            )
        ),
        library_item: $ => choice(
            seq(optional(reservedWord('private')), choice($._library_unit_declaration, $._library_unit_renaming_declaration)),
            $._library_unit_body
        ),
        _library_unit_declaration: $ => choice($.subprogram_declaration, $.package_declaration, $._generic_declaration, $.generic_instantiation),
        _library_unit_renaming_declaration: $ => choice($.package_renaming_declaration, $.generic_renaming_declaration, $.subprogram_renaming_declaration),
        _library_unit_body: $ => choice($.subprogram_body, $.package_body),
        _context_item: $ => choice($.with_clause, $.use_clause),

This fixes the problem. I would submit a PR but I formatted the file with js-beautify so the diff would essentially show every line of the grammar as being changed in someway. I can still do it though if you like. My modifications do also remove a number of unnecessary conflicts (according to tree-sitter 0.22.6). If you want my version to become the new grammar let me know and I'll open a PR.

briot commented 2 months ago

It sure would be interesting to look at the changes to the grammar (perhaps outside of the auto-formatting, for which we can look at a different PR possibly). I am slightly worried that you might have the wrong expectations here though: tree-sitter doesn't generate a full AST. On purpose, it generates a much simplified tree instead, since its main purpose is for syntax-highlighting / indentation purposes. If you want or need a full AST, you might want to start from libadalang instead. Also tree-sitter can work with invalid files (as happens as you are editing), or extracts of a file (as would happen in a text documentation including Ada code for instance).

So I am certainly interested in your changes to the grammar (as long as the tests still pass of course), but can make no promise to integrate them...

Your function to walk to the tree and output markdown is likely not relevant to this repository. It might be useful to tree-sitter itself, though possibly they already have similar tools. In our case, we mostly have the tests which each come with the output tree.

ethindp commented 2 months ago

I am slightly worried that you might have the wrong expectations here though: tree-sitter doesn't generate a full AST. On purpose, it generates a much simplified tree instead, since its main purpose is for syntax-highlighting / indentation purposes. If you want or need a full AST, you might want to start from libadalang instead.

Your not the only person to recommend I start with that. I'm hesitant to because it requires unconventional tools and methods to actually build it, and it still doesn't support Ada 2022. There sadly are not many other tools/parser generators that can handle Ada's grammar, either, without a lot of refactoring and fiddling with the grammar to make it happy. Other than libadalang (which I couldn't get to build) and this, I don't know a grammar for Ada that is able to actually parse Ada code beyond trivial code that does nothing.

The auto-formatting was something I manually did with js-beautify (I have my own copy of the grammar). I could probably get a somewhat "full" AST out of tree-sitter but you do have a point that it probably isn't designed for that purpose. (There's an Ada grammar I found that does give a full AST, but it generated a parser that was 320MB of C code, so... Yeah, there was no way I was going to use that, especially since it would take ages to compile.) This is one of those times where I wish the specification just resolved all the conflicts and stuff already.. :-(

briot / tree-sitter-ada

Add more field names? #13