Schottkyc137 / ginko

A device-tree source parser, analyzer and language server.
MIT License

Similar project #17

Open axelkar opened 2 weeks ago

axelkar commented 2 weeks ago

Hi! I've been making my own devicetree parser, analyzer and LSP in dt-tools. Maybe we could collaborate?

In the latest revision, it uses Logos for lexing (it's broken: https://github.com/maciejhirsz/logos/issues/399, https://github.com/maciejhirsz/logos/issues/265) and a custom parsing framework taking inspiration from rust-analyzer, C#'s Roslyn and cstree. I don't use cstree or rowan because I want to keep node and token kinds separate and avoid unsafe code.

I haven't worked on it in a while and development has been slow because of the difficulty of lexing+parsing DTS:

  1. DTS can contain minuses (-) in identifiers but I want to lex them separately for macros and property value expressions.
  2. I need to use a separate lexer, to keep error recovery working.

For now, I've used the following code to mix pure ident tokens and e.g. minus and comma tokens, but I'm scared it's fragile. I create "name" tokens inside the parser, between lexing and the CST (concrete syntax tree). https://github.com/axelkar/dt-tools/blob/d3a3a9cefd990777bfb385d3a2dd063595fc701e/crates/dt-parser/src/cst2/parser.rs#L442-L480
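The split itself looks roughly like this as a Logos lexer (a minimal sketch with hypothetical token kinds, not dt-tools' actual definitions; assumes logos 0.13+ where lexing yields `Result`s):

```rust
use logos::Logos;

// Hypothetical token kinds: `-` is its own token so macros and
// property-value expressions can see it, which means the parser must
// later glue `ident - ident` runs back into one name where appropriate.
#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\r\n]+")]
enum TokenKind {
    #[token("-")]
    Minus,
    #[token(",")]
    Comma,
    // Pure identifiers only; no `-` allowed here.
    #[regex(r"[A-Za-z_][A-Za-z0-9_]*")]
    Ident,
}

fn main() {
    // `node-name` lexes as Ident, Minus, Ident; whether that was one DTS
    // name or an arithmetic expression is the parser's decision.
    let kinds: Vec<_> = TokenKind::lexer("node-name").collect();
    println!("{kinds:?}"); // [Ok(Ident), Ok(Minus), Ok(Ident)]
}
```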

axelkar commented 2 weeks ago

How did you get contributors to join btw?

Our projects were made around the same time: February 8th and March 1st

CST is used so I can add IDE features (refactoring, formatting) basically for free. I just parse the code, modify the identifiers and stringify it out.

As for preprocessor macros and includes, I've tried to go the short way by just listing them with a file's parsed metadata, but I don't think it can work. Not doing that would likely require parsing files multiple times, once at each include. Avoiding parsing in passes (at every edit, starting from the root dts file and evaluating/expanding all imports and macros) or multiple times is crucial for performance.

Schottkyc137 commented 2 weeks ago

Hi! First of all, thanks a lot for your interest! I definitely think that there are lots of ways we can share some of the effort. What such a collaboration would look like, we would probably have to discuss at some point. But I would love to hear your ideas.

In the latest revision, it uses Logos for lexing (it's broken: https://github.com/maciejhirsz/logos/issues/399, https://github.com/maciejhirsz/logos/issues/265) and a custom parsing framework taking inspiration from rust-analyzer, C#'s Roslyn and cstree. I don't use cstree or rowan because I want to keep node and token kinds separate and avoid unsafe code.

In my current implementation, I use a strongly typed AST (i.e., each node is a struct that contains all information). The lexer is also custom. I have found that this is generally the easy part and I don't feel like I need a framework for it. However, I have noticed that this approach is quite limiting when it comes to actually implementing an LSP. It certainly works for a lot of use-cases, but writing visitors and formatters becomes a bit of a pain. I currently want to rewrite most of the AST and lexing using rowan, because I have found that to be very useful. I'm honestly not too scared of unsafe code as long as it's contained in a crate that is actively maintained. If the API only uses safe code, and the project is large enough, I personally believe that this is fine. The benefit is that the parser and analyser are quite fast and use a well-known framework. This can be helpful for contributors. I didn't push the branch to GitHub, but if you're interested I can upload it.

I haven't worked on it in a while and development has been slow because of the difficulty of lexing+parsing DTS:

  1. DTS can contain minuses (-) in identifiers but I want to lex them separately for macros and property value expressions.
  2. I need to use a separate lexer, to keep error recovery working.

Yea, this is an issue that I tried to solve as well. I have two approaches: one that I currently use and one in the new implementation using rowan.

  1. Currently, the lexer is stateful. This means that, for example, after a ; token (it's a bit more elaborate than that, but it explains the point), the lexer expects a node name (which can contain -, + and other weird tokens).
  2. Using the rowan approach, the lexer is not stateful, so - and + would be their own tokens. When parsing, and expecting a node name, the parser combines any adjacent tokens (in this case not ignoring whitespace and comments) into a single node name. So for example node-name would be a single name, but node - name would be three distinct AST elements. I have found this approach more robust as it nicely deals with erroneous input. For 1), I needed to insert 'pseudo tokens' (for example, if the user forgot a ';'), which will no longer be needed. I think this is similar to your approach, but I don't think it's very fragile.
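In code, the joining rule is roughly the following (a self-contained sketch with hypothetical token types, not the actual rowan-based implementation):

```rust
#[derive(Debug, Clone, PartialEq)]
enum Kind { Ident, Minus }

// Hypothetical token shape: each token records whether whitespace or a
// comment separated it from the previous token.
#[derive(Debug, Clone)]
struct Token { kind: Kind, text: String, preceded_by_trivia: bool }

/// Greedily joins an `Ident`/`Minus` run into one node name, but only
/// while the pieces touch: `node-name` joins, `node - name` does not.
fn take_name(tokens: &[Token]) -> (String, usize) {
    let mut name = String::new();
    let mut used = 0;
    for (i, tok) in tokens.iter().enumerate() {
        let joinable = matches!(tok.kind, Kind::Ident | Kind::Minus);
        // The first piece may follow trivia; later pieces must not.
        if !joinable || (i > 0 && tok.preceded_by_trivia) {
            break;
        }
        name.push_str(&tok.text);
        used += 1;
    }
    (name, used)
}

fn main() {
    let tok = |kind, text: &str, preceded_by_trivia| Token {
        kind, text: text.to_string(), preceded_by_trivia,
    };
    // node-name -> one joined name consuming all three tokens
    let touching = [tok(Kind::Ident, "node", false),
                    tok(Kind::Minus, "-", false),
                    tok(Kind::Ident, "name", false)];
    assert_eq!(take_name(&touching), ("node-name".to_string(), 3));
    // node - name -> only the first token becomes a name
    let spaced = [tok(Kind::Ident, "node", false),
                  tok(Kind::Minus, "-", true),
                  tok(Kind::Ident, "name", true)];
    assert_eq!(take_name(&spaced), ("node".to_string(), 1));
}
```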

How did you get contributors to join btw?

Honestly, I didn't do anything special. I think it helps that there is a VSCode plugin and a Mason plugin for Neovim, so people know about this project.

As for preprocessor macros and includes, I've tried to go the short way by just listing them with a file's parsed metadata, but I don't think it can work. Not doing that would likely require parsing files multiple times, once at each include. Avoiding parsing in passes (at every edit, starting from the root dts file and evaluating/expanding all imports and macros) or multiple times is crucial for performance.

Yea, for this reason I haven't implemented them so far. The C preprocessor is technically also not part of the standard device-tree syntax, so this is my excuse for now ;) But this is definitely planned and my current idea is to deal with the preprocessor on the token-stream level and cache include files. There currently is a fairly elaborate scheme to avoid parsing and analysing every dependent file when a single file has changed. I like the idea, but not the implementation, as I think the graph search algorithm could definitely be improved on.
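The general shape of such an invalidation walk, sketched with hypothetical types (not ginko's actual algorithm):

```rust
use std::collections::{HashMap, HashSet, VecDeque};
use std::path::{Path, PathBuf};

/// Given a reverse include map (file -> files that include it), collect
/// every file whose analysis is stale after `changed` was edited.
fn stale_files(
    included_by: &HashMap<PathBuf, Vec<PathBuf>>,
    changed: &Path,
) -> HashSet<PathBuf> {
    let mut stale = HashSet::new();
    let mut queue = VecDeque::from([changed.to_path_buf()]);
    while let Some(file) = queue.pop_front() {
        if stale.insert(file.clone()) {
            for dependent in included_by.get(&file).into_iter().flatten() {
                queue.push_back(dependent.clone());
            }
        }
    }
    stale
}

fn main() {
    // board.dts includes soc.dtsi, which includes clocks.dtsi:
    // editing clocks.dtsi must re-analyse all three.
    let mut included_by = HashMap::new();
    included_by.insert("soc.dtsi".into(), vec!["board.dts".into()]);
    included_by.insert("clocks.dtsi".into(), vec!["soc.dtsi".into()]);
    let stale = stale_files(&included_by, Path::new("clocks.dtsi"));
    assert_eq!(stale.len(), 3);
}
```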

axelkar commented 2 weeks ago

I didn't push the branch to GitHub, but if you're interested I can upload it.

I'm interested to take a look.

So for example node-name would be a single name, but node - name would be three distinct AST elements. I have found this approach more robust as it nicely deals with erroneous input. [..] I think this is similar to your approach, but I don't think it's very fragile.

This is exactly how I have implemented it.

But this is definitely planned and my current idea is to deal with the preprocessor on the token-stream level and cache include files

I think I'm doing something a little similar to Rust-analyzer, but here's the process for a macro invocation in a cell list (a = <1 2 3 MACRO_HERE>;)

  1. Take the macro from the AST (a helper API over the CST that allows referencing back into the CST; a "syntax tree" in r-a terminology) and resolve it (code)
  2. Substitute it, following the C preprocessor as closely as possible: collect the arguments from the invocation (the grammar splits arguments up nicely; this just stringifies them individually), then run the substitution method
  3. Reparse the substitution output and recursively analyze it (code)
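A heavily simplified sketch of steps 2 and 3 (hypothetical types; the real substitution follows the C preprocessor much more closely, including nested expansion):

```rust
use std::collections::HashMap;

/// A hypothetical macro table: name -> (parameter names, body text).
type Macros = HashMap<String, (Vec<String>, String)>;

/// Heavily simplified function-like expansion: stringify each argument,
/// then splice it wherever the parameter name appears in the body.
/// No recursion, stringification, or token pasting, unlike the real cpp.
fn expand(macros: &Macros, name: &str, args: &[&str]) -> Option<String> {
    let (params, body) = macros.get(name)?;
    let mut out = body.clone();
    for (param, arg) in params.iter().zip(args) {
        out = out.replace(param.as_str(), arg);
    }
    // The caller would now reparse `out` as DTS and analyse it recursively.
    Some(out)
}

fn main() {
    let mut macros = Macros::new();
    macros.insert(
        "GPIO".into(),
        (vec!["bank".into(), "pin".into()], "(32 * bank + pin)".into()),
    );
    // a = <1 2 3 GPIO(1, 4)>;  ->  a = <1 2 3 (32 * 1 + 4)>;
    assert_eq!(expand(&macros, "GPIO", &["1", "4"]).unwrap(), "(32 * 1 + 4)");
}
```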

There currently is a fairly elaborate scheme [..]

That's a nice algorithm, but when you import with the preprocessor, there are so many variables at each import that you have to check that none of them change anything. (Macros defined in the root dts and used in a dtsi; there is code like this in the Linux kernel.)

As for importing files multiple times, I'm thinking of using the following, but I don't think it's extensible enough for some import scenarios/trees: https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#relatedFullDocumentDiagnosticReport
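For reference, the wire shape of that report in the 3.17 spec, mirrored as hand-rolled sketch types (not the lsp-types crate; the "unchanged" report variant is elided):

```rust
// On the wire, RelatedFullDocumentDiagnosticReport *extends*
// FullDocumentDiagnosticReport, so all of these fields sit in one
// JSON object.
struct RelatedFullDocumentDiagnosticReport {
    kind: String,              // always "full"
    result_id: Option<String>, // `resultId` in JSON
    items: Vec<Diagnostic>,    // diagnostics for the pulled file itself
    // `relatedDocuments` in JSON (a uri -> report map): diagnostics the
    // pull produced for *other* files, e.g. a dtsi included from the
    // pulled root dts.
    related_documents: Vec<(String, FullDocumentDiagnosticReport)>,
}

struct FullDocumentDiagnosticReport {
    kind: String,
    result_id: Option<String>,
    items: Vec<Diagnostic>,
}

struct Diagnostic; // elided: range, severity, message, ...
```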

Talking about trees, a tree visualizer using Graphviz for what imports have been merged would be cool.

Have you tried Salsa? I tried to use it to manage all incremental compilation and analysis data, but it had some flaws a couple of months ago. When you test it, make sure to use the Git version, as the crates.io version hasn't been updated in years.

Schottkyc137 commented 2 weeks ago

I'm interested to take a look.

https://github.com/Schottkyc137/ginko/tree/rowan

This is exactly how I have implemented it.

Nice, good to know I'm following the best ;) Can I ask you why you think that this is fragile? This seems like a fairly robust solution. I think that the "finicky" parts are just a weakness of the device-tree syntax when used in tools.

I think I'm doing something a little similar to Rust-analyzer, but here's the process for a macro invocation in a cell list (a = <1 2 3 MACRO_HERE>;)

Ah, so this is part of the analysis stage? I always thought it had to be done in some earlier stage because macros can just be anything, i.e.,

```dts
#define MACRO (1 +

/ {
    some-node = <MACRO 2)>;
};
```

is correct syntax as far as I know.

That's a nice algorithm, but when you import with the preprocessor, there are so many variables at each import that you have to check that none of them change anything. (Macros defined in the root dts and used in a dtsi; there is code like this in the Linux kernel.)

Yea, I'm also afraid it won't scale very well

Talking about trees, a tree visualizer using Graphviz for what imports have been merged would be cool.

I guess that should be fairly easy to implement. If I understand you correctly, you mean emitting a dot file containing all files and their dependencies?
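Something like the following would do it (a sketch; the input shape is made up):

```rust
use std::fmt::Write;

/// Renders an include graph as Graphviz dot: one node per file, one edge
/// per include that was merged in. The edge-list input is hypothetical.
fn to_dot(edges: &[(&str, &str)]) -> String {
    let mut out = String::from("digraph includes {\n");
    for (from, to) in edges {
        writeln!(out, "  \"{from}\" -> \"{to}\";").unwrap();
    }
    out.push_str("}\n");
    out
}

fn main() {
    let dot = to_dot(&[("board.dts", "soc.dtsi"), ("soc.dtsi", "clocks.dtsi")]);
    print!("{dot}"); // pipe into `dot -Tsvg` to render
}
```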

Have you tried Salsa? I tried to use it to manage all incremental compilation and analysis data, but it had some flaws a couple of months ago. When you test it, make sure to use the Git version, as the crates.io version hasn't been updated in years.

Have not tried it yet, but I'll definitely check it out. I have seen that it is used in rust-analyzer, so it'll probably be worth the effort.

Did you think of anything concrete we could collaborate on? The most extreme option would be merging our two projects into one (or dropping one project in favour of the other), but that seems a bit drastic. Did you have anything else in mind?

axelkar commented 2 weeks ago

Nice, good to know I'm following the best ;)

Thanks!

Can I ask you why you think that this is fragile? This seems like a fairly robust solution. I think that the "finicky" parts are just a weakness of the device-tree syntax when used in tools.

Actually, I wouldn't classify it as fragile; never mind that. It's more that it's not used in any other project. Creating dynamically joined tokens is also AFAIK rare (rustc has joined tokens, but their components are statically known).

Afterthought: Part of the impression of fragility comes from the nature of the DTS format. If you accidentally add whitespace (e.g. node -name), it could error about a minus in place of an equal sign. I might have code to recover from this, though.

Ah, so this is part of the analysis stage? I always thought it had to be done in some earlier stage because macros can just be anything [..]

It's because I don't plan to support odd macros like that. Nobody should use code like that in the wild for devicetree. It's a whole lot easier to deal with code ranges when you don't have to check whether every piece of syntax came from a macro.

Macros can only go in place of identifiers (nodes, properties, labels and extensions), values and cells. Macro names cannot contain special characters including operators, commas and the equal sign.

Afterthought: It would also make IDE features extremely hard to accomplish.

If I understand you correctly, you mean emitting a dot file containing all files and their dependencies?

Yes. It'd be interesting to see which includes can be merged during analysis.

Did you think of anything concrete we could collaborate on? The most extreme option would be merging our two projects into one (or dropping one project in favour of the other), but that seems a bit drastic. Did you have anything else in mind?

I think merging our projects would be best. Working towards the same goal separately would be inefficient. Not to be rude, but I think that my project has made more progress, and as such I'm proposing you move development to dt-tools. The name could use a little bit of work.

Considering that you are the maintainer of VHDL_LS, I could at least use some help with structuring the data and the language server.

Rant about a new DTS format

I've been thinking of making a new DTS format that is more modern, with native conditionals, imports/modules and configuration. Maybe something functional like Nix? I don't think it will ever get adopted, and it'd need lots of design work.

Afterthought: A language like Nix would work really well for passing configuration to imports (example "the old way": argument, usage) (sorry if this looks really complicated)

```nix
# board config file
let
  pm7250b_config = import ./pm7250b.nix { sid = 1; };
in {
  "soc@0"."display-subsystem@ae00000"."spmi@c440000" = pm7250b_config.spmi_bus;
}
```

```nix
# pm7250b.nix
{ sid }:
let spmi_usid = 0; in
{
  spmi_bus."pmic@${toString sid}" = {
    compatible = ["qcom,pm7250b" "qcom,spmi-pmic"];
    reg = [sid spmi_usid];
  };
}
```

This actually looks really good but it will never be adopted. It's also not native enough for generating from DTB. Another problem is that Nix is missing much of DTS's syntax like /bits/.

Nix is a domain-specific, purely functional, lazily evaluated, dynamically typed programming language. ref

Schottkyc137 commented 2 weeks ago

Afterthought: Part of the impression of fragility comes from the nature of the DTS format. If you accidentally add whitespace (e.g. node -name), it could error about a minus in place of an equal sign. I might have code to recover from this, though.

Yea, a similar problem comes along with the reference syntax (i.e., &reference is a reference whereas & reference is likely two illegal tokens). I 'solved' this issue by just having a lookahead that is large enough. Since node names are always followed by a { (and property names by a = or ;), this could be a viable solution.

It's because I don't plan to support odd macros like that. Nobody should use code like that in the wild for devicetree. It's a whole lot easier to deal with code ranges when you don't have to check whether every piece of syntax came from a macro.

I will say that I disagree with this statement. While I agree with the style guide, I think that, as the designer of a language tool, one should generally not make any assumptions about the style one uses. If it's legal, it should parse. Disallowing the "weird" parts is the job of a linter. But I agree that this is a good solution that covers 99% of all cases.

I think merging our projects would be best. Working towards the same goal separately would be inefficient. Not to be rude, but I think that my project has made more progress, and as such I'm proposing you move development to dt-tools. The name could use a little bit of work.

I am not unsympathetic to this idea. I can see that there is a lot of stuff in your project that is still lacking in mine, and I am happy to work on whichever project can aid people more when struggling with embedded development. But the argument goes both ways. As far as I can tell, you do not have anything published on crates.io and using the language server requires installing the rust toolchain and manually pulling the repository from git. ginko and ginko_ls, on the other hand, have pre-built binaries for most operating systems, and usage usually requires no more than installing a plugin in your favourite editor. Additionally, while I try not to cling too hard to my project, and I think that FOSS development efforts, especially in embedded / close-to-hardware topics, should be more centralised as there is a lack of people, I have spent a lot of time on this project and would hate to just see that go.

Maybe there is a third way though. As I see it, your project is very good at efficiently lexing and parsing device-trees. My project, in contrast, shines in deploying the project and implementing the language server protocol. Maybe ginko and ginko_ls could stay as they are, but instead depend on and use your project(s) as backend? Probably, one would want to move the lsp implementation fully over to ginko and the parser + lexer implementation fully over to dt-tools (or similarly, open for any discussions here). Naturally, my main contribution efforts would move to dt-tools as this is where most of the heavy lifting is done. What do you think about that solution?

Rant about a new DTS format

Have you seen Pkl? Looks like a promising candidate to me as well. Unfortunately, I haven't had the chance to look at Nix, but the syntax looks quite nice. Does it have some builtin / easy to implement way to do validation of nodes?

axelkar commented 2 weeks ago

I will say that I disagree with this statement.

Does clangd support macros with partial syntax like that either?

When you can have macros everywhere, it's basically impossible to do refactoring using the CST. The CST should always stringify to a single file's source code byte-by-byte. There should be no macro expansion before the CST.
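That invariant is easy to pin down as a test; here is a self-contained sketch where a placeholder parser stands in for the real API:

```rust
// A property every lossless CST should satisfy: parse-then-stringify
// returns the input byte-for-byte, even for broken input. `Cst` is a
// trivial placeholder so the sketch compiles; a real tree reproduces
// its text by concatenating token texts, including error tokens.
struct Cst { text: String }

fn parse(src: &str) -> Cst {
    // Placeholder: a real parser builds a tree whose concatenated
    // token texts equal `src`.
    Cst { text: src.to_string() }
}

impl std::fmt::Display for Cst {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.write_str(&self.text)
    }
}

fn main() {
    for src in ["/dts-v1/;\n/ { a = <1>; };\n", "/ { broken = "] {
        assert_eq!(parse(src).to_string(), src);
    }
}
```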

But you're right in that tooling should support cases like that. Maybe there could be a slower single-pass mode for complex/weird cases, but it's something to come later.

As far as I can tell, you do not have anything published on crates.io and using the language server requires installing the rust toolchain and manually pulling the repository from git.

I haven't made any releases of dt-tools yet. It's still at pre-0.1.0. I should probably make a release to get users.

Probably, one would want to move the lsp implementation fully over to ginko and the parser + lexer implementation fully over to dt-tools (or similarly, open for any discussions here). Naturally, my main contribution efforts would move to dt-tools as this is where most of the heavy lifting is done. What do you think about that solution?

I think fragmentation like this would make development harder due to the need to version and export some of the analysis code to multiple code-analysis hosts. Rustc and rust-analyzer don't have stable implementation details or versioning in their internal crates. I'm not sure how crates.io would cope with that, though.

On the other hand, providing libraries (with versioning) could increase adoption because projects could share more code.

For CLI tooling I'd like to reuse some of the analysis, especially if it can be cached to a temporary directory, and it should work for the LSP too.

Maybe sharing analysis between the CLI and LSP is bad though. Maybe it's better to do analysis and import management in a single pass in the CLI?

I'm open to ideas. Thanks if you're interested!

axelkar commented 2 weeks ago

Have you seen Pkl? Looks like a promising candidate to me as well.

Forgot to reply to these. Yes, I've heard of it before.

Does it have some builtin / easy to implement way to do validation of nodes?

No. I completely overlooked that part. Similar languages include Dhall, Nickel, KCL and Jsonnet. Just like DTS, all of the languages above (except I'm not sure about KCL) can be converted to JSON and validated against JSON schemas.

Here are some requirements for a DTS killer:

  1. Functional programming is very important for composability
  2. Files shouldn't ever have any preprocessing, to ease parsing, error handling, refactoring, etc.
  3. As a replacement for preprocessor macros, the language should have support for variables, conditionals and loops.
  4. It should be easy to import other files with arguments. Importing multiple times with different arguments should be possible without any macro trickery.
  5. Could maybe have a built-in to import "legacy" DTS/DTSI files and C header files containing only macros.
  6. References and phandles are required
    • Reference in cell turns into phandle: interrupt-parent = < &mpic >;
    • Path in cell turns into phandle: interrupt-parent = < &{/soc/interrupt-controller@40000} >;
    • Reference outside cell turns into path: ethernet0 = &EMAC0;
    • Path outside cell turns into path
    • I can't even think of a way to make this work with functions
  7. To produce bit-by-bit DTB recreations, node children need to be kept in order. Nix always sorts attribute sets (also known as objects in JSON or nodes in DTS) alphanumerically during evaluation; DTB doesn't.
  8. /bits/ support
  9. Maybe differentiate between cells and values. DTB only cares about binary values
  10. OK, a really tough one: a node and a property can have the same name:

    $ echo '/dts-v1/; / { a = <1>; a {}; };' | dtc -O dtb - | dtc -I dtb -O dts
    <stdin>:1.26-29: Warning (node_name_vs_property_name): /a: node name and property name conflict
    <stdout>: Warning (node_name_vs_property_name): /a: node name and property name conflict
    /dts-v1/;
    
    / {
        a = <0x01>;
    
        a {
        };
    };
  11. Maybe /delete-node/ and /delete-property/ support. If the language supports null values, you can just overwrite with null. Null should be kept separate from an empty property, which usually signals true: property-name;
  12. /memreserve/ should be supported in a clean way. Could just be extra metadata in the root function/attrset
  13. Devicetree overlay support?

Schottkyc137 commented 1 week ago

Just to conclude the conversation: I do not plan to fully move my development efforts to any other project because 1) I have put significant effort into ginko, and maybe it's pride, but I just don't want to let all of that go, and 2) ginko is somewhat popular, so this would also mean abandoning the plugins in all editors, effectively making the tool deprecated a couple of months after it came out.

This doesn't mean that I am against a collaboration of sorts; I'm simply not in favour of abandoning this project fully for now. I might change my opinion in the future though; your project looks very promising.