Output AST / AST + errors found

pherrymason commented 3 months ago

Context: https://discord.com/channels/650345951868747808/830497607998242884/1253238870212874260

It would be nice if the LSP could report the errors present in a file. There are various kinds of errors: syntax errors and (compilation?) errors.

The parser I use doesn't give me direct details about which syntax errors are present, much less about other kinds of errors that would arise during compilation.

How hard would it be for c3c to run in a "diagnose" mode whose output could be used by the LSP? The idea I had in mind was:

The LSP takes a file (or the whole project?) and runs c3c (or any other tool) on it. The LSP then takes the structured errors reported by c3c and builds diagnostics that are sent to the editor.

Running this diagnosis file by file limits which errors can be found, but it may be a start. I'm also concerned about how efficient all this would be. I certainly don't want to be "recompiling" the whole project on every keystroke (the LSP could apply some optimizations to delay that). I don't know which parts of the compilation are more or less expensive, which is why I was thinking of a special mode for c3c that only cares about collecting those diagnostics.

Also, it might make more sense for this to be a separate application so as not to bloat c3c. How hard would it be to reuse some parts of c3c outside of it? ( ... )

So what c3c could provide is:

For files that do not parse: The error location and message of the first parse error of a file + the AST parsed so far for that file

For files that do parse: The AST parsed for that file

If parsing succeeds:

The AST for all files

The semantic errors found

We need to define the format of these two possible outputs.

Output AST of source

Currently, the -P argument returns only declared symbols (if I'm not mistaken) and lacks some information needed to cross-reference its output with the original source code.

Points of improvement:

The full AST representation could be very useful.
Line and character positions (and ranges) inside the source code. This info should be attached to every meaningful node. For example, for the source int a = call(arg1,arg2,arg3); every symbol should have its position range attached:
- int
- a
- call
- arg1
- arg2
- arg3
The ability to retrieve the AST of a single file, so that an error in a different file does not prevent obtaining a valid one.

Possible example of an AST:

  module foo;
  int a = call(arg1,arg2,arg3);

  {
    "module": {
        "name": "foo",
        "doc_range": {
            "start": [0,0],
            "end": [1, 31]
        },
        "statements": [
            {
                "node_type": "var_declaration",
                "names": [
                    {
                        "node_type": "identifier",
                        "identifier": "a",
                        "path": "foo",
                        "doc_range": {
                            "start": [1,6],
                            "end": [1, 7]
                        }
                    }
                ],
                "type": {
                    "identifier": {
                        "node_type": "identifier",
                        "identifier": "int",
                        "path": null,
                        "doc_range": {
                            "start": [1,2],
                            "end": [1, 5]
                        }
                    },
                    ...
                },
                "initialized": {
                    "node_type": "call_invocation",
                    "identifier": {
                        "node_type": "identifier",
                        "identifier": "call",
                        ...
                    },
                    "args": [
                        ...
                    ]
                }
            }
        ]
    }
  }
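To illustrate how an LSP might consume position ranges like the ones proposed above, here is a minimal Python sketch that finds the innermost node under a cursor position. It assumes the hypothetical JSON shape from this proposal (c3c does not necessarily emit it), treats `doc_range` as `[line, character]` pairs with an exclusive end, and uses made-up helper names (`iter_nodes`, `innermost_node_at`).

```python
import json

# Hypothetical AST in the shape proposed above (trimmed); not real c3c output.
AST_JSON = """
{
  "module": {
    "name": "foo",
    "doc_range": {"start": [0, 0], "end": [1, 31]},
    "statements": [
      {
        "node_type": "var_declaration",
        "names": [
          {"node_type": "identifier", "identifier": "a", "path": "foo",
           "doc_range": {"start": [1, 6], "end": [1, 7]}}
        ],
        "type": {
          "identifier": {"node_type": "identifier", "identifier": "int",
                         "path": null,
                         "doc_range": {"start": [1, 2], "end": [1, 5]}}
        }
      }
    ]
  }
}
"""


def iter_nodes(value):
    """Yield every dict in the tree that carries a "doc_range" (pre-order)."""
    if isinstance(value, dict):
        if "doc_range" in value:
            yield value
        for child in value.values():
            yield from iter_nodes(child)
    elif isinstance(value, list):
        for child in value:
            yield from iter_nodes(child)


def contains(doc_range, line, character):
    """True if (line, character) lies in [start, end) of the range."""
    start = tuple(doc_range["start"])
    end = tuple(doc_range["end"])
    return start <= (line, character) < end


def innermost_node_at(ast, line, character):
    """Return the most deeply nested node covering the position, or None."""
    hits = [n for n in iter_nodes(ast) if contains(n["doc_range"], line, character)]
    # Pre-order traversal visits ancestors before descendants, so the last
    # hit is the innermost one as long as sibling ranges do not overlap.
    return hits[-1] if hits else None


ast = json.loads(AST_JSON)
node = innermost_node_at(ast, 1, 6)
print(node["identifier"] if node else None)
```

With ranges on every node, lookups such as hover or go-to-definition reduce to this kind of search, which is the main reason positions matter more than the exact node layout.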

Semantic Errors

Format: JSON?

Proposal:

Root
{
"file": String. Path to the parsed file.
"errors": Array<Error>
}

Error
{
 "description": String. Description of the error.
 "line": uint. Line number where the error happened.
 "character": uint. Character number where the error happened.
}
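As a rough sketch of how the LSP side could consume the proposed Root/Error structure, the Python below parses such a payload and converts each error into an LSP `Diagnostic` dictionary. Several things here are assumptions rather than part of the proposal: the class and function names are made up, the compiler's positions are assumed to be 1-based (LSP positions are 0-based), and the range is one character wide because the proposal carries no end position.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CompilerError:
    description: str
    line: int
    character: int


@dataclass
class FileReport:
    file: str
    errors: List[CompilerError]


def parse_report(payload: dict) -> FileReport:
    """Build a FileReport from a decoded JSON object in the proposed shape."""
    errors = [
        CompilerError(e["description"], int(e["line"]), int(e["character"]))
        for e in payload["errors"]
    ]
    return FileReport(payload["file"], errors)


def to_lsp_diagnostic(error: CompilerError, one_based: bool = True) -> dict:
    """Convert one error to an LSP Diagnostic (positions are 0-based in LSP)."""
    offset = 1 if one_based else 0  # assumption: compiler output is 1-based
    line = max(error.line - offset, 0)
    character = max(error.character - offset, 0)
    return {
        "range": {
            "start": {"line": line, "character": character},
            # No end position in the proposal, so highlight a single character.
            "end": {"line": line, "character": character + 1},
        },
        "severity": 1,  # DiagnosticSeverity.Error in the LSP specification
        "source": "c3c",
        "message": error.description,
    }


report = parse_report({
    "file": "src/main.c3",
    "errors": [{"description": "Expected a type here.", "line": 6, "character": 9}],
})
diagnostic = to_lsp_diagnostic(report.errors[0])
print(diagnostic["range"]["start"])
```

If the error format also carried an end position, the editor could underline the whole offending token instead of a single character.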

lerno commented 3 months ago

You can look at --test, which already prints its output in a known format.

lerno commented 3 months ago

Here is an example output when running code with a semantic error:

Error|debugstuff.c3|9|'foo' could not be found, did you spell it right?
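A line in this pipe-separated shape is easy to parse on the LSP side. The sketch below is in Python and assumes only the four fields visible in the example (kind, file, line, message); the `ReportedLine` type and `parse_test_line` helper are names made up for illustration, and any other fields c3c may emit are not handled.

```python
from typing import NamedTuple, Optional


class ReportedLine(NamedTuple):
    kind: str
    file: str
    line: int
    message: str


def parse_test_line(text: str) -> Optional[ReportedLine]:
    """Parse 'Kind|file|line|message'; return None for anything else."""
    # Split at most three times so a '|' inside the message is preserved.
    parts = text.rstrip("\n").split("|", 3)
    if len(parts) != 4 or not parts[2].isdigit():
        return None
    kind, file, line, message = parts
    return ReportedLine(kind, file, int(line), message)


sample = "Error|debugstuff.c3|9|'foo' could not be found, did you spell it right?"
print(parse_test_line(sample))
```

Returning `None` for non-matching lines lets the caller feed the whole compiler output through the parser and keep only the lines that look like reports.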

pherrymason commented 3 months ago

Great, that might work for the errors. I'll try it and come back with any feedback.

lerno commented 3 months ago

Any updates?

pherrymason commented 3 months ago

Yes! I'm missing the character position where the error is happening. See example:

Running c3c build gives both line and character:

➜  test-c3 git:(main) ✗ c3c build
 3: import app;
 4:
 5: fn void main() {
 6:     io::prin
            ^^^^
(/xxx/test-c3/src/main.c3:6:9) Error: Expected a type here.

Running with --test:

➜  test-c3 git:(main) ✗ c3c build --test
Error|main.c3|6|Expected a type here. // <-- here I only get the line, I would need the character too
➜  test-c3 git:(main) ✗
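Until the --test output carries the character position, one stopgap would be to scrape it from the human-readable output. The Python sketch below matches the `(path:line:column) Kind: message` line shown above; the pattern is inferred from this single sample, so treat it as an assumption and not a documented c3c format. The helper name is made up.

```python
import re
from typing import Optional, Tuple

# Inferred from one sample line; other c3c messages may look different.
LOCATION_RE = re.compile(
    r"^\((?P<path>.+):(?P<line>\d+):(?P<col>\d+)\) (?P<kind>\w+): (?P<message>.*)$"
)


def parse_located_error(text: str) -> Optional[Tuple[str, int, int, str, str]]:
    """Return (path, line, column, kind, message), or None if no match."""
    match = LOCATION_RE.match(text.strip())
    if match is None:
        return None
    return (
        match["path"],
        int(match["line"]),
        int(match["col"]),
        match["kind"],
        match["message"],
    )


sample = "(/xxx/test-c3/src/main.c3:6:9) Error: Expected a type here."
print(parse_located_error(sample))
```

Scraping output meant for humans is brittle, which is why a character field in the machine-readable format is the better long-term fix.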

Also, totally unrelated, but I cannot get the --path argument of c3c build to work: it complains that the folder does not contain a project.json file.

lerno commented 3 months ago

Can you file the --path issue as a bug, and then file the extension of --test as an enhancement request? It's easily fixed; I just need to track it so I can do it when I have time.

pherrymason commented 3 months ago

It would also help a lot if the file field contained its path. I'm wondering whether a path relative to the root of the project would suffice. What do you think? Do you want me to open a ticket for it?

lerno commented 3 months ago

Anything else we need from this one?

pherrymason commented 3 months ago

The errors part is solved. However, full AST generation would be really useful for easier and better code analysis.

lerno commented 2 months ago

What kind of information would you want for that? There is a lot of info...

pherrymason commented 2 months ago

Ideally, position ranges together with the represented string and the node type. But honestly, this can wait; I'm sure there are other, more important things to resolve or improve!

pherrymason commented 2 months ago

Let me give you some more context. For the LSP I'm currently using Tree-sitter's parser to get a CST (concrete syntax tree). This gives me structured info about the source code, but it is not a full AST (it has too many details about the source text itself that I don't need). Having an AST would simplify the structure I need to traverse and would let me improve some of the analysis I do in the LSP.

I'm currently trying to convert the CST to an AST in the LSP itself, which is why I said there is no urgency for c3c itself to solve this. I still need to validate my assumptions about working with an AST.

However, if they are validated, relying on converting Tree-sitter's CST to an AST is not very sustainable in the long term: any modification to the Tree-sitter grammar will require adapting the conversion code, and that will be very tedious.

If c3c were able to emit a full AST of the project, that output would evolve together with c3c itself automatically and avoid breakage every time the Tree-sitter grammar is updated.
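To make the maintenance concern concrete, here is a small Python sketch of a CST-to-AST conversion. It does not use the Tree-sitter library. The CST is modelled as plain dictionaries, and every node name in `NODE_TYPE_MAP` is hypothetical, not taken from the real C3 grammar. The point is that the mapping table is tied to grammar node names, so a grammar change silently breaks it.

```python
# Hypothetical mapping from grammar node names to AST node types.
# This table is the part that must be updated whenever the grammar changes.
NODE_TYPE_MAP = {
    "declaration": "var_declaration",
    "ident": "identifier",
    "call_expr": "call_invocation",
}


def cst_to_ast(node):
    """Drop anonymous tokens (punctuation) and rename the remaining nodes.

    Returns None for nodes that are dropped.
    """
    if not node.get("named", False):
        return None
    converted = (cst_to_ast(child) for child in node.get("children", []))
    children = [child for child in converted if child is not None]
    ast = {"node_type": NODE_TYPE_MAP.get(node["type"], node["type"])}
    if children:
        ast["children"] = children
    else:
        ast["text"] = node.get("text", "")
    return ast


# A toy CST for something like `a = b;` with punctuation kept as tokens.
cst = {
    "type": "declaration", "named": True,
    "children": [
        {"type": "ident", "named": True, "text": "a"},
        {"type": "=", "named": False, "text": "="},
        {"type": "ident", "named": True, "text": "b"},
        {"type": ";", "named": False, "text": ";"},
    ],
}
print(cst_to_ast(cst))
```

If the grammar renamed `ident` to something else, the conversion would still run but would emit nodes the rest of the LSP no longer recognises, which is the breakage an AST emitted by c3c would avoid.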

c3lang / c3c

Output AST / AST + errors found #1329
