c272 / iro4cli

An open-source rewrite of Iro, a grammar generator, supporting automatic VSCode & Atom extension generation.
GNU General Public License v3.0

Supporting ANTLR 4 or Langium grammars #11

Open svallory opened 1 year ago

svallory commented 1 year ago

Hi!

I was wondering how hard it would be to support ANTLR 4 or Langium grammars as input, or at least to generate one of them as output.

The reason for the question is that both are used to create Language Servers. Langium is especially helpful to create VS Code extensions with language servers.

svallory commented 1 year ago

@c272 how hard would it be to contribute this generator?

c272 commented 1 year ago

@svallory Since Iro files are essentially just giant state machine definitions, processing input through a stack of states until input ends, I'll talk more generally about generating AST-based outputs from an Iro file, then talk about each specific output type.

We can theoretically generate a new AST node on each context push within the state machine, and calculate the available sub-nodes and possible terminals from the new context's push, pop and pattern definitions (this should be sufficient if we assume every context turned into a node has a pop). There would be obvious left-recursion that would need resolving.
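As a rough illustration of the idea above, here is a minimal Python sketch. It assumes a hypothetical in-memory model of contexts (`Context`, `Push`, and `NodeSpec` are invented for this example and are not iro4cli's actual data structures) and derives one node description per pushed context from its push, pop and pattern definitions.

```python
from dataclasses import dataclass, field


@dataclass
class Push:
    regex: str      # terminal that triggers the push
    target: str     # name of the context pushed onto the stack
    pop_regex: str  # terminal that pops the pushed context again


@dataclass
class Context:
    name: str
    patterns: list[str] = field(default_factory=list)  # plain terminals
    pushes: list[Push] = field(default_factory=list)


@dataclass
class NodeSpec:
    name: str
    opens_with: str
    closes_with: str
    terminals: list[str]
    children: list[str]


def derive_nodes(contexts: dict[str, Context]) -> list[NodeSpec]:
    """Create one node description for every context that is pushed somewhere."""
    nodes: dict[str, NodeSpec] = {}
    for ctx in contexts.values():
        for push in ctx.pushes:
            target = contexts[push.target]
            # Simplification: if a context is pushed with different
            # delimiters in several places, only the first one is kept.
            nodes.setdefault(target.name, NodeSpec(
                name=target.name,
                opens_with=push.regex,
                closes_with=push.pop_regex,
                terminals=list(target.patterns),
                children=[p.target for p in target.pushes],
            ))
    return list(nodes.values())


# Hypothetical grammar: numbers, with nested parenthesised lists of numbers.
contexts = {
    "main": Context("main", [r"\d+"], [Push(r"\(", "parens", r"\)")]),
    "parens": Context("parens", [r"\d+", ","], [Push(r"\(", "parens", r"\)")]),
}
nodes = derive_nodes(contexts)
```

Here `nodes` contains a single `parens` node that opens on `\(`, closes on `\)`, and lists itself as a possible child, which is the kind of recursion an AST-based output would have to express.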

Along with a patch to add parsing of UIDs, this seems achievable from my point of view. More timely would probably be the generation of a lexer-only output, leaving the parser up to a later stage, but that's a different discussion.

ANTLR4 Output

There are differences in regex features/format between ANTLR and Iro which would need to be somehow resolved. For example, negative capture groups are in a different format, and ANTLR does not support JS-style lookbehind/lookahead. Not sure how we'd resolve this one nicely without just parsing and element-by-element converting the entire regex.
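Short of a full regex parser, a first step could be to flag the patterns that would need manual attention. The sketch below is a hypothetical helper (`find_lookarounds` is not part of iro4cli) that scans a pattern for lookahead and lookbehind groups while skipping escaped characters and character classes. It only detects the constructs; it does not convert them.

```python
# Longer prefixes first so "(?<=" is not mistaken for something shorter.
LOOKAROUND_PREFIXES = ("(?<=", "(?<!", "(?=", "(?!")


def find_lookarounds(regex: str) -> list[tuple[int, str]]:
    """Return (offset, prefix) for every lookaround group in the pattern."""
    found = []
    i = 0
    in_class = False  # True while inside a [...] character class
    while i < len(regex):
        ch = regex[i]
        if ch == "\\":
            i += 2  # skip the escaped character entirely
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(":
            for prefix in LOOKAROUND_PREFIXES:
                if regex.startswith(prefix, i):
                    found.append((i, prefix))
                    break
        i += 1
    return found
```

A generator could call this on each pattern and emit a warning, or refuse ANTLR output, when the returned list is not empty. Edge cases such as a literal `]` at the start of a character class are deliberately ignored here.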

Langium Output

This is entirely outside of my prior knowledge, but from having a glance at the grammar specification, it seems as if this would share much of the same work that would go into creating the ANTLR generator. The use of JS-style regex here would probably also remove any conversion work that would have to happen as in ANTLR.

Since they both essentially amount to an AST description, I imagine the best course of action here would be to create a mechanism for transforming parsed Iro grammars into a generic AST format, and then have that be the input to a compiler to any AST-based output formats such as ANTLR or Langium. It would be a fair amount of work, but definitely seems doable.
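To make the "generic AST format, then one compiler per output" idea concrete, here is a small Python sketch. Everything in it is an assumption made for illustration: the `Grammar`, `Rule`, `Ref`, `Literal` and `Terminal` classes are a made-up intermediate form, and the two emitters cover only a tiny subset of ANTLR 4 and Langium syntax. Terminal patterns are copied verbatim, so this only works for patterns that happen to be valid in both formats.

```python
from dataclasses import dataclass


@dataclass
class Literal:
    text: str  # keyword or punctuation matched literally


@dataclass
class Ref:
    target: str        # name of a rule or terminal
    many: bool = False  # True for zero-or-more repetitions


@dataclass
class Rule:
    name: str
    parts: list


@dataclass
class Terminal:
    name: str
    regex: str


@dataclass
class Grammar:
    name: str
    rules: list
    terminals: list


def emit_antlr(g: Grammar) -> str:
    rule_names = {r.name for r in g.rules}

    def antlr_name(name: str) -> str:
        # ANTLR parser rule names start with a lowercase letter.
        return name[0].lower() + name[1:] if name in rule_names else name

    lines = [f"grammar {g.name};"]
    for rule in g.rules:
        parts = [
            f"'{p.text}'" if isinstance(p, Literal)
            else antlr_name(p.target) + ("*" if p.many else "")
            for p in rule.parts
        ]
        lines.append(f"{antlr_name(rule.name)} : {' '.join(parts)} ;")
    for t in g.terminals:
        lines.append(f"{t.name} : {t.regex} ;")
    return "\n".join(lines)


def emit_langium(g: Grammar) -> str:
    lines = [f"grammar {g.name}"]
    for index, rule in enumerate(g.rules):
        parts = []
        for p in rule.parts:
            if isinstance(p, Literal):
                parts.append(f"'{p.text}'")
            else:
                # Rule calls are assigned to a property named after the target.
                op, star = ("+=", "*") if p.many else ("=", "")
                parts.append(f"{p.target.lower()}{op}{p.target}{star}")
        prefix = "entry " if index == 0 else ""  # first rule is the entry rule
        lines.append(f"{prefix}{rule.name}: {' '.join(parts)};")
    for t in g.terminals:
        lines.append(f"terminal {t.name}: /{t.regex}/;")
    return "\n".join(lines)


demo = Grammar(
    name="Demo",
    rules=[Rule("List", [Literal("("), Ref("NUMBER", many=True), Literal(")")])],
    terminals=[Terminal("NUMBER", "[0-9]+")],
)
antlr_text = emit_antlr(demo)
langium_text = emit_langium(demo)
```

The point of the split is that the Iro-specific work happens once, when building the `Grammar` value, and each additional output format is a separate function over that value.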

svallory commented 1 year ago

@c272 Sorry for the delay! Work has been crazy these past few days :/

> There would be obvious left-recursion that would need resolving.

What do you mean? Does iro support it? I know that Langium does not.

> More timely would probably be the generation of a lexer-only output, leaving the parser up to a later stage, but that's a different discussion.

I think it would be awesome to get a JSON representation of the AST and have a JSON schema for it. That way other people could create generators for other grammars, use iro4cli to generate the JSON, and take from there.

(just finished reading your comment, lol... we are on the same track)

If I could get a JSON representation of the iro file I can create an iro to langium converter.
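For illustration only, such an export could look roughly like the Python sketch below. The field names (`name`, `contexts`, `patterns`) and the `export_grammar` helper are assumptions made for this example, not an existing iro4cli format; the schema uses standard JSON Schema keywords and is only declared here, not enforced.

```python
import json

# Hypothetical schema for the exported document (JSON Schema draft 2020-12).
GRAMMAR_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["name", "contexts"],
    "properties": {
        "name": {"type": "string"},
        "contexts": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["name", "patterns"],
                "properties": {
                    "name": {"type": "string"},
                    "patterns": {"type": "array", "items": {"type": "string"}},
                },
            },
        },
    },
}


def export_grammar(name: str, contexts: dict[str, list[str]]) -> str:
    """Serialise a grammar (context name -> regex patterns) to JSON text."""
    doc = {
        "name": name,
        "contexts": [
            {"name": ctx_name, "patterns": list(patterns)}
            for ctx_name, patterns in contexts.items()
        ],
    }
    return json.dumps(doc, indent=2)


exported = export_grammar("demo", {"main": [r"\d+", r"\("]})
loaded = json.loads(exported)  # what a third-party generator would read
```

A separate converter, for Langium or anything else, would then only need to read the JSON and could be validated against the published schema with any JSON Schema validator.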

Here's why I want this so bad:

For me, the hardest part of creating a new language is the syntax highlighting that uses TextMate. I can write a DSL grammar in two days but it takes two weeks to get the syntax highlighting working correctly! That's where all the value of iro is for me.

I asked about ANTLR4 simply because there already is an ANTLR4 to Langium converter. My goal is to use Langium. It is written in TypeScript, like most IDEs these days, and it does a LOT automatically to create a plugin and a language server. It's an awesome tool.

Btw, why did you pick iro instead of another grammar, e.g. ANTLR4? It seems like you created everything from scratch anyways...

Well, the thing is, I don't want to have to specify my language in two grammars. During the initial phase of language design I experiment a lot with new constructs, and getting other people to try it is much easier if you have a good IDE extension, especially on the highlighting part.

So this is why I'm eager to have iro generate both Langium and TextMate. I can tell you, writing TextMate files is the worst nightmare for every language designer. You could make iro a product; I would pay for it!

c272 commented 1 year ago

> What do you mean? Does iro support it? I know that Langium does not.

Ignore that, evidently my brain wasn't working at the time... this shouldn't be a problem since push rules always start with a terminal anyway 🤦‍♂️.

> Btw, why did you pick iro instead of another grammar, e.g. ANTLR4? It seems like you created everything from scratch anyways...

I was using the official Iro tool at the time, and created this so that I wasn't depending on the online proprietary version. Plus, ANTLR isn't really designed for attaching the required metadata for outputs like TextMate, Ace, etc.

> So this is why I'm eager to have iro generate both Langium and TextMate

Absolutely agree! I think figuring out a good way to get quality AST output is the main issue here. I'm going to look into it as the main point for the next release.