chevrotain as the 'generated' code from PEG

Chevrotain / chevrotain

Parser Building Toolkit for JavaScript

https://chevrotain.io

Apache License 2.0

chevrotain as the 'generated' code from PEG #293

Closed KrishnaPG closed 8 years ago

KrishnaPG commented 8 years ago

While reviewing different DSL generators, came across this.

From the performance perspective, this one is beating the others. However, from the maintenance perspective this looks slightly complex (so much repetitive code). Also, it would be impossible to transform the original grammar (expressed by the code) to other languages (since there is no separate grammar or generator).

A grammar essentially captures just the symbols and their transitions/groups. In that sense the formalisms of PEG and similar are quite tidy and easy to maintain, and can also have multiple language implementations. However, the code generated by a PEG parser is not as performant as this library.

One question I have then is:

Does it make sense to keep the grammar intent in the tidy format of, say, PEG or similar, and have the usual PEG parsers, such as peg.js, generate code that uses this chevrotain API?

That way, we can gain

the original grammar is separate and easy to reuse part-wise and change (especially when it grows large)
can have the performance benefits
can have implementations from other languages too (since the original grammar is abstracted out as a separate, formal grammar file)

bd82 commented 8 years ago

Thanks for the feedback @KrishnaPG. This will be a long reply so bear with me 😄.

From the performance perspective, this one is beating the others. However, from the maintenance perspective this looks slightly complex (so much repetitive code). Also, it would be impossible to transform the original grammar (expressed by the code) to other languages (since there is no separate grammar or generator).

Exported Grammar Structure:

That is not completely accurate. The Grammar's structure is exported and can be accessed programmatically.

Perhaps I should create an example that explicitly demonstrates this?
This means that given another parsing library with identical semantics for your specific grammar you could run a node.js script to generate a grammar in that library's syntax.

This is actually how the parser is implemented on the inside, the grammar's structure must be known to perform lookahead decisions (which branch to take), to detect ambiguities and left recursion or even just to draw the syntax diagrams.

On Maintenance and TCO:

I agree that writing the grammar in pure JavaScript results in more verbose and uglier code. This is one of the trade-offs of the approach taken by Chevrotain being an internal JavaScript DSL.

But this does not necessarily mean that it is harder to maintain:

Yes, the syntax of the grammar is uglier, but it could also be easier to write because any JavaScript IDE may be used with its full capabilities (syntax highlighting / goto definition / find usages / ...). With a parser generator, this gets even worse when you add code snippets inside your grammar: even if you have an editor which supports your favorite parser generator's syntax, does it also support your favorite target language's syntax/semantics too?
Reading and writing code is only the beginning. What about debugging? Can you add breakpoints inside a formal grammar definition? With Chevrotain you can add a breakpoint anywhere, just as you would for any other JavaScript code.
Lack of generated code can actually be claimed to reduce the complexity of a project. I have seen projects where 20K+ lines of generated Antlr3 code were committed to the source code. Granted, this is an extreme case and can be resolved, but not having to deal with generation steps / file watchers does reduce complexity (even if by just a little). I would describe this advantage as the removal of one more "Leaky Abstraction".

To summarize: There are both advantages and disadvantages to having no separate code generation phase, particularly from a maintenance and TCO perspective. Which approach has the advantage depends on the specific use case (grammar generator used, available editor tools for that grammar generator, quality of the editor used for the target language, how much debugging is expected, etc.).

Reducing Grammar verbosity and increasing readability:

The bigger problem imho is that grammar actions (a.k.a. semantic actions) are embedded as part of the grammar. This drastically reduces readability and maintainability as it breaks the single responsibility principle.

Example1:

Example2:

It is easy to notice that the pure versions are so much more readable and understandable.

Antlr4 has resolved this problem by providing "listeners" that perform grammar actions and are defined completely separately from the grammar. I've had some thoughts on how to implement this feature with Chevrotain. If I can achieve that capability I believe 80%+ of the problem will be resolved.

Using ES6 Syntax to increase readability.

ES6's "fat arrow" and "class" keyword can be used to reduce the verbosity.

() => {}

Example1 - XML Example2 - JSON

Using Chevrotain as a compilation Target

Sorry for only now getting to your original question 😄 Will be answered in a separate comment.

Edited as I did not realize the comment was not a quote of someone else...
Edited to add ES6 examples.

bd82 commented 8 years ago

Using Chevrotain as a compilation target for PEG.js / Parser Generators:

Does it make sense to keep the grammar intent in the tidy format of, say, PEG or similar, and have the usual PEG parsers, such as peg.js, generate code that uses this chevrotain API?

There are a few problems with this:

Biggest issue: The semantics of different Parsing solutions are often similar but not identical, examples:
- PEG.js does not have a separate tokenizing phase while Chevrotain does.
- Antlr4 is LL(*) with support for handling some cases of Left Recursion, while Chevrotain is LL(K) and will detect and fail on any use of left recursion.
- This means that not all grammars can be directly machine translated between parsing libraries. Worse yet, some could be directly translated but would have slight semantic differences which could only be discovered at runtime for specific inputs.
The grammar is not everything, a pure grammar won't accomplish much, once embedded actions are added this increases the complexity of the translation process.
Chevrotain may be very fast, but that is not because there is anything inherently fast in its approach. It is fast in spite of it, due to the massive amounts of performance optimizations it does. Having no compile step is a big disadvantage when it comes to maximum performance potential. Just the use of anonymous functions in the grammar is horrible from a performance perspective. What I'm getting at is that if I wanted to implement a compile target for parser generators (some sort of Parser Assembly Language) that maximizes performance then I would have created a completely different project...

bd82 commented 8 years ago

Creating a Tidy EBNF like format to specify Chevrotain Grammars.

While using Chevrotain as a compilation target for other parsing libraries seems problematic, creating an external DSL that compiles to Chevrotain's internal DSL is possible.

Such a concept would be even more viable if/once the embedded actions have been separated from the grammar and the grammar file (either external or internal DSL) will just be a pure grammar file as it would make the generation phase very simple.

This kind of feature is interesting but not currently of a high priority for me, nor am I even sure it should be part of Chevrotain due to feature scope creep and maintenance concerns. But if the use case is only dealing with pure grammars it should not be too complex and can be easily developed separately from Chevrotain as a small library. The most difficult part of such a feature would be developing the grammar of the external DSL, but that will just be some variant of BNF/EBNF so once again not too complex, see: Example of Antlr Grammar specified in Antlr syntax.

If you are interested in creating such a project I suggest following the issue for support of pure grammars with separate actions. And only attempting this once productive pure grammars are supported.

KrishnaPG commented 8 years ago

Thank you @bd82 for the detailed answer. I agree with your points on maintenance (especially the debug capability and embedded actions).

I like the Chevrotain approach of Javascript itself being the DSL - it solves multiple problems as you have indicated. However, the challenge is, having the same grammar be parsed by other languages, say C/C++.

On the other hand, I think we really do not need to have the same grammar be parsed by other languages, if the underlying actions can somehow be made available in the DSL irrespective of their language.

Let me explain with a simple example. This is something we are currently trying to solve and any help / info from Chevrotain side could greatly help.

Use case: Consider the below code in a custom DSL:

  read_input | update_chart | ( > 10) ? raise_alarm;

The read_input, raise_alarm and update_chart are provided by domain specific implementations, such as, for example:

implementation 1 (say, in C++)

void read_input() { 
   // read from standard input
}
void raise_alarm() {
  // write to error console
}
void update_chart() {
  // write to console output
}

implementation 2 (say, in Javascript)

function read_input() {
   // ajax load the input every 5 seconds
 }
function raise_alarm() {
  // POST data to error log on server and send eMail
 }
function update_chart() {
  // update the chart on web-page UI
}

Now, with Autobahn / WAMP, DeepStream etc., it is pretty straightforward to inter-operate and invoke methods from different languages.

What is missing, rather, is the ability to create dynamic DSLs that are formed out of the available underlying methods and knowing when to invoke what (like a parser event).

For example, consider something like below:

import 'implementation2';

read_input | update_chart | ( > 10) ? raise_alarm;

After the above import statement, all functions / actions provided by the implementation2 should be available as tokens in the current DSL, and the parser should be capable of invoking them (in the order governed by the DSL's operator semantics, say left-to-right or right-to-left etc.).

In other words, the valid tokens of the DSL are not static, but rather are varying, leading to a dynamic DSL that is populated by the underlying actions (implemented by any language and invoked through RPC, such as WAMP or DeepStream) - but the language itself is static (such as the operator precedence etc.)

Use case 2: To give another example, consider the below DSL from music domain:

  C C D G A

Each note (such as C, D etc.) has an underlying MIDI action supported by C/C++ implementation as below:

function C() { play_midi_note('c'); }
function D() { play_midi_note('d'); }
 // ...

In this case the notes are valid tokens in the DSL. But the notes need not be C D ... always, they could just be Do Re Mi ... etc. in which case:

import `solfège`;
 Do Re Mi Do ...

solfège implementation:

function Do() { play_midi_note('...'); }
function Re() { play_midi_note('...'); }
 // ...

Question:

How to implement such dynamic DSL (that has varying tokens) ?

The goal is: Consider a text editor that supports this kind of music DSL. As user keeps entering the notes (such as C D etc..) the editor should actually execute the underlying action for each note (in this case play the actual note) that the user just now entered.

This requires the text editor to support dynamic DSLs that can be loaded on the fly and know when to invoke the underlying action for the token.

bd82 commented 8 years ago

The background information helps, I think I now understand your use case.

On Dynamic Tokens with fixed structure:

How to implement such dynamic DSL (that has varying tokens) ?

Chevrotain has a concept of Token inheritance which allows matching Dynamically Defined Tokens. Additionally, Chevrotain also supports some mechanisms of Grammar Inheritance.

The dynamically defined Tokens should be enough for the use case you described, but Grammar inheritance could provide a bit more dynamic power if you also want to dynamically change the structure of the language.

Overall solution and separation of concerns:

I generally recommend separating the syntactic analysis from any business logic, this is doubly true in a complex use case such as yours.

I think that your parser should at no point perform the actual RPC calls, instead it should create an intermediate data structure (AST?) which represents those custom user actions. This data structure should preferably be serializable (to json?) to allow transmitting over the "wire". What this will accomplish is allow you to invoke the user actions in multiple runtimes (JS/Java/C++/.NET/...) as the implementer of such an invoker no longer needs to deal with the difficulty of parsing your complex language, only read a simple protocol (json?).
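As a rough illustration of this separation, here is a minimal sketch in plain JavaScript. It does not use Chevrotain; `parsePipeline`, `runPipeline` and the node shapes are hypothetical names invented for this example, and the conditional part of the DSL (`( > 10) ? raise_alarm`) is left out. The parser only produces a JSON-serializable structure, and a separate invoker (which could live in any runtime) walks it:

```javascript
// Hypothetical sketch: turn "read_input | update_chart" into a plain,
// JSON-serializable structure instead of invoking anything while parsing.
function parsePipeline(text) {
  const stages = text
    .replace(/;\s*$/, "") // drop an optional trailing semicolon
    .split("|")
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
  return {
    type: "Pipeline",
    stages: stages.map((name) => ({ type: "Call", name })),
  };
}

// A separate invoker walks the structure. The actions passed in here are
// local stand-ins for what would be RPC calls in a real system.
function runPipeline(ast, actions) {
  const log = [];
  for (const stage of ast.stages) {
    const known = Object.prototype.hasOwnProperty.call(actions, stage.name);
    if (!known || typeof actions[stage.name] !== "function") {
      throw new Error("Unknown action: " + stage.name);
    }
    log.push(actions[stage.name]());
  }
  return log;
}

const ast = parsePipeline("read_input | update_chart;");
const wire = JSON.stringify(ast); // what would travel over the "wire"
const received = JSON.parse(wire); // what another runtime would read
const result = runPipeline(received, {
  read_input: () => "input read",
  update_chart: () => "chart updated",
});
```

The invoker never sees the DSL text, only the simple protocol, so an equivalent invoker in C++ or Java would need a JSON reader but no parser.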

On building Editors:

Building language editors places different requirements on the parser than building standard compilers, specifically:

Fault Tolerance & Error Recovery - for dealing with partially valid input.
Incremental / Partial parsing - can be helped by using multiple start rules.
Providing syntactic content assist

Chevrotain was initially created to be part of both a compiler solution and an editor solution, which is why it has those capabilities. But these capabilities are just pre-requisites / enablers. You may also need to design your data structures to support partial parsing results, add logic in your editor to only re-parse the last text block modified by the user, etc.

Just keep in mind that the "standard" parsing flow of whole input from beginning to end may not be sufficient for building an editor.

Potential Concerns:

Relying on Chevrotain-specific (or uncommon) features such as dynamically defined tokens and error recovery will greatly limit the potential for migration to another parsing library in the future.
Mixing many unrelated languages into one "super grammar" could cause unintended ambiguities, perhaps dynamically mixing in only the sub languages the specific user enabled would be safer.
- I can try creating such an example using grammar inheritance if this interests you.

KrishnaPG commented 8 years ago

Thank you @bd82

The grammar inheritance sounds interesting - but it looks like the parser has to have the knowledge of all tokens before-hand.

Rather, what I am looking at is: a token list (and their associated actions) that can grow and shrink at runtime.

For example, consider an editor app. When it starts running, it would not know which tokens the user might type. But once the user enters import xyz then it would know the list of valid tokens the user is expected to enter from then on. And by the time the user completes entering his logic (in the custom xyz DSL) in that editor, the editor would have already figured out the AST required to process that logic (and the corresponding actions to take to execute it). This AST could then be distributed as work definition to some cluster of machines around the world, and each machine will either execute the AST action as it is, or convert into native binary for performance and invoke the underlying actions (each possibly provided by different language binaries).
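To make the idea of a token list that grows and shrinks at runtime concrete, here is a minimal sketch in plain JavaScript. It is not Chevrotain API; `dslRegistry`, `DynamicTokenSet` and the DSL names are hypothetical and exist only for this illustration:

```javascript
// Hypothetical registry: maps a DSL name to the token names it contributes.
const dslRegistry = new Map([
  ["music_dsl", ["C", "D", "G", "A"]],
  ["solfege_dsl", ["Do", "Re", "Mi"]],
]);

class DynamicTokenSet {
  constructor() {
    this.active = new Set();
  }

  // Called when the user types `import <name>`: the valid token list grows.
  importDsl(name) {
    if (!dslRegistry.has(name)) throw new Error("Unknown DSL: " + name);
    dslRegistry.get(name).forEach((t) => this.active.add(t));
  }

  // Called when a DSL is unloaded: the valid token list shrinks again.
  removeDsl(name) {
    (dslRegistry.get(name) || []).forEach((t) => this.active.delete(t));
  }

  // Split on whitespace and classify each word against the current set.
  tokenize(text) {
    return text
      .split(/\s+/)
      .filter(Boolean)
      .map((word) => ({ image: word, known: this.active.has(word) }));
  }
}

const tokens = new DynamicTokenSet();
const before = tokens.tokenize("C Do"); // nothing imported yet
tokens.importDsl("music_dsl");
const after = tokens.tokenize("C Do"); // "C" is now a valid token
tokens.removeDsl("music_dsl");
const removed = tokens.tokenize("C Do"); // back to nothing known
```

This only covers the lexical side; which structures those tokens may appear in is a separate question.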

Static parsers (such as those expressed with PEG grammars) would not be able to achieve that kind of feat - and I think Chevrotain is the closest bet (given that JS is its DSL, which is dynamic by nature) for this kind of task. Just that I am not sure how to actually achieve it.

One thing I can tell is that requirements for dynamic DSLs are on the rise in the industry, and if Chevrotain can somehow illustrate its strength in this area, it is going to play a major role in the evolution of next-gen tech surrounding use cases such as IOT and edge analytics (which require business logic work definitions to be programmable by the user, usually powered by polyglot functional providers underneath).

bd82 commented 8 years ago

Some Questions:

Do you want just the list of valid tokens to be changed at runtime or also the grammar's structure too?
Are we dealing with structured languages here (Context Free) or natural languages?
Can we define bounds (implicit or explicit), i.e where each sub DSL has started and ended?
Can we limit the imports to the top of the text?
can a single text include multiple unrelated DSLs ?

A sample input case:

// imports
import smart_home_dsl;
import smart_home_temperature_sensor_addon_dsl;
import smart_home_aircon_addon_dsl;

// user's code
connect to my_home in my_ip_address.
connect to living_room_temperature_sensor in my_home.
connect to living_room_air_con in my_home.

IF living_room_temperature_sensor's reading is greater than 25 degrees celsius
   THEN activate living_room_air_con at full blast

Is this a valid example for the discussion?

KrishnaPG commented 8 years ago

Great questions @bd82 . The sample input you have demonstrated is perfectly valid example. Will try to answer the questions by building on that example.

Are we dealing with structured languages here (Context Free) or natural languages?

Structured languages. No ambiguity is allowed.

Can we define bounds (implicit or explicit), i.e where each sub DSL has started and ended?

I think this one is closely related to the other question below, hence will answer combined

Do you want just the list of valid tokens to be changed at runtime or also the grammar's structure too?

The grammars can change, and perhaps one way to avoid ambiguity is with bounds, either implicit or explicit (where each sub DSL started/ended).

Let me illustrate with an example. Extending your smart home example, consider the below

import alarm_dsl;

IF living_room_temperature_sensor's temperature > 25 degrees
    THEN play music_alarm {
        C C D G C G G
    }

The play music_alarm allows different music grammar that is comprised of music notes such as C, D, G etc... and its grammatical meaning is entirely different from the parent / host grammar (and within the bounds of { } ).

Now, consider that the music_alarm supports repeating the alarm some number of times (say 3 times as expressed below):

    THEN play music_alarm {
        C C D G C G G
    } X 3

Then such a play music_alarm grammatical structure could possibly be expressed as below:

// file: alarm_dsl

import music_string_dsl;

definition: 
       music_alarm {MUSIC_STRING} [X NUMBER]

or it could be more extensive as below:

// file: alarm_dsl

import music_string_dsl;
import calculator_dsl;
import sql_dsl;

definition: 
      music_alarm {MUSIC_STRING} [X NUMBER  |  (CALCULATION)  |  {SQL_STATEMENT}]

which allows one to write something like below utilizing the Calculator grammar:

    THEN play music_alarm {
        C C D G C G G
    } X (1 + 2)

or something like below utilizing the SQL grammar:

    THEN play music_alarm {
        C C D G C G G
    } X { SELECT alarm_count FROM user_preferences }

Extension Plugins: One possibility I would like to note here: say the alarm vendor supplies only {MUSIC_STRING} [X NUMBER] as the original grammar (without the calculator or SQL support). The user may then (on his end) use his own (possibly third-party) extensions, such as the one below, to extend the vendor's grammar.

// file: number_extension_dsl
redefine:
     NUMBER: NUMBER  |  (CALCULATION)  |  {SQL_STATEMENT}

By importing the above extension_dsl user can use SQL statements in place where the alarm vendor originally only supported numbers.

Can we limit the imports to the top of the text?

Should be fine with that, but not sure if one can know the list of all imports upfront.

For example,

    THEN play music_alarm {
        import music_dsl;          // <<- scoped import here
        C C D G C G G
    }

This is required when we are merging user-supplied pieces with a default template (such as happens in admin-dashboards). For example, say the above music piece is actually gathered from user as input on a web-page textbox. Then users would only see a music-entry editor / text box (without the surrounding context) and enters:

        import music_dsl; 
        C C D G C G G

which will then be merged with the outer base-template play music_alarm { % user_string_here % } loaded from DB.

In such a case, we would not know up front which DSLs are to be imported. The user may choose to use his own custom compatible music DSL (uploaded from his computer).

Since import statement essentially results in an AST being loaded into memory (possibly either replacing / extending or standalone), it would be good to support the import statement anywhere.

can a single text include multiple unrelated DSLs ?

As of now, I am not really sure if there are any good use-cases for unrelated DSLs in a single file, but I think if scoped DSLs are supported (with some implicit / explicit start/end block indications), putting unrelated DSLs in a single file should not be a problem. Not sure, though. With your expertise in ASTs / DSLs, I think you are the better judge of whether it is good to have or not. Will leave it to you.

bd82 commented 8 years ago

Thanks @KrishnaPG, I think these answers help to further clarify the use case.

Back to a much higher level now that we have a shared understanding of the use case.

The (tough) high level requirements here are:

Language Extensibility by external plugins.
Dynamic Language importing/enabling.
- Optionally with scoping support.

There are obviously other issues, such as generic (extensible?) data structures to represent the sub-languages' output, a portable protocol to communicate those actions, ... but those issues, while not trivial, are more standard, well understood engineering issues...

Alternative approach.

Have you considered that instead of extending the syntax of some "meta-language" to create new semantics (example: controlling smart home semantics) you can only deal with describing the semantics without inventing new syntax by creating internal DSLs?

Examples:

Chevrotain itself is a prime example of this approach as it did not create a new syntax to define grammars (external DSL) but is instead an internal JavaScript DSL.
Rest Assured Java rest services testing DSL
Creating Domain Specific Languages with Scala
Baysick Basic in Scala

Scala is worth extra emphasis here because it is very friendly towards internal DSLs. It even has macro support (like Lisp), which is something I've had to pseudo-hack together to implement Chevrotain in JavaScript.

This approach will solve many of the problems of creating such an extensible meta language, but at the cost of uglier code (with something like Scala, not as ugly...) and potentially lower performance.

Thoughts on the feasibility of creating such a meta language using Chevrotain.

Will add this in a separate comment, hopefully tomorrow.

KrishnaPG commented 8 years ago

Thank you @bd82

You are right w.r.to not creating new "syntaxes". For that matter, any one Turing-complete language is capable of solving all computable problems with just one syntax (by definition). Say, simple, plain-old C is enough.

However, I would like to stress that it is equally important to be able to create new syntaxes.

The fact that we already have a plethora of languages (more specifically, syntaxes) at the high level itself, such as C++, Javascript, Python, R etc., clearly indicates what the underlying problem is.

I will not go into the language wars, but hope to clarify the problem by illustrating few pain points that we are facing in the industry right now:

Consider a CEP (complex event processing) engine that requires real-time streaming calculations on the incoming streaming data. It requires more expressiveness in terms of "data flow". It is more apt to write something like

    source | transform1 | transform2 | output

than something that requires heavily nested callbacks or promises (JS style), or heavily sequential control flow (C style), though JS, C etc. can do the job perfectly well and perhaps more efficiently.

Consider a cultural document preservation system, such as music notation. It requires a tablature style of expressiveness that allows illustrating both "sequential and parallel" notes at the same time. This kind of syntax is hard to express with any existing programming language.

And the problem is, the syntax that suits CEP well just does not work for music notation.

We always tried to keep it low and stay true to one syntax paradigm, but it doesn't work. For example,

Our CFugue library is a masterpiece for Carnatic Music, straight from C/C++. But it is just not enough for what we want to achieve.
The M2M gap for C/C++ is too large: http://gopalakrishna.palem.in/blog/RestfulC++.html
Asynchronous multi-core execution syntax is hard to capture on fundamentally sequential syntax designs.

Now, one may argue that C/C++ is the wrong choice, stick to JS or Python or xyz language (syntax) for all things - well each language has its own problems.

It is just NOT possible to create one Syntax to rule them all. The very fact that each of the programming languages / syntaxes keep evolving with each version (say ES6, Python 3 etc..), is itself clear indication that every syntax is missing something.

Now, consider this scenario:

Instead of one programming language / syntax growing over years with different versions vertically (enhancing its syntax along the way), think of different syntaxes evolving horizontally at the same time each tailor built for different purposes. That horizontal set of syntaxes is our DSLs.

This is just like modularity concept in our programming - one module to do just one thing and doing it right. But we are applying that modularity to a syntax/language.

Why am I stressing on supporting multiple syntaxes

In the earlier days, someone created a language / syntax and everyone used it. With LLVM, there are already many languages / syntaxes out there, created by every other person.

IOT brings the capability to connect multiple devices from different domains (from wrist watch to refrigerator to aeroplane and what not). It is impossible to convince all device vendors to agree on one meta language. Even if the vendors wanted to create one syntax, given the diverse domains, they would eventually end up creating a whole set of DSLs, one for each domain (just like JS for web, C for system programming etc.).

Software creators (such as those who create IOT analytics platforms) ultimately have to support all these DSLs, and at any given point of time the software would not know which exact DSL it can expect (the devices may connect and go away randomly).

Hence the need for on the fly syntax and semantic adaptable editors / interpreters / generators.

Bottom Line is:

I request you to keep the option of being able to create and work with different set of syntaxes, as important as being able to extend the semantics of one syntax.

The former takes the horizontal approach, while the latter takes the vertical approach; both are needed.

KrishnaPG commented 8 years ago

Also, a sideline thought:

Is it possible to consider the AST as the universal meta language and treat all the syntaxes just as aliases at the semantic layer?

When a new syntactic expression is inserted into a parent code, if we can treat it as AST subtree insertion into parent AST (mounted at that insertion point), then I think it is possible to have different syntaxes work together.
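A minimal sketch of what such a subtree insertion could look like, in plain JavaScript. The node shape (`type`, `dsl`, `children`), the `Hole` placeholder and the `mount` helper are all assumptions made for this illustration, not taken from any language workbench or from Chevrotain:

```javascript
// Hypothetical node shape: { type, dsl, children }.
function node(type, dsl, children = []) {
  return { type, dsl, children };
}

// Returns a copy of `host` in which every "Hole" node named `slot` is
// replaced by `subtree`. The host AST itself is not mutated.
function mount(host, slot, subtree) {
  if (host.type === "Hole" && host.name === slot) return subtree;
  return {
    ...host,
    children: host.children.map((child) => mount(child, slot, subtree)),
  };
}

// Host AST produced by the (hypothetical) alarm DSL, with a mount point.
const hostAst = node("PlayAlarm", "alarm_dsl", [
  { type: "Hole", name: "MUSIC_STRING", dsl: "alarm_dsl", children: [] },
]);

// Subtree produced independently by the (hypothetical) music DSL.
const musicAst = node(
  "NoteSequence",
  "music_dsl",
  ["C", "C", "D"].map((n) => node("Note:" + n, "music_dsl"))
);

const combined = mount(hostAst, "MUSIC_STRING", musicAst);
```

Each node keeps a `dsl` tag, so a later processing step can still tell which syntax a subtree came from.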

I am taking this concept directly from the language workbenches but I see no reason why it cannot be extended here.

bd82 commented 8 years ago

On External vs Internal DSLs.

I'm not saying there is only one way, I'm just worried that often people underestimate the complexity of creating / maintaining new languages and are too quick to create a new syntax where an internal DSL could provide most of the benefits at fraction of the cost.

Perhaps the use case you are describing is the exception to the rule, both due to the highly varied domains present and because many of these languages seem to be(?) much smaller than your average programming language in terms of size and complexity.

Perhaps more than just creating a platform to help "Software creators" build things such as "IOT analytics platforms", a larger scope is required: providing a language toolkit to build these special domain languages for "IOT device vendors".

Basically, if company X provides "smart home controllers" that have an embedded JavaScript runtime, they could implement their own DSL using this hypothetical language toolkit. And the "Software creators" who build apps around those can "re-use" some of the products of the toolkit to build their editors.

Basically, whoever writes the compiler also writes the language services for editor use cases. This is already happening in modern compilers (Scala/TypeScript/new C# compiler). But assuming dozens or hundreds of these unique DSLs, each implemented by a different team/company/org with a different background in language design / compilers, will create a real mess :sad.

bd82 commented 8 years ago

Implementing "On the fly adaptable syntax & semantics" using Chevrotain.

Semantics.

Firstly, Chevrotain only deals with syntax, not semantics. If you make sure to create a simple structure representing only the syntax (concrete parse tree) as the output of a parser, then all adaptable/custom semantics concerns can be resolved.

Basically never embed code actions in your parser that deal with semantics. Chevrotain could help with this however if I succeed in implementing automatic concrete parse tree creation and an API to invoke custom actions on it. (similar to what Antlr4 and ohm.js do). So grammars will remain completely pure and all semantic actions can be invoked on the fly and customized.

Syntax.

Chevrotain does not perform any code generation from a grammar representation, which can make it more adaptable. However, it does the opposite: it generates a grammar representation derived from the parser's source code. This means that Chevrotain expects most information about the language to already be present during parser initialization, and changes to the grammar to be represented as changes to that same source code.

What this means is that Chevrotain Parsers can be adapted during runtime, but only before parser initialization, not during the parsing flow itself. Chevrotain is also written in JavaScript(TypeScript) which means that this limitation can probably be worked around, however this is not something I would recommend because if I wanted a truly adaptable "on the fly" solution I would go for a completely interpreted parsing engine which is much more suited for such a requirement. (at the cost of one order of magnitude slower performance).

Simulating on the fly parser adaptability.

If we assume two constraints:

Identifiable lexical delimiters around blocks of each DSL group.
Import statements may only appear at the top (header) of each block.

A possible approach would be to perform two phase parsing.

Phase one - split the input into blocks/sections where each block has a header and contents.
- The contents should be read as a raw string.
Phase two - Using the information from each header, an appropriate parser will be assembled at runtime to deal with the content of the relevant block.
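The two phases can be sketched roughly as follows, in plain JavaScript without Chevrotain. Everything named here is an assumption for illustration only: the `@begin` / `@end` delimiter lines, the `import <name>;` header form, and the `parserFactories` map from DSL name to a parse function. Nested blocks are not handled.

```javascript
// Phase one: split the input into blocks, each with a header (imports)
// and raw string contents. Lines outside any block are ignored.
function splitBlocks(source) {
  const blocks = [];
  let current = null;
  for (const line of source.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "@begin") {
      current = { imports: [], contentLines: [], inHeader: true };
    } else if (trimmed === "@end") {
      if (!current) throw new Error("@end without @begin");
      blocks.push({
        imports: current.imports,
        content: current.contentLines.join("\n"),
      });
      current = null;
    } else if (current) {
      const m = current.inHeader && /^import\s+(\w+);$/.exec(trimmed);
      if (m) {
        current.imports.push(m[1]);
      } else if (trimmed !== "" || !current.inHeader) {
        current.inHeader = false; // first non-import line ends the header
        current.contentLines.push(line);
      }
    }
  }
  if (current) throw new Error("@begin without @end");
  return blocks;
}

// Phase two: pick a parser for each block based on its header.
// Simplification: only the first imported DSL parses the content.
function parseBlocks(source, parserFactories) {
  return splitBlocks(source).map((block) => {
    for (const name of block.imports) {
      if (!parserFactories.has(name)) throw new Error("Unknown DSL: " + name);
    }
    if (block.imports.length === 0) return { raw: block.content };
    return parserFactories.get(block.imports[0])(block.content);
  });
}

const factories = new Map([
  ["music_dsl", (text) => ({ notes: text.trim().split(/\s+/) })],
]);
const parsed = parseBlocks("@begin\nimport music_dsl;\nC C D\n@end\n", factories);
```

In a real setup the functions in `parserFactories` would build or look up an actual parser instance; here they are trivial stand-ins.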

Notes:

The delimiters may be implicit (always one at the start and end of a file) or explicit.
There may be a need to lexically identify the header sections as well.
These delimiters may make the source code uglier, so they do not necessarily have to be visually displayed in the Editor's UI at all times.
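
Phase one could look roughly like the sketch below. It is plain TypeScript rather than Chevrotain API, and the `%%begin` / `%%end` delimiters, the `Block` shape and the `splitIntoBlocks` / `parseImport` names are all invented for illustration. It only shows the idea of reading each block as a header (the imports) plus raw contents.

```typescript
// Hypothetical block syntax, made up for this sketch:
//   %%begin
//   import "sql.dsl";
//   <raw contents>
//   %%end
interface Block {
  imports: string[]; // DSL files named in the block header
  contents: string;  // raw, unparsed text handed to phase two
}

// Returns the imported file name, or null if the line is not an import.
function parseImport(line: string): string | null {
  const m = /^\s*import\s+"([^"]+)"\s*;\s*$/.exec(line);
  return m === null ? null : (m[1] ?? null);
}

function splitIntoBlocks(input: string): Block[] {
  const blocks: Block[] = [];
  let current: { imports: string[]; body: string[]; inHeader: boolean } | null = null;

  for (const line of input.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "%%begin") {
      if (current !== null) throw new Error("nested blocks are not supported");
      current = { imports: [], body: [], inHeader: true };
    } else if (trimmed === "%%end") {
      if (current === null) throw new Error("%%end without %%begin");
      blocks.push({ imports: current.imports, contents: current.body.join("\n") });
      current = null;
    } else if (current !== null) {
      const imported = current.inHeader ? parseImport(line) : null;
      if (imported !== null) {
        current.imports.push(imported);
      } else {
        // The first non-import line ends the header; the rest is raw content.
        current.inHeader = false;
        current.body.push(line);
      }
    }
    // Text outside of any block is ignored in this sketch.
  }
  if (current !== null) throw new Error("unterminated block");
  return blocks;
}
```

Phase two would then pick or assemble a parser per block based on `imports` and run it over `contents`.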

On assembling an appropriate parser on demand (grammar extensibility).

Chevrotain has some dynamic features that can help with this, such as token inheritance and grammar inheritance, additionally I've created Grammar mix ins in a proto version of Chevrotain so expanding on the assembly / dynamic creation capabilities seems possible.

Notes:

Such extensions do not necessarily have to be a part of the core Chevrotain library.
Such extensions may lead to ambiguities.
- Imagine two grammar plugins for some language, both adding a new statement that starts with "A B (C)". Telling the two statements apart means looking past "(C)", which may require infinite lookahead to expand, so the combined grammar is no longer LL(K).
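
A simplified way to see the problem is the sketch below. It is not how Chevrotain computes lookahead; it treats every alternative as a flat token sequence (so an unbounded "(C)" is not modeled), and the `Alternative` type and `findLookaheadConflicts` name are made up for this example. It only shows that a fixed K fails as soon as the shared prefix is at least K tokens long.

```typescript
interface Alternative {
  plugin: string;   // which (hypothetical) grammar plugin contributed it
  tokens: string[]; // flat token sequence of the statement
}

// Returns the pairs of plugins whose statements look identical
// when only the first k tokens are inspected.
function findLookaheadConflicts(alts: Alternative[], k: number): [string, string][] {
  const conflicts: [string, string][] = [];
  alts.forEach((a, i) => {
    alts.slice(i + 1).forEach((b) => {
      const prefixA = JSON.stringify(a.tokens.slice(0, k));
      const prefixB = JSON.stringify(b.tokens.slice(0, k));
      if (prefixA === prefixB) {
        conflicts.push([a.plugin, b.plugin]);
      }
    });
  });
  return conflicts;
}
```

For two statements `A B ( C ) X` and `A B ( C ) Y`, this reports a conflict for every k up to 5 and none at k = 6. If "C" can expand to arbitrarily many tokens, there is no k large enough.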

KrishnaPG commented 8 years ago

Thank you @bd82

What this means is that Chevrotain Parsers can be adapted during runtime, but only before parser initialization, not during the parsing flow itself.

I am not much of an expert on parsers or their implementation, but to think aloud, what prevents doing something like the below:

Say each import statement causes a parser object to be created (one that knows how to parse that particular DSL). So, essentially, when one says:

import  "music.dsl";
import "sql.dsl";

the above statements result in two parser objects in memory, say parser.music and parser.sql (and of course the host/parent parser object itself is already there, making a total of 3).

Then when one encounters this kind of grammar, for example:

definition: 
      music_alarm {MUSIC_STRING}

where MUSIC_STRING was previously defined in music.dsl, then when the host / parent parser encounters something like the below:

music_alarm {
  C D C G
}

then it knows that music_alarm by definition expects a MUSIC_STRING object to follow it (within braces {}), so it would invoke parser.music to parse the content { C D C G } and move on with the rest as is.

In other words, the host / parent parser just acts as a container of different sub-parsers and keeps delegating to the right parser (based on the definition supplied by the user).

Since JavaScript objects can be extended at any time, parser.music, parser.sql and others such as parser.c++ can be created on the fly (based on the import statements).

Would that kind of thing be possible with Chevrotain?

bd82 commented 8 years ago

@KrishnaPG What you are describing above is not very different from what I've described. The differences are granularity (statement level vs block level choice of parser) and position of the import statements (start of document vs start of blocks).

The reason I suggested a clearer division of one block per grammar is the following catch-22 situation:

You said:

In other words, the host / parent parser just acts as a container of different sub-parsers and keeps delegating to the right parser (based on the definition supplied by the user).


import  "music.dsl";
import "sql.dsl";
import "smart_home.dsl";

// SQL
select name, age from customer where age > 60

// Music
music_alarm {
  C D C G
}

// Smart Home Robot
activate robotic helper bob and
insert three bottles of soda to the fridge.

The big question is: how do you decide which parser to activate for each line/statement? In the smart home DSL the word "insert" can start a statement, but that is also true for the SQL DSL, and in some other DSL "insert" may be recognized as a plain "Identifier" (as it would be under Java lexer rules).

Basically, how can you tell which parser to use without first parsing the sentence? This is a chicken-and-egg problem, a cyclic dependency.
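
A toy sketch makes the problem concrete. The first-word tables below are invented stand-ins for real sub-parsers, and `dslStarters` / `candidateDsls` are hypothetical names; a real parser would of course look at much more than the first word.

```typescript
// Each DSL only declares which first words may start one of its statements.
interface DslStarter {
  dsl: string;
  firstWords: Set<string>;
}

const dslStarters: DslStarter[] = [
  { dsl: "sql", firstWords: new Set(["select", "insert", "update", "delete"]) },
  { dsl: "smart_home", firstWords: new Set(["activate", "insert"]) },
  { dsl: "music", firstWords: new Set(["music_alarm"]) },
];

// Returns every DSL that claims it could parse the given statement.
function candidateDsls(statement: string): string[] {
  const firstWord = (statement.trim().split(/\s+/)[0] ?? "").toLowerCase();
  return dslStarters
    .filter((starter) => starter.firstWords.has(firstWord))
    .map((starter) => starter.dsl);
}
```

With these tables, `candidateDsls("insert three bottles of soda to the fridge.")` returns both `"sql"` and `"smart_home"`, so the host cannot pick a parser without trying to parse the statement first.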

It can get worse: in some edge cases the same line/sentence may be parsable by different parsers with different meanings. Or different DSLs may have different whitespace sensitivity rules, so that in a certain combination of statements the previous one ruins the whitespace for the following one.

The approach I suggested earlier tries to resolve this by creating very clear delimiters, that is, unambiguous lexical rules for which parser should be used where. After a single pass with the "sections dividing parser" we can do as you suggested and run each appropriate parser on each appropriate input, and we know how to match one to the other.

But it could be a bit ugly because of the need for delimiters.

KrishnaPG commented 8 years ago

Thank you @bd82

Now I see the disconnect. The host parser supports multiple syntaxes, but the syntaxes cannot appear at arbitrary places in the text. They have to follow the host syntax grammar (the "definition supplied by user" I was referring to earlier).

You are right about the delimiters. The only difference is that I am suggesting not to keep them fixed, but rather to have them driven by the user. Say the user indicates with the first import what base syntax to expect for the rest of the file.

For example, consider a base syntax definition file as below:

// file: base.dsl
import  "music.dsl";
import "sql.dsl";
import "smart_home.dsl";

<SQL_STATEMENT> WS <MUSIC_ALARM_STATEMENT> WS <MULTIPLE_ROBOT_STATEMENTS>

Now the host parser knows to expect SQL statements first, followed by MUSIC, followed by ROBOT statements (from the above base DSL definition), so your example:

import "base.dsl";

select name, age from customer where age > 60

music_alarm {
  C D C G
}

activate robotic helper bob and
insert three bottles of soda to the fridge.

becomes parsable. Agreed, this is rigid, but if that is what the user wants, let it be.

Or else, my personal favorite (LaTeX style):

// file: base.dsl

import  "music.dsl";
import "sql.dsl";
import "smart_home.dsl";

definition: 
\Sql { SQL_STATEMENT }

definition: 
\Music { MUSIC_STATEMENT }

definition: 
\Smart { SMART_HOME_STATEMENTS }

Here I am trying to define my own delimiters, such as \Sql, that tell the parser to expect a SQL_STATEMENT wherever a \Sql entry appears. That way I can rewrite your example as:

import "base.dsl";

\Sql { select name, age from customer where age > 60 }

\Music { 
    music_alarm {
        C D C G
    }
}

\Smart {
    activate robotic helper bob and
    insert three bottles of soda to the fridge.
}

\Sql { select something from somewhere  }

The order could be changed and the blocks could start anywhere, since they are all top-level definitions that tell the host parser which sub-parser to delegate to.
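
As a rough sketch of how a host could find such blocks (plain TypeScript, not Chevrotain API; `DelimitedBlock` and `scanBlocks` are names made up here), one can scan for `\Name {` and match braces up to the closing `}`. It assumes braces inside a block are balanced, which would break if a sub-DSL allowed an unmatched brace inside a string literal, for example.

```typescript
interface DelimitedBlock {
  tag: string;  // e.g. "Sql" for a \Sql { ... } block
  body: string; // raw text between the outer braces
}

function scanBlocks(source: string): DelimitedBlock[] {
  const blocks: DelimitedBlock[] = [];
  const opener = /\\([A-Za-z_]\w*)\s*\{/g;
  let m: RegExpExecArray | null;

  while ((m = opener.exec(source)) !== null) {
    const bodyStart = opener.lastIndex;
    let depth = 1;
    let pos = bodyStart;
    // Walk forward until the brace that closes this block.
    while (pos < source.length && depth > 0) {
      const ch = source.charAt(pos);
      if (ch === "{") depth++;
      else if (ch === "}") depth--;
      pos++;
    }
    if (depth !== 0) throw new Error(`unclosed block \\${m[1]}`);
    blocks.push({ tag: m[1] ?? "", body: source.slice(bodyStart, pos - 1).trim() });
    opener.lastIndex = pos; // resume scanning after the closing brace
  }
  return blocks;
}
```

Each resulting `body` can then be handed as-is to whichever sub-parser the host definition associates with `tag`.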

To answer your question: how to decide which parser to activate for each line/statement?

That would be determined by the host grammar supplied by the user.

How does one know what is the host grammar?

One (reasonable) way is perhaps to treat whatever is the first import as specifying the host grammar (which tells how to parse the rest of that file).

In other words, user code is a self-describing entity that tells (with the first import) how to parse the rest of its content (like an XML schema embedded inside an XML file).

This self-describing, machine-digestible entity concept is very important as we move from the information age towards machine learning, and it is in line with the OKFN (Open Knowledge Foundation)'s Data Package standards.

Imagine a CSV file that carries self-describing info on each column's datatype (in parser terms, how to parse each column and derive the actual value).

Analytics pipelines currently suffer from the lack of this info (something that the Data Packages and other initiatives from the OKFN aim to solve in general). IoT devices need this capability to be able to "communicate" with diverse devices and invoke each other's actions.

Bottom line: I think you and I are almost on the same line in this regard, except I am suggesting to make the host's grammar also be configurable / adaptable, supplied by the user, which means not fixing the delimiters ourselves but leaving them to the user. The goal is to achieve self-describing user scripts.

Which exact syntax would one use to write those base/host grammar self-descriptions? One could use any of the standard grammar syntaxes, such as PEG or Chevrotain, with support for additional commands such as import, which are needed to make the resulting grammar parsable on the fly.

bd82 commented 8 years ago

I think you and I are almost on the same line in this regard, except I am suggesting to make the host's grammar also be configurable / adaptable, supplied by the user, which means not fixing the delimiters ourselves but leaving them to the user. The goal is to achieve self-describing user scripts.

Yes, what you are describing is just an extension of my more naive and "fixed" approach.

Given:

A description of a host grammar.

// file: base.dsl

import  "music.dsl";
import "sql.dsl";
import "smart_home.dsl";

definition: 
\Sql { SQL_STATEMENT }

definition: 
\Music { MUSIC_STATEMENT }

definition: 
\Smart { SMART_HOME_STATEMENTS }

Implementations of parsers for the sub-grammars (music.dsl / sql.dsl / smart_home.dsl).

I would (once, as the following parts are generic):

Implement a parser that can read the host definition and create a data structure representing it.
Implement an interpreter that executes over that data structure and invokes the appropriate sub parser's rules in the correct order.
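
A minimal sketch of these two generic parts, under loud assumptions: it is plain TypeScript, not Chevrotain API; the host definition is read with a regular expression instead of a real parser; and sub-parser start rules are modeled as plain functions in a map. `readHostDefinition`, `runHost` and `StartRule` are hypothetical names.

```typescript
// A stand-in for a sub-parser rule that can be invoked directly as a start rule.
type StartRule = (input: string) => unknown;

// Part 1: read the host definition into a table: delimiter tag -> rule name.
// For example, "\Sql { SQL_STATEMENT }" becomes "Sql" -> "SQL_STATEMENT".
function readHostDefinition(text: string): Map<string, string> {
  const table = new Map<string, string>();
  const entry = /definition:\s*\\(\w+)\s*\{\s*(\w+)\s*\}/g;
  let m: RegExpExecArray | null;
  while ((m = entry.exec(text)) !== null) {
    const [, tag, rule] = m;
    if (tag !== undefined && rule !== undefined) {
      table.set(tag, rule);
    }
  }
  return table;
}

// Part 2: interpret already delimited blocks by invoking the matching rule
// of the matching sub-parser, in document order.
function runHost(
  host: Map<string, string>,
  rules: Map<string, StartRule>,
  blocks: { tag: string; body: string }[]
): unknown[] {
  return blocks.map(({ tag, body }) => {
    const ruleName = host.get(tag);
    if (ruleName === undefined) throw new Error(`unknown delimiter \\${tag}`);
    const rule = rules.get(ruleName);
    if (rule === undefined) throw new Error(`no sub-parser exposes rule ${ruleName}`);
    return rule(body);
  });
}
```

The `rules` map is where multiple start rules matter: every rule named in the host definition has to be callable on its own, not only through a single top rule.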

This (limited) scenario does not require any special parsing capabilities such as dynamic tokens, dynamic parser creation or extension. The only thing that appears to be required is supporting multiple start rules, so that individual rules from the sub grammars could be invoked directly instead of being limited to a single top rule.

In that case PEG.js would not be suitable, as that feature request has been open for nearly three years now 😢. But any parsing library that supports multiple start rules may be used. The more dynamic capabilities would mostly be relevant while authoring those sub grammars.

As before, the main worry is that these custom-defined host grammars could contain ambiguities, mostly around common prefixes. Host grammar authors can resolve these by adding delimiters, but that does not solve the problem of detecting the ambiguities in the first place.

A possible approach would be to build a structure defining the host's grammar, including the sub grammar parts. This is possible with Chevrotain, as the grammar structure is represented with its own classes, and the grammar validations are independent of the parsing flow and can be executed directly.

bd82 commented 8 years ago

Closing this as it seems the discussion has finished. 😄