Chevrotain / chevrotain

Parser Building Toolkit for JavaScript
https://chevrotain.io
Apache License 2.0
2.51k stars 206 forks source link

CST Creation customizations #603

Closed CMCDragonkai closed 6 years ago

CMCDragonkai commented 7 years ago

I noticed something strange about this:

  body = this.RULE('body', () => {
    this.MANY(() => {
      this.SUBRULE(this.key);
      this.SUBRULE(this.value);
    });
  });

vs

  body = this.RULE('body', () => {
    this.MANY(() => {
      this.SUBRULE(this.keyValue);
    });
  });

  keyValue = this.RULE('keyValue', () => {
    this.SUBRULE(this.key);
    this.SUBRULE(this.value);
  });

In the first case, the output CST results in a case where the keys are grouped into their own array, and the values are then grouped in their own array. While in the second case the key and value are kept within a single key value combination, and then grouped into an array.

I had it the first way first, and tried to create an interpreter for the resulting output cst, but realised that it wasn't going to work because of how the key and value was separated. So I tried to do it the second way instead. But I think it's somewhat non-intuitive that:

this.MANY(() => {
  this.SUBRULE(this.k);
  this.SUBRULE(this.v);
});

Doesn't mean how (kv)* would be interpreted to mean a sequence of kvs, and should give [kv, kv, kv]. But instead right now it's [k,k,k,k], [v,v,v,v].

Another example is:

  value = this.RULE('value', () => {
    this.MANY(() => {
      this.OR([
        {ALT: () => { this.CONSUME(ValueSpaceT); }},
        {ALT: () => { this.CONSUME(ValueStringT); }},
        {ALT: () => { this.CONSUME(ValueQuotedStringT); }}
      ]);
    });
  });

In the output CST I get ValueSpaceT: [...], ValueStringT: [...], ValueQuotedStringT: [...]. Now I don't know what the relative order between each token is, so I cannot concatenate the value of all these tokens.

Compare to the usage of embedded actions: https://github.com/sgarciac/bombadil/blob/master/src/parser.ts#L8-L21 which shows the ability to rely on the exact ordering to perform side effects during the parsing.

bd82 commented 7 years ago

Both approaches are possible either by manually extracting to additional rules or using inlined rules. which allow limited "tree rewriting".

The CST building is very low level and "stupid" so it cannot really guess if the user wants another level or nesting or not. Initially I tried to add more logic to the CST creation, but it became too complicated, so I just went with as simple set of rules as possible.

had it the first way first, and tried to create an interpreter for the resulting output cst, but realised that it wasn't going to work because of how the key and value was separated.

Why won't it work?

function visitbody(ctx) {
// https://lodash.com/docs/4.17.4#zip
const keyValuePairs = _.zip(ctx.keys, ctx.values)
// work on joined pairs now.
}

Now I don't know what the relative order between each token is, so I cannot concatenate the value of all these tokens.

It is a little ugly, but you could re-compute the order based on the location information. However a better solution would probably be to used a in-lined rule for the alternation.

this.RULE('values', () => {
    this.MANY(() => {
      this.OR({
       NAME: "singleValue",
       DEF: [
          {ALT: () => { this.CONSUME(ValueSpaceT); }},
          {ALT: () => { this.CONSUME(ValueStringT); }},
          {ALT: () => { this.CONSUME(ValueQuotedStringT); }}
      ]});
    });
CMCDragonkai commented 7 years ago

One of the problems is that some keys may not have values (although in this case it's not a problem because no value means empty string). Zipping also doesn't allow future refactorings to be tail-recursive (I think).

I moved to using embedded actions instead simply because the order of evaluation is easier to understand.

bd82 commented 7 years ago

I moved to using embedded actions instead simply because the order of evaluation is easier to understand.

Embedded actions are indeed the most powerful low level approach. At the cost of more verbose grammars.

A possible hybrid solution would be to use embedded actions to build your own low level data structure and have the more complex processing on that data structure post parsing.

But I am not sure your grammar is complex enough to warrant that.

CMCDragonkai commented 7 years ago

So I came up with a different CST that I think is more flexible (that I'm using atm):

type CST = Leaf | Node;
type Node = {
  name: string,
  childList: Array<CST>,
  childDict: { [name]: Array<CST> }
};
type Leaf = IToken;

This allows ordered access into the relative parts, while also allowing member access that is ordered in their apperance as well. The name of the dictionary should be limited to either the rule name, or the token name. Both childList and childDict ends up with pointers to the underlying CST object, so there's no real increase in memory usage, except for a doubling of pointers.

bd82 commented 7 years ago

Your comment raises the question of supporting custom CST structures.

How are you building the structure you described? using embedded actions or during post processing on the original AST.

You could probably already customize the internals of building the CST to also create The ordered array.

Have a look at: https://github.com/SAP/chevrotain/blob/0b874a5cd5b89cf92ab528766c3fdc518ec779f1/src/parse/parser_public.ts#L3171-L3207

Some refactoring could be done to make this easier:

Do you think that easier to customize CST is a worthwhile feature?

CMCDragonkai commented 7 years ago

I am using embedded actions to do this. Actually really I just make every rule construct and return a CST. So it's not that noisy. As for customising the CST, I think it would be a good idea. I just think that by default, ordered access should be available since it's really useful for traversal. (Serialisation of CST back into the file).

bd82 commented 7 years ago

I am not sure about supporting this by default as not all(most?) use cases do require serialization back into a file and the creation of additional arrays can have a substantial performance cost. Also I'm not sure how to modify the CSTVisitor to expose the ordered list.

However, A little refactoring to support some customization may allow advanced use cases such as yours without extending the scope of the built-in capabilities of Chevrotain.

I will keep this as an open feature request.

CMCDragonkai commented 7 years ago

This is just one more array for each node, is the performance cost that significant? It's not just serialisation back to the file, but during the CST visitor itself may required ordered access, such as the case of my original value example, that's for translating a config file into an hashmap.

bd82 commented 7 years ago

is the performance cost that significant?

It might need to be tested. The creation of the CST will reduce performance by 1/3 for a simple JSON grammar and by potentially more for complex grammars. I believe this is due to the creation of the great many potentially empty arrays. And need to investigate ways to optimize this.

While I understand your use case for an ordered array, I feel this is not a blocker and could be resolved in other ways without resorting to embedded actions, So I am not (yet) convinced yet that the requirement should be a native capability of Chevrotain. Maybe if/when I encounter additional requests for this scenario or implement a parser that requires it myself I will change my mind 😄

I do however see the use case for arbitrary customization of the CST nodes creation. And think that a little refactoring in Chevrotain's internals could enable an end user such as yourself
to very easily implement such a capability using dependency injection during the parser's configuration.

CMCDragonkai commented 7 years ago

If the streaming issue were resolved, it'd be possible to acquire pipelined parallelism (maybe... if the js runtime supported userspace threading).

bd82 commented 6 years ago

I will investigate supporting custom CST creation as part of 3.0 release. I think this is likely already possible by overloading a few parser methods. So it will hopefully be an issues docs and perhaps wrapping related methods in a nice clear interface.

bd82 commented 6 years ago

Looked deeper into this. This could be implemented using the following overrides of the Parser methods:

class MyParserWithCstLists extends Parser {
    cstInvocationStateUpdate(fullRuleName, shortName) {
        // default initialization
        super.cstInvocationStateUpdate(fullRuleName, shortName)

        // Init the child List
        const currentCstNode = this.CST_STACK[this.CST_STACK.length - 1]
        currentCstNode.childList = []
    }

    cstPostNonTerminal(newCstNode, newRuleName) {
        // "standard" addition to the children dictionary
        super.cstPostNonTerminal(newCstNode, newRuleName)

        // Add nonTerminal to childList
        const previousCstNode = this.CST_STACK[this.CST_STACK.length - 1]
        previousCstNode.childList.push(newCstNode)
    }

    cstPostTerminal(key, consumedToken) {
        // "standard" addition to the children dictionary
        super.cstPostTerminal(key, consumedToken)

        // Add Terminal to childList
        const previousCstNode = this.CST_STACK[this.CST_STACK.length - 1]
        previousCstNode.childList.push(consumedToken)
    }

   // ...

Of course the CST visitor would also have to be modified to expose that "childList". Or alternatively the "childList" itself could be placed on the children dictionary to work around that issue.

I think this example should suffice for now, I will reconsider investing more effort into this (proper interfaces / runnable examples) if and when there would be more requests for such capabilities.

twirlse commented 6 years ago

I'm also interested in generating custom CST output, mainly to perform post-parse analysis on CST. I tried to copy paste your code from the last comment, but it seems that those methods are not exposed by parser. Is there anyway to do this ATM?

bd82 commented 6 years ago

These methods still exist on a parser instance. See the tree_builder parser trait (it gets mixed in to the parser prototype).

But those are non-public methods, they are not exposed as part of the public APIs. But this does not mean you cannot override these methods, they are still available at run-time.

Are you using TypeScript to compile your project?

twirlse commented 6 years ago

Yeah I noticed them in source code, but not in public API. We are using TypeScript, I guess I'm not that familiar with the language to figure out how to access them. I guess I could use Parser.prototype and cast it, but it seems ugly. Any help would be appreciated! :)

bd82 commented 6 years ago

I got one of the methods to compile using some casing and "@ts-ignore" comments.

    cstPostTerminal(this:any, key, consumedToken) {
        // "standard" addition to the children dictionary
        // @ts-ignore
        super.cstPostTerminal(key, consumedToken)

        // Add Terminal to childList
        const previousCstNode = this.CST_STACK[this.CST_STACK.length - 1]
        previousCstNode.childList.push(consumedToken)
    }
twirlse commented 6 years ago

Thanks for help, I'll look into it.