getify / concrete-syntax-tree

Defining a standard JavaScript CST (concrete syntax tree) to complement ASTs

Constituency tree structure #2

Closed jeffmo closed 9 years ago

jeffmo commented 10 years ago

So I've been trying to think about what kind of structure might make sense if we were to just start from scratch (not that we necessarily have to, I just wanted to explore the solution space a bit and have a target to shoot for). So I threw together a rough and dirty structure proposal, and wanted to get some people looking at it and discussing.

I'll back up and mention that my focus here is on use-cases for code analysis and source compilation (not so much code execution). There may be some trade-offs to be made in the design of the structure depending on which of those two routes is most desired -- but it seems to me that the vast majority of the uses of JS syntax trees in the wild these days are for source code analysis/transformation.

Here's roughly what I've come up with, and it's mostly inspired by a few of the other syntax tree forms we use in other languages. I took the given example code in the readme and have been targeting that:

/*1*/ function /*2*/ foo /*3*/ ( /*4*/ ) /*5*/ {
  // ..
}

Proposed CST form: https://gist.github.com/jeffmo/9690116 (This is the output of a tiny, super incomplete parser I hacked together. I'm happy to release the code if we decide we want to pursue this style of structure further...but I figured it may be a bit early to start anchoring or discussing detailed parser implementations too much here)

The high level tenet of the structure is that tree nodes are categorized into two conceptual types: branch nodes and leaf nodes (leaf nodes are just token nodes). The tree does not include whitespace tokens as I felt they would be quite noisy and redundant with the location information (which should make it possible to re-construct the whitespace). EDIT: As pointed out below in the comments section, this is an incorrect assessment about whitespace reconstructability. I need to find some time to come back and address this (likely by adding support for whitespace tokens)

Branch nodes stick to the form {type: "string", children: [array], loc: {location obj}}. Leaf/token nodes stick to the form {type: "string", value: <token-specific value type>, loc: {location obj}}

To further simplify parsing a bit, I've also squashed some token types together -- such as keywords (i.e. {"type": "KeywordToken", "value": "function", "loc": {...}}) and punctuators (i.e. {"type": "PunctuatorToken", "value": "(", "loc": {...}}) similar to esprima's node structure for tokens.
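To make the two shapes concrete, here is a minimal hypothetical sketch (the field values are illustrative, not taken from the gist):

```javascript
// Hypothetical sketch of the two node shapes described above.
// A leaf/token node: type + value + loc.
const leaf = {
  type: "KeywordToken",
  value: "function",
  loc: { start: { line: 1, column: 6 }, end: { line: 1, column: 14 } }
};

// A branch node: type + children + loc (children mix branches and tokens).
const branch = {
  type: "FunctionDeclaration",
  children: [leaf /* , identifier, punctuators, body, ... */],
  loc: { start: { line: 1, column: 0 }, end: { line: 3, column: 1 } }
};

// A generic walker only needs to distinguish the two shapes:
function walk(node, visit) {
  visit(node);
  if (Array.isArray(node.children)) {
    for (const child of node.children) walk(child, visit);
  }
}
```

One appeal of this uniformity is that generic tooling (printers, traversers) never needs node-specific knowledge: it only checks for a `children` array.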

Pros:

Cons:

I have far from explored all the edges of the language with this design, so I'm interested in hearing people's thoughts on it. I'm also happy to edit the pros/cons above as they come up in any comments people have.

getify commented 10 years ago

One quick question at first glance...

The tree does not include whitespace tokens as I felt they would be quite noisy and redundant with the location information (which should make it possible to re-construct the whitespace).

I've heard this suggested several times before. I am quite skeptical. How can "location information" for other concrete nodes tell me what whitespace was in between it and the previous concrete node? For example, it could have been a series of tabs and spaces intermixed, or there could have been other non-visible whitespace. There could have been new-lines (though I bet location info that tracks line numbers could help there), but there could also have been carriage-return characters, which wouldn't necessarily show up as whitespace.

I'm hoping I'm missing something, but I suspect this isn't quite as possible in practice as it may sound in theory. Can you correct some misunderstanding of mine?

millermedeiros commented 10 years ago

A few comments about it (did not spend too much time analyzing it yet)...

White Spaces can NOT be reconstructed based on loc/range info. "\n\r\n\r \t" is very different from " \n \n ". That is the main reason why on rocambole I store all the WhiteSpace tokens.
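A tiny sketch of why the naive reconstruction fails (the token shape here is assumed for illustration): filling inter-token gaps with spaces reproduces only one of the several different sources that can yield identical loc info.

```javascript
// Naive regeneration: fill the gap between tokens with spaces, using columns.
function naiveJoin(tokens) {
  let out = "", col = 0;
  for (const t of tokens) {
    out += " ".repeat(t.loc.start.column - col) + t.value;
    col = t.loc.end.column;
  }
  return out;
}

// "a  b" (two spaces) and "a \tb" (space + tab) can produce identical
// token streams when a tab advances the column count by one:
const tokens = [
  { value: "a", loc: { start: { column: 0 }, end: { column: 1 } } },
  { value: "b", loc: { start: { column: 3 }, end: { column: 4 } } }
];

naiveJoin(tokens); // "a  b" -- right for one source, wrong for the other
```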

I also find it unnecessary to split the MultiLineComment into start/end/text tokens (multiline comments always start/end with the same tokens). The same goes for SingleLineComment. - I also think MultiLine and SingleLine are the wrong names, since you can have a /* comment */ starting and ending on the same line... on rocambole I used the names BlockComment and LineComment.

Also think it should be Punctuator instead of PunctuatorToken, since it's redundant.

millermedeiros commented 10 years ago

I also agree that the format should probably follow a structure similar to this proposal -- that is exactly what I meant in my comment when I said:

I think this problem requires a specific tree structure, probably Left-to-Right and using more Arrays than a regular AST. (structure sequence can be a mix of nodes and tokens..)

But to be honest I think we need to define the use cases and which problems we are trying to solve (scope) to be able to think about pros/cons.

jeffmo commented 10 years ago

Ah yes, you're both right. I was (very wrongly) taking it for granted that whitespace reconstruction could be done simply with spaces -- but that is especially untrue with \r\n.

Ok, we may have to include whitespace tokens after all if our goal is to be truly lossless.

jeffmo commented 10 years ago

On multi-line start/end tokens: I don't feel super strongly about this, but my thought process was that having access to the comment text is a pretty common need, so having the parser do this for those scenarios seemed convenient. shrug

millermedeiros commented 10 years ago

@jeffmo what I did in these cases was to add 2 properties, value and raw, where value returns just the comment text and raw returns the full comment (including start/end tokens). During my toString process I use raw if it exists, or fall back to value. - it's indeed useful to have access to just the comment text!
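A rough sketch of that scheme (the stripping helper is hypothetical; rocambole's actual implementation may differ):

```javascript
// Derive `value` (comment text only) from `raw` (full text with delimiters).
function commentValue(raw) {
  if (raw.startsWith("/*")) return raw.slice(2, -2); // BlockComment: /* ... */
  return raw.slice(2);                               // LineComment:  // ...
}

const comment = { type: "BlockComment", raw: "/* hi */" };
comment.value = commentValue(comment.raw); // " hi "
```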

jeffmo commented 10 years ago

Yea that makes sense. I do think the standard should pick just one of the two as the canonical representation of the value, though. Having redundant properties like raw and value could lead to inconsistencies across tools as to which one is the 'source of truth', for example.

Additionally, since this structure is just a constituency tree of token nodes, having redundant nodes in the structure seems a bit awkward. For example: you couldn't have a dumb tree renderer that works without any node-specific info, because it would need to know things like "comments have both a raw and a value child/token, so don't spit out both".

getify commented 10 years ago

what I did in these cases was to add 2 properties value and raw;

In general, I'm not a big fan of this sort of approach. I know it's very common, but I think we should be mindful of extra memory usage, especially on huge trees, or even small trees with huge elements (like a giant 50-line JSDoc comment).

Duplicating the string contents with and without the delimiters can waste lots of extra memory (the contents duplicated, that is). Also, think about these trees serialized into string form (JSON, etc), and passed around between tools (pipes, etc). We would want a tree that is sufficient, but not unnecessarily bloated.

ljharb commented 10 years ago

We also need support for \t - specifically to avoid the flame war, regardless of your feelings about it.

Also, there's many other whitespace characters that we should retain. I should be able to generate a CST for literally any syntactically valid JS file, and reconstruct it verbatim.

getify commented 10 years ago

@ljharb of course, agreed on all counts.

jeffmo commented 10 years ago

@ljharb: Yes I agree now. I plan to update my proposal to re-phrase and include whitespace tokens -- just need a little time to get to doing it

getify commented 10 years ago

Another thing I believe I mentioned somewhere/elsewhere once before: CSTs have to be able to represent things which are concrete syntactic elements (not just whitespace/comments) but which ASTs would normally discard as "unnecessary". For instance:

var a = (((1)));

Parses to:

{
    "type": "Program",
    "body": [
        {
            "type": "VariableDeclaration",
            "declarations": [
                {
                    "type": "VariableDeclarator",
                    "id": {
                        "type": "Identifier",
                        "name": "a"
                    },
                    "init": {
                        "type": "Literal",
                        "value": 1,
                        "raw": "1"
                    }
                }
            ],
            "kind": "var"
        }
    ]
}

As you can see, the unnecessary ( ) are "lost" in an AST, but they most definitely need to be retained in a CST.

jeffmo commented 10 years ago

Right, those kinds of tokens would be covered accurately by this kind of structure since it's, in some senses, just a tree representation of the token stream.

Maybe I should just clean up the hacky little excuse for a parser that I made and put it out with some basic tests so others can play around as well. I was hesitant to do this because I wasn't sure if this was the same kind of direction everyone had in mind -- but maybe it's worth it just for discussion's sake.

I most likely won't be able to get to this until Monday, though

getify commented 10 years ago

FWIW, I'm not sold on forking to a very different structure than what we have for AST. There's a number of flows that likely assume the ability to easily "convert" from CST to AST, and I don't want that to be a complex transformation.

If CST ends up just a super-set of AST, as I hope it will be, then CST to AST is as simple as an O(n) tree walk to remove CST "extras".
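As a sketch of what that walk could look like, assuming the CST is just the AST with an added property on each node (here called `extras`, a name used elsewhere in this thread):

```javascript
// CST -> AST: one recursive pass that drops the CST-only annotations.
function toAST(node) {
  if (node === null || typeof node !== "object") return node;
  if (Array.isArray(node)) return node.map(toAST);
  const out = {};
  for (const key of Object.keys(node)) {
    if (key === "extras") continue; // strip the CST "extras"
    out[key] = toAST(node[key]);
  }
  return out;
}
```

Each node is visited exactly once, so the conversion is O(n) in the number of nodes.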

millermedeiros commented 10 years ago

@jeffmo you don't need a parser for now.. just plain JSON (like your gist) is enough for us to discuss the structure and think about different problems/solutions.

@getify I think current tools are not a valid excuse to limit how the tree should be structured, we should focus on the problems we are trying to solve and aim for the best viable solution. Adoption by the tools, and implementation of a parser, will be a side-effect of a good format (easy to understand, traverse and manipulate). I probably won't migrate esformatter to use a CST because it would take too long, but I probably wouldn't code rocambole if something like this existed before. - juggling AST nodes is a huge PITA.. It's way simpler when you have a sequential structure (Array or Linked List) where you can simply splice/push items wherever you want.

millermedeiros commented 10 years ago

for me the main requirements would be:

  1. need to contain all tokens of the program (white spaces, line breaks, punctuators, etc.. no data loss)
  2. need to contain basic info about the program structure (if it's an expression, function, variable declaration, etc..)
  3. easy to traverse.
  4. easy to manipulate.
  5. easy to rebuild program from structure.
  6. should be able to serialize into JSON.

LOC and range info is not a big priority since you can easily rebuild this information if you have access to all the tokens and not all programs need that info. (I would make it optional like esprima does).
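A sketch of that rebuild, assuming a flat, lossless token list where every token (including whitespace and line breaks) carries its raw text in `value`:

```javascript
// Recompute loc/range for every token in one pass over the token list.
function annotateLoc(tokens) {
  let line = 1, column = 0, index = 0;
  for (const t of tokens) {
    t.loc = { start: { line, column } };
    t.range = [index, index + t.value.length];
    for (const ch of t.value) {
      if (ch === "\n") { line++; column = 0; }
      else column++;
    }
    index += t.value.length;
    t.loc.end = { line, column };
  }
  return tokens;
}
```

Since the pass is cheap and purely derived, loc/range can stay an opt-in feature without losing anything.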

millermedeiros commented 10 years ago

in my opinion the CST is an enhanced token list, and not an enhanced AST.

getify commented 10 years ago

There are a couple of use-cases that I have in mind which would strictly fail if the CST was nothing more than a token list.

The main one is partial code transpilation, where you need to parse (not just tokenize) the program (preserving the concrete syntax elements), transform only part of the tree (like for instance, one particular series of statements), then regenerate the entire code based on the transformed tree, preserving all the original data from everywhere else except for newly created or deleted nodes that were part of the transformation.

To do such analysis and changes, I must have a hierarchical tree of parsed nodes, to understand the program and find what I need to find to change.

I think current tools are not a valid excuse to limit how the tree should be structured

While that sounds nice and idealistic, I am quite certain that we've got a BIG task ahead of us to change tree structure and get enough of the tools on board with this change that CST becomes a meaningful and productive IR that lets both existing and future tools do what they need to do.

AST is already the de facto exchange medium among these constituent tools, and so what we create has to be a suitable increment/evolution of that, not a completely new invention from scratch, as that would present an even HIGHER barrier to getting buy-in from the existing tools ecosystem.

It is strictly a failure of this project if many of the popular tools in the space think that what we come up with is "not worth it" or "not appropriate". A different tree for some and not for others is going to make life harder on developers who need to get tools to interact, not easier.

I am firmly committed to the goal that what we propose be the smallest and easiest and least painful and most likely to succeed change, not a completely different thing which alienates existing tools and ends up making the ecosystem harder to navigate than it currently is (with all the various ad hoc ways that tools fake whitespace/comment tracking).

It saddens me that you don't think a CST is something your tool would benefit from using. Our goal is not only to find a good, workable CST that solves use cases, but also to evangelize it and inspire tools authors to adopt it. That having been said, the bigger goal remains: not to be driven by any one tool or any one use-case, but to find the best option for the widest interop and adoption.

We don't know exactly what that is yet. But it's not hard for me to pattern match on the things that are quite obviously going to be harder sells and to shy away from them in favor of something that's incremental/generative.

getify commented 10 years ago

Everyone, please go read my proposed Charter/Mission in #4

edwardgerhold commented 10 years ago

There are a few more places where information gets lost, like in the StatementLists or FormalParameters. They are just Arrays behind a .body or .formals property. If you pass just the array, you can no longer see what kind of array it is, whether it's a StatementList or a list of FormalParameters. I tried adding .type properties to the arrays, too.

Another point of lost information for me is the relationship to the real ECMA-262 grammar. The AST and the grammar differ quite a bit in the words they use. In the AST you have BinaryExpression and the .operator property; in the grammar you have a separate production for each rank of operators: AdditiveExpression, MultiplicativeExpression, and so on. For generators, all this information gets lost, too, and it has to be restored using tables like { "+": "AdditiveExpression", "*": "MultiplicativeExpression" }, where operators like "+" (which also appears in UnaryExpression) show up twice. When writing my parser and collapsing the arithmetic expressions into BinaryExpression, I was quite disappointed about the information lost for printing it back out to the user. I simply added a second property to save that information.

Then there are the changed static semantics for ES6. That means there are some new properties to add, like .super. For a concrete syntax tree, the full static-semantics information like "isValidAssignmentTarget", "isAnonymousFunctionDefinition", or "ReferencesSuper" should be available, too, shouldn't it? As it stands, "isValidAssignmentTarget" is checked by asking whether node.type is "ArrayPattern" or "ObjectPattern", but every program has to implement its own test against node.type. "isAnonymousFunctionDefinition" has a special case in the grammar (maybe a bug): all the arithmetic expressions return false (it isn't just a matter of asking IsCallable(Evaluate(expr))).

One more thing: the AST is also missing a few declarations, like the BindingElements, or the ModuleDeclarations and the Export and Import statements. I don't know if there is an ongoing discussion about this for esprima or the other parsers; that's why I'm bringing up the topic. I don't know what you know about the missing nodes, but when putting it all together, they should be in the definition, too.

And I have some other observations from working with the Mozilla AST. When evaluating the code, the VariableDeclarations appearing in the FunctionBody have to be joined into a TopLevelVarScopedDeclarations list and a VarNames list, in the same order each. In the ModuleDeclarations, the Imports and Exports have to be joined into lists, too, which are then processed by the Loader algorithms. All of this should already be done at parse time for efficient parsing and compiling.

Rewriters could take advantage of these lists, too. And static analyzers of course. Or the runtime. They are defined in the actual specification as part of the static semantic requirements.

When checking the LexicalDeclarations for LetOrConst, I noticed that a .kind property on the VariableDeclarator would be better if I just put all let and const declarations into one lexical declarations list.

Sorry, I had a little trouble formulating this; I couldn't find the right words to write it down more clearly and compactly. The first paragraphs are about the AST itself; the next ones are about the static semantics and the additional/missing properties ES6 requires, as well as the top-level lists that static analysis of ES6 scripts requires. And the last point was that I found .kind missing on VariableDeclarator when writing InstantiateXxxDeclaration() from the spec.

getify commented 10 years ago

@edwardgerhold I appreciate your comments. Many of us have experienced frustrations with the AST semantics and structure. If you're interested, @ariya has a talk on JS tools that points out some big frustrations with the format we settled on: https://speakerdeck.com/ariya/the-future-of-javascript-language-tooling [ Edit: I messed up the reference. It was indeed @michaelficarra's talk SpiderMonkey Parser API: A Standard For Structured JS Representations ]

Unfortunately, I'm not sure that "fixing problems with AST structure" is in scope for our current efforts (though I agree it's a worthy effort). To the extent that there's missing information about what was in the original source code (like skipped whitespace, comments, etc), the CST needs to fix that. But some of the things you bring up sound, on the surface, like additional meta data about the intent of the program, and I'm not really sure if the AST or CST should include them.

But you do correctly point out that ES6 is going to change the AST (add to it) and we need to be sensitive to that. Thanks for the reminder of that reality. :)

FWIW, IMO that's yet another argument in support of defining a CST that is just a layer of properties on existing (and future) AST, rather than a whole separate structure from AST which could then cause our CST to possibly diverge as the main AST format evolves.

michaelficarra commented 10 years ago

@edwardgerhold: I rail on the format's shortcomings for the entire first half of my talk, SpiderMonkey Parser API: A Standard For Structured JS Representations. You might be interested in it.

getify commented 10 years ago

Oops, I messed up and referenced the wrong person/talk in my previous comment. Corrected. Thanks @michaelficarra! :)

edwardgerhold commented 10 years ago

@michaelficarra Thanks, I am. I will spend my time today studying the project's pages in depth and taking notes on the details.

I have not mentioned that I am fine with .expression and .generator on FunctionDeclarations, but sometimes I find it more convenient to have them separated by their real names: GeneratorDeclaration, GeneratorExpression, FunctionExpression. When evaluating, I have to route them all through the evaluation["FunctionDeclaration"] function, which then has to decide what to do. Actually, I think I currently have a mixture of both, which is very inconsistent about what the result should be.

Before going into depth, I can already leave an opinion concerning the .raw and .value properties on the nodes. I think .raw and .value should become a standard part of the AST. You have escape sequences and Unicode escape sequences; if you just translate the strings, the originals are gone, but we need the untranslated form, too. ES6 template strings do the same, for example: they save the .raw and the cooked .value.

Concerning whitespace, I don't have an opinion yet. I collect spaces into one string and skip them later. I think my code is still old, just skipping them when hitting them in the tokens array, or filtering them out with a function when displaying the array. LineTerminators I collect as separate tokens, as the spec requires them for ASI and statement termination. I haven't distinguished between \n and \r\n, except that I treat the latter as one token.

jeffmo commented 10 years ago

I somewhat empathize with @millermedeiros's sentiment here (though I'm not quite ready to give up on collaboration just yet).

It's hard to imagine an augmented form of the existing SpiderMonkey API that would not result in a lot of node-specific analysis requirements in order to maintain some kind of source-order traversal (for example, how would I know how to access the comment before the function keyword vs. after it -- but also a comment before a ternary condition, after the condition but before the question mark, after the question mark, etc.). The only means of doing this that I can imagine is to come up with node-specific key names to attach comments to...and this means analysis tools either have to be aware of all of these node-specific names, or must be aware of none of them (and must reconstruct the source order of the properties on the node themselves -- making the key names useless in the first place, really).

Source-order traversal is a pretty annoying pain point that I've had in many of the analysis tools I've built. That coupled with the fact that CSTs deal with far more flexibly placed bits (whitespace, comments, semicolons, even parens w/ arrow funcs, etc) suggests to me that the currently well-established AST structure just isn't best suited for this kind of data representation.

It would be interesting to see an example CST for a couple of these more complex scenarios so we can discuss the pros/cons.

jeffmo commented 10 years ago

I'd also like to point out that it should still be very possible to convert any lossless form of CST into an AST (whether the structures appear similar or not). Perf would almost certainly be much less of a bottleneck for this operation than for parsing source text (since all of the lexing work has been done already).

getify commented 10 years ago

@jeffmo

how would I know how to access the comment before the function keyword vs after it

It's true that this idea of a CST means that more than one tree could produce the same source code. That's a kind of ambiguity, but not the bad kind. The bad kind would be if an extra in some location could mean two totally different things. That would be intolerable. But that's not what we have here.

What we have here is a " " in source code could be encoded into the tree in different places. A space could be the "after" of one node, or the "before" of the one before it. But this is a solvable kind of ambiguity, fairly trivially. Let me quote myself from #3:

We state a heuristic rule, like "always put the extra on the highest level, and earliest, CST node that the extra being prepended/appended will result in proper output."

That resolves all ambiguity as to where an extra should be annotated. There will always be one "right" answer to that question. All the other places it could be placed are not "wrong", and moreover, tools that use extras would probably work just fine with such "non-standard placement", but we can AND SHOULD make it clear and explicit in the standard where they SHOULD be put. And that rule does it.

As far as using them, code should always look for the extras on every node, not caring about any of those semantics, and handle the extras inline with outputting code for that node. For example, if you have an Identifier node, and it suggests both a before and after extra of " " (single space), then you just prepend and append the " ". It doesn't take any more complex logic than that in the code generation phase, because you have to just assume that whoever created the CST did so in a sane way (according to standard and the above algorithm/heuristic), and intended the extras where they put them.
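A minimal sketch of that consumer logic (the extras shape { before, after } is assumed from this proposal; real codegen would of course handle every node type):

```javascript
// Codegen for one node type: emit before-extra + node text + after-extra,
// trusting the producer to have placed the extras sanely.
function genIdentifier(node) {
  const ex = node.extras || {};
  return (ex.before || "") + node.name + (ex.after || "");
}

genIdentifier({ type: "Identifier", name: "foo",
                extras: { before: " ", after: " " } }); // " foo "
```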

So these complexities are only for the producers of extras, not the consumers. And the standard will be clear what they should do.

The only means of doing this that I can imagine are to come up with node-specific key names to attach comments to...and this means analysis tools either have to be aware of all of these node-specific names

In my proposal, there's only 3 possible extras positional names to have to know about on any given node: before, after, and inside. before and after are pretty obvious and static in meaning. inside sorta flexes its meaning depending on the node. For an Array literal, it means between [ ], etc. We have to go through and specify what all those cases are, but I've already been doing this implicitly while writing code, and inside has been sufficient for almost all cases. There's only the few exceptions where I needed a virtual node to disambiguate (as explained in my main proposal).

I don't think that should be that onerous to tools makers to deal with those 3 semantics at each node. I have written quite a bit of experimental code in my fork of escodegen using that concept, and in most places it's pretty clean. There's a few places where it's slightly trickier, but not that much so.

The other main proposal calls for several more specific positional names, like insideParamList for instance, and I personally agree that sort of thing does make both the code to generate/track extras and the code to use them more difficult. Not unbearably so, but worse, IMO.

getify commented 10 years ago

@jeffmo

It would be interesting to see an example CST for a couple of these more complex scenarios so we can discuss the pros/cons.

If you give me a suitably "complex" code example (which I can pump through esprima and get an AST from), or just give me an AST, I will be happy to show exactly how my idea of the CST would be. Depending on what it is, I may already have code on my side which produces it, or I may just manually make the annotations to illustrate how the code would do it.

In either case, we should be able to discuss its sufficiency (want to avoid bikeshedding on personal taste, as I mentioned in the charter/mission).

bterlson commented 10 years ago

@getify Think it would be good to start collecting PRs for concrete proposals in this repo? It would be good to see what the competing proposals are, including example code and associated CST, and perhaps source code for a few transformations... Maybe just create a proposals folder that contains a .md file for each proposal?

getify commented 10 years ago

I don't mind if people want to create issue threads (or PRs with a proposals folder) if they'd like to document their ideas.

My goal is to first validate or invalidate the two main proposals before we explicitly turn our attention to other proposals, but I don't think that means we can't catalog them in the repo. :)

edwardgerhold commented 10 years ago

Edited - I made a big mistake in interpreting IsAnonymousFunctionDefinition. It's fine as it is and bubbles up the whole expression. Calling IsCallable would probably introduce a performance bug.

Meanwhile I've boiled my vision of what you mean by the CST down a bit, and I see the processors with their new additional standard feature of lossless input capturing. It then fits perfectly into the flow diagram of #3.

About the static semantic properties in EcmaScript 6:

I wrote a little text file on my hard disk listing the properties that appear in the ECMAScript 6 draft. They belong to the abstract syntax representation of ES6, like the Parser API does for ES5. My favorite outcome would be your CST definitions together with the publication of a complete (say, esprima-harmony) AST for ES6, to update and replace the SpiderMonkey Reflection; then you would have the concrete CST plus the official update of the nodes in one paper. I've cut the text file down to the bullet points but have not corrected the sentences any further. I just list each PropertyName from the table of contents and say what it is for.

That's almost everything necessary from the ES6 draft to augment the Parser API. It's less than expected when listed in order, isn't it? That is without nodes like ClassDeclaration, MethodDefinition, ModuleDeclaration, ImportStatement, ExportStatement, RestParameter, SpreadExpression, BindingElement, and Elision, which are just standard containers for the fields their productions have, and quite self-explanatory. The most difficult part is finding good property names consistent with the existing AST; the other part is the write-up. If you don't include them, you will probably spend more time analyzing these constructs when analyzing the next version's code. I find it awful to re-analyze the tree for the runtime, and I racked my brain working out how to solve this problem. For my own project I would go this way, as well as replacing the tree with bytecode, of course (in the tool; a bytecode definition would be cool, but that is really a different topic). Compiling the AST code is easier, too, if all the flags to be compiled are already known.

@getify I have now read your goal again and am thinking about how to do this. When I first translated your words, I didn't really catch it.

@getify I had overlooked your welcome above, and the fact that we share the same opinion; while offline, I boiled the vision down a bit. Sorry for confusing you a little. I think these properties could also contribute to the concreteness of the tree, because they try to meet the ECMA-262 requirements.

edwardgerhold commented 10 years ago

@jeffmo One week later, I understand what you posted first. I've just checked the gist example again, and today I understand it. I had pieces of it in memory when trying to post days ago, some pieces from reading, but no full picture. Now it's much clearer. I'll come back and leave a message when I've got something more interesting for you. The last two days I just improved my own code to make room.

And @all, meanwhile I've added something to the builder interface. If the CST stays as close to the AST as possible, I would suggest adding the formal parameter "extras" after the (also optional) "loc" argument for every builder interface function, like builder.newExpression(callee, args, loc, extras); or builder.callExpression(callee, args, loc, extras);. I added them after the codegen functions' loc arguments to make sure I pass all the extras objects in. The reason I add it after loc is simply not to break existing code using these interfaces, and maybe to let your future CST tools keep using their old modules.

I told myself that reading your definition again would help me figure the rest out, and I also got a link for more information; now that @jeffmo's proposal makes real sense to me, the others will too, as promised. I will report back later when I've figured out a more practical implementation; I've prepared the code generator module for it. Now I have to add the extras to the nodes and write the additional src += extras.property.before; lines of code between the other src += callBuilder(formals) pieces. Then (function /* 1 */ name () { /* 2 */ return 10; }).toString() should print function /* 1 */ name () { /* 2 */ return 10; }.

I've found what seem like the golden keywords for extras property names. If you wish to attach them next to a certain property, say to the left or right of the function's .id, just use funcDecl.extras.id.before/after -- or in general form, node.extras.node_property_name.before/after/inner. That naming convention seems like the best fit for any property on a node, because the same property name is repeated under extras.