getify / concrete-syntax-tree

Defining a standard JavaScript CST (concrete syntax tree) to complement ASTs

CST-AST flow #3

Closed getify closed 9 years ago

getify commented 10 years ago

[diagram: CST-AST flow]

This is in response to @Constellation's figures on another escodegen thread, which I hope get moved here.

Things to note that are different about my diagram:

  1. There is only a "Parser to CST". "Parser to AST" would be a convenience API method that simply called "Parser to CST" -> "CST to AST". It would make no sense to me to have a parser which either was duplicated (one set of logic for AST, one for CST), or just more complex at each step, having to figure out if it should or shouldn't keep the CST info. I think it always should.

    If your use-case for output of parsing is that you only want the AST, and don't care about the CST data, you strip it out with "CST to AST". You can do so manually, or you can call the "Parser to AST" helper method. Either way, there's just one parser: "Parser to CST".

    I think this greatly simplifies the work needed to modify existing parsers. They just make one change to start (always) producing CSTs, and provide a simple CST-to-AST stripping conversion step, and maybe a convenience method "Parse to AST" which just calls both.

    Also, it means that lots of existing tools which expect ASTs only don't have to change at all. If you have a CST and want to use such a tool, you use the CST-to-AST converter first. Could those tools add convenience methods? Sure. But they don't have to.

    This minimizes the amount of change needed to implement CSTs. (A sketch of the stripping step follows this list.)

  2. My approach assumes that all ASTs are CSTs (strictly, that ASTs are subsets of CSTs), but an AST is a not-terribly-interesting CST in that it's missing all the extras a CST would normally have. IOW, an AST is a minimal CST.

    This means any place that expects a CST (like linters, code formatters, etc) can either take an AST or a CST. Any place that requires an AST needs an AST only (not a CST), but that's OK because any place that produces CSTs (like parsers) will also provide the simple CST-to-AST converter, so it's trivial to get an AST from a CST and pass it wherever ASTs are expected.

  3. Escodegen (and any other code generator) changes to only expect a CST. If you pass in an AST, that's OK, because (as I said above), all ASTs are CSTs, so no big deal.

    If any tool (like escodegen or any other) wants to add default formatting like whitespace or comments or whatever, they take a CST (or an AST!) and add the "extras" annotations directly to it. A CST then is just an AST + extras.

    I think this also minimizes the complexity of Escodegen: it doesn't need to be two separate components, and it definitely doesn't need forked logic for AST vs. CST. It assumes a tree that will provide whatever extras it needs, it may add its own, and it doesn't care whether the tree passed to it was an AST or a CST (AST + extras). Either way, it uses extras if present, and continues normally if not. Simple.
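
To make the "CST to AST" stripping step from point 1 concrete, here's a minimal sketch, assuming the extras annotations and virtual nodes (like EmptyExpression) discussed later in this thread; none of this is a settled API:

// Hypothetical "CST to AST" stripper: drop extras annotations and any
// virtual placeholder nodes, leaving a plain AST.
function cstToAst(node) {
    if (Array.isArray(node)) {
        return node
            .filter(function (n) { return !(n && n.type === "EmptyExpression"); })
            .map(cstToAst);
    }
    if (node && typeof node === "object") {
        var stripped = {};
        Object.keys(node).forEach(function (key) {
            if (key !== "extras") stripped[key] = cstToAst(node[key]);
        });
        return stripped;
    }
    return node; // primitives (strings, numbers, null) pass through
}

// "Parser to AST" would then just be: cstToAst(parseToCst(source))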

millermedeiros commented 10 years ago

I think that CST being a superset of AST is the wrong approach. They have different purposes and what is important on the AST is not as useful on a CST.

CSTs are usually created during the parsing of the program (lexical analysis), and ASTs are generated afterwards based on the token structure (syntax analysis). So the real flow would be more like:

    source text
         +
         |
         v
 +---------------+
 |               |
 | parser to CST | tokenizer (lexical analysis)
 |               |
 +-------+-------+
         |
         v
  +-------------+
  |             | linters, code formatters, CST to AST, etc
  | CST modules |
  |             +---------------------+
  +-------------+                     |
         |                            |
         v                            v
  +-------------+              +-------------+
  |             | (escodegen)  |             |
  | CST to code |              | AST modules | esmangle, escope, AST to CST,
  |             | <----+       |             | etc
  +------+------+      |       +------+------+
         |             |              |
         v             +--------------+
   generated text

It would be possible to convert between both formats, but not strictly necessary. The amount of work needed to convert AST to CST should be basically the same as converting the AST directly to code (same work as escodegen currently does).

For me the value of CST is that it would simplify linters and code formatters. I don't think that we should limit ourselves because of existing tools that expect an AST as input. In some cases I can even see needing both structures to be able to reason about the program.

This explanation about CST vs. AST should give a better idea of the differences between the formats: http://eli.thegreenplace.net/2009/02/16/abstract-vs-concrete-syntax-trees/

in my opinion the CST is an enhanced token list, and not an enhanced AST.

getify commented 10 years ago

in my opinion the CST is an enhanced token list, and not an enhanced AST.

I responded in depth here why your conception (while great in spirit) fails an important set of use-cases that I'm driving after:

https://github.com/getify/concrete-syntax-tree/issues/2#issuecomment-38344545

Constellation commented 10 years ago

Hi, all. I've moved from escodegen's thread.

In my opinion, a CST isn't equal to an AST. I agree with @millermedeiros on this point. CST and AST have different goals. An AST is useful because it captures only semantic changes. For analyzing semantics, a CST is too low-level. It isn't suitable for that use.

In the partial-rewriting case, if the partial rewriter uses a CST directly, it's too complicated: the rewriter has to take care of parentheses, whitespace, etc. to generate valid JS code. That's similar to modifying the source text directly. I think the best way to create a partial rewriter is:

  1. generating CST
  2. generating AST from CST
  3. connecting AST and CST
  4. providing AST to users
  5. detecting AST changes
  6. generating the result CST from the original CST and modified part of AST

Here's my proposal:

[diagram: proposed layered CST/AST flow]

In this diagram, AST and CST aren't the same. Explicit AST-to-CST / CST-to-AST conversion is necessary. The AST and CST modules compose a layered architecture. If you intend to analyze semantics (not textual differences), the only thing you need to care about is the AST, and you don't need any knowledge of the CST. Modifying a CST is unintended work for AST analyzers.

It is strictly a failure of this project if many of the popular tools in the space think that what we come up with is "not worth it" or "not appropriate". A different tree for some and not for others is going to make life harder on developers who need to get tools to interact, not easier.

CST and AST have different goals. For modules such as escope and esmangle that intend to analyze semantics, a CST is not suitable. On the other hand, if we create a tool such as JSHint, we need both CST and AST. A CST is useful for analyzing textual information, which is dropped in an AST. And an AST is useful for analyzing semantics. The areas in which AST and CST are used don't conflict.

And a code generation module should accept a CST to provide fine-grained control. But if you intend to analyze semantics (not textual meaning), handling a CST is overkill and too low-level a representation. If you don't need to control the textual form of the output (if all you need is to generate valid JS), you can use AST-to-CST conversion and the CST codegen.

getify commented 10 years ago

I think that's a crazily more complicated way of doing things. Having to keep an AST and a CST in parallel and using each? That's literally the worst possible outcome of this effort, IMO.

Moreover, I already have code that is doing what I suggested, which is modifying a tree that's an AST but that also has the CST extras in it, and it's absolutely NO different than modifying an AST without the extras. So I don't know what you could possibly mean by that being more tough. What you suggest is WAY more difficult to do, in my experience.

Constellation commented 10 years ago

I think that's a crazily more complicated way of doing things. Having to keep an AST and a CST in parallel and using each? That's literally the worst possible outcome of this effort, IMO.

For example, simply creating a Map<ASTNode, CSTNode> works nicely (or appending ASTNode.prop = CST). And actually, the recast module does this.
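
For illustration, a sketch of that side table (not recast's actual API, just the shape of the idea):

// Keep the CST on the side, keyed by AST node identity.
var cstFor = new Map();

function link(astNode, cstNode) {
    cstFor.set(astNode, cstNode);
    return astNode;
}

// A code generator can then ask whether an AST node still has concrete
// syntax attached, and fall back to default formatting when it doesn't.
function concreteSyntaxOf(astNode) {
    return cstFor.get(astNode) || null;
}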

Moreover, I already have code that is doing what I suggested, which is modifying a tree that's an AST but that also has the CST extras in it, and it's absolutely NO different than modifying an AST without the extras. So I don't know what you could possibly mean by that being more tough. What you suggest is WAY more difficult to do, in my experience.

Hmm, I can't see how a CST works correctly as an AST with extras. For example, if you take a CST/AST from a + b + c, you'll get the following AST.

{
    "type": "BinaryExpression",
    "operator": "+",
    "left": {
        "type": "BinaryExpression",
        "operator": "+",
        "left": {
            "type": "Identifier",
            "name": "a"
        },
        "right": {
            "type": "Identifier",
            "name": "b"
        }
    },
    "right": {
        "type": "Identifier",
        "name": "c"
    }
}

And if AST users modify the 3rd line's operator to "*",

{
    "type": "BinaryExpression",
    "operator": "*",
    "left": {
        "type": "BinaryExpression",
        "operator": "+",
        "left": {
            "type": "Identifier",
            "name": "a"
        },
        "right": {
            "type": "Identifier",
            "name": "b"
        }
    },
    "right": {
        "type": "Identifier",
        "name": "c"
    }
}

And when generating code from this, AST users expect the result to be (a + b) * c. But if it is treated as a CST, since there's no parentheses information, the result becomes a + b * c, which is completely broken. How do we prevent that? If even a simple rewriter must consider operator precedence, correct whitespace, etc., I think it's crazily complicated and not composable.

bterlson commented 10 years ago

Does anyone have an idea of what kind of overhead is implied by always parsing to CST and then down-converting to AST? Seems like it could be significant in large programs which would put pressure on parsers to produce AST directly anyway for semantic analysis purposes (purposes which are often perf sensitive).

jrootham commented 10 years ago

@Constellation Do you really have a real example of a tool that does that? I would have thought the normal flow there is change source -> CST -> AST. Also it seems clear that an AST->CST transformation is fundamentally not possible.

If a tool outputs AST anything downstream needs to assume AST.

jrootham commented 10 years ago

My attack on the performance problem is to assert that CSTs are the ground truth and are persistent. Parsing only happens once (on import); editing manipulates the CST. Pretty printing is a CST->source transform.

I understand the horrible violence this does to the current assumptions about how things work, and I don't think this project is going there.

Constellation commented 10 years ago

@jrootham

If AST users don't intend to gain fine-grained control over the output text, AST-to-CST conversion can provide good default code generation behavior. In this way, AST-to-CST provides coarse-grained control over the output text.

For example, current escodegen does exactly this. It takes an AST and generates source code. It only guarantees that parseToAST(code) structurally equals parseToAST(escodegen(parseToAST(code))).
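
Written out as an executable property (parse and codegen stand in for whatever parser / generator pair is being tested):

// The escodegen guarantee: regenerated code re-parses to a structurally
// equal AST, even though the original text (whitespace, parens) may differ.
// JSON.stringify is a crude but deterministic structural comparison here.
function preservesStructure(parse, codegen, source) {
    var before = JSON.stringify(parse(source));
    var after = JSON.stringify(parse(codegen(parse(source))));
    return before === after; // structural equality, not textual equality
}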

getify commented 10 years ago

@Constellation

For example, if you take a CST/AST from a + b + c, you'll get the following AST. [snip] And if AST users modify the 3rd line's operator to "*", [snip] And when generating code from this, AST users expect the result to be (a + b) * c. But if it is treated as a CST, since there's no parentheses information, the result becomes a + b * c, which is completely broken.

No, this is all a fundamental misunderstanding of what I mean by a CST being a superset of an AST.

a + b + c

would, in my proposal, produce a CST like this, or something similar (see how it's a normal AST + extra annotations):

{
    "type": "BinaryExpression",
    "operator": "+",
    "left": {
        "type": "BinaryExpression",
        "operator": "+",
        "left": {
            "type": "Identifier",
            "name": "a",
            "extras": {
               "after": " "
            }
        },
        "right": {
            "type": "Identifier",
            "name": "b",
            "extras": {
               "after": " "
            }
        }
    },
    "right": {
        "type": "Identifier",
        "name": "c",
        "extras": {
           "before": " "
        }
    }
}

Then, if you go make some transformation that turns the operator into a *, you get this tree:

{
    "type": "BinaryExpression",
    "operator": "+",
    "left": {
        "type": "BinaryExpression",
        "operator": "*",
        "left": {
            "type": "Identifier",
            "name": "a",
            "extras": {
               "after": " "
            }
        },
        "right": {
            "type": "Identifier",
            "name": "b",
            "extras": {
               "after": " "
            }
        }
    },
    "right": {
        "type": "Identifier",
        "name": "c",
        "extras": {
           "before": " "
        }
    }
}

...which still has all the same extras information in it as it did before the transformation. And if the transformation you're doing is more complex (dropping nodes, recreating them differently, etc.) and causes some of those extras annotations to be lost, that's actually not a terrible thing (and is kinda expected), because it's quite possible the transformation invalidated the previous whitespace assertions anyway.

That above tree would produce this code when generated back out:

(a + b )* c

Now, notice that in this case, it's slightly less desirable that the space comes between b and ) instead of between ) and *. It's not terrible, and I don't think people would necessarily consider it a deal breaker as the generated output of a transformed program. But a more intelligent transform could notice this nuance and produce this tree after the transform instead:

{
    "type": "BinaryExpression",
    "operator": "+",
    "left": {
        "type": "BinaryExpression",
        "operator": "*",
        "left": {
            "type": "Identifier",
            "name": "a",
            "extras": {
               "after": " "
            }
        },
        "right": {
            "type": "Identifier",
            "name": "b"
        },
        "extras": {
           "after": " "
        }
    },
    "right": {
        "type": "Identifier",
        "name": "c",
        "extras": {
           "before": " "
        }
    }
}

...which could then produce:

(a + b) * c

...exactly as one might expect. Of course, a tool doesn't have to do that extra nuanced step. Tools would have the option of being more or less intelligent as they saw fit.

The point is that the CST-as-superset-of-AST is totally a workable tree format for simultaneously tracking extras AND allowing meaningful parsing and transformation. These are not mutually exclusive concepts in my proposal.

Constellation commented 10 years ago

@getify: It seems that there's no parentheses information in this CST. In this representation, I think if ((i = 20)) cannot be represented. So, could you show me the CST generated from (a + b) + c?

getify commented 10 years ago

@bterlson

Does anyone have an idea of what kind of overhead is implied by always parsing to CST and then down-converting to AST? Seems like it could be significant in large programs which would put pressure on parsers to produce AST directly anyway for semantic analysis purposes (purposes which are often perf sensitive).

I'm quite skeptical that there are many tools, if any, so strictly brittle that they would choke upon seeing nodes like I suggest in the tree above, which just have extra attributes like extras hanging off them. So it really isn't the case that all of that would necessarily NEED to be stripped.

The only stuff that would have to be stripped, which I think would in general be a small subset of all the CST info, is any of the virtual nodes (which such AST-only tools wouldn't understand), like for instance the EmptyExpression node that might sit inside an empty params element of a function declaration.

And again, that virtual EmptyExpression placeholder node would only be in the CST if there were actually some extras (whitespace, comments) in that location, so it would appear even less frequently than "always".

So, the CST-to-AST down-conversion may seem like it would be an expensive tax, but I highly doubt that, in practice, it's going to make any noticeable difference, and it's certainly not such a pain that it outweighs all the benefits CSTs bring.

Of course, if a parser tool did have that pressure to be as stripped down as possible, it could have an option to "suppress CST tracking info" while parsing, and then what it produced would be an already-stripped AST. But I doubt there'd really be anything more than theoretical "demand" for such a thing. It's not outlawed by my approach, but I think in practice it's probably unnecessary premature optimization.

bterlson commented 10 years ago

@getify Just to clarify, I didn't claim any tools that consume ASTs today couldn't consume a CST. I was just wondering what, in practice, the overhead would be to produce a CST and then toss away (or ignore, if you prefer) much of the information vs. producing an AST as today. How much demand there is for straight-to-AST parsers depends on how much overhead producing a CST introduces. So I was wondering if anyone had any idea how much overhead we're talking about, since clearly there will at least be some.

Edit: to say why I care, I frequently do static analysis across a massive corpus of JS files crawled from the web and parse performance is linked to my compute costs :)

getify commented 10 years ago

@Constellation

So, could you show me the CST generated from (a + b) + c?

As I hinted earlier in this or some other thread, there are definitely cases where the CST parts of the tree need to track more than just whitespace or comments, things that the standard AST parts of the tree do not.

Unnecessary (according to the grammar) ( ) are a perfect example of an "extra" that needs to be tracked. Turns out this is exactly as trivial as tracking whitespace (which the grammar also treats as unnecessary).

Here's my proposed tree for handling that:

{
    "type": "BinaryExpression",
    "operator": "+",
    "left": {
        "type": "BinaryExpression",
        "operator": "+",
        "left": {
            "type": "Identifier",
            "name": "a",
            "extras": {
               "after": " "
            }
        },
        "right": {
            "type": "Identifier",
            "name": "b"
        },
        "extras": {
           "before": "(",
           "after": ") "
        }
    },
    "right": {
        "type": "Identifier",
        "name": "c",
        "extras": {
           "before": " "
        }
    }
}

Note: The combined ") " annotation could just as easily be said to be correctly represented as [ ")", " " ], an array of each extra token, if it turns out that some tools would like to process some extras, but not others. We probably don't want to have those tools re-parse the extras string. Keeping those extras as separate strings makes it trivial to look at the array of extras and find what you care about.


The algorithm for producing this kind of tree would be, roughly:

  1. parse the full unfiltered token stream
  2. whenever you find anything that the grammar says is unnecessary and would otherwise be discarded, whether that be whitespace, comments, ( ) pairs, ;s, trailing commas in a list, etc, create an extras annotation for what was found.

It's really just that simple. The "hard" part glossed over, which is really not that hard as it turns out, is knowing at which node of the tree to add the extras. Any given extra can often be validly represented at multiple levels of nodes, depending on circumstances.

In my testing, this is not that hard, but it takes a little care.

For example, in that above tree, I represented the ( ) pair in before/after tokens (respectively) of the BinaryExpression node. I think that makes most sense. But I could have represented a ( as the before of the a node, and a ) as the after of the b node, and ended up with the same code generation.

So yes, there are multiple different CST representations for the same code. That's not a problem though, because the code generator just blindly prepends before and blindly appends after for every single node, whatever node it may be, and expects that the CST it's given is totally valid.
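
A rough sketch of that blind generator, covering just the node types in the examples above (string-valued extras assumed; a real generator would also insert grammar-required parentheses):

// Minimal generator for the trees above: every node is rendered as
// extras.before + node text + extras.after, whatever the node is.
function gen(node) {
    var before = (node.extras && node.extras.before) || "";
    var after = (node.extras && node.extras.after) || "";
    var body;
    switch (node.type) {
        case "BinaryExpression":
            body = gen(node.left) + node.operator + gen(node.right);
            break;
        case "Identifier":
            body = node.name;
            break;
        default:
            throw new Error("not covered in this sketch: " + node.type);
    }
    return before + body + after;
}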

So how do parsers avoid ambiguity or (worse) not agreeing on where to put extras? What would happen if Acorn produced a different CST (same AST) as Esprima?

  1. I don't think this is actually as bad as it may sound.
  2. If it is bad, and we need to avoid it, it's easy to do. We state a heuristic rule, like "always put the extra on the highest-level, earliest CST node where prepending/appending it will result in proper output." So, bam, no more ambiguity.

In short, I think our specification proposal can say what extras can exist, and where they should end up annotated in the tree. Then parsers just have to adhere to the spec! Problem solved. :)

millermedeiros commented 10 years ago

I doubt that we will be able to represent all edge cases with just "before" and "after", especially when you add comments and/or multiple tokens.

There are also cases where {} is optional, and it's way harder to detect it if you don't have access to the tokens (you need to check if a child/body is a BlockStatement). And each node has a unique structure, which makes it really hard to create generic methods to handle all cases (see if/else/else if/while).

Constellation commented 10 years ago

For example, in that above tree, I represented the ( ) pair in before/after tokens (respectively) of the BinaryExpression node. I think that makes most sense. But I could have represented a ( as the before of the a node, and a ) as the after of the b node, and ended up with the same code generation.

Hmm. It seems that this leaves the system broken. For example:

{
    "type": "BinaryExpression",
    "operator": "+",
    "left": {
        "type": "BinaryExpression",
        "operator": "+",
        "left": {
            "type": "Identifier",
            "name": "a",
            "extras": {
               "before":"(",
               "after": " "
            }
        },
        "right": {
            "type": "Identifier",
            "name": "b"
            "extras": {
               "after": ") "
            }
        }
    },
    "right": {
        "type": "Identifier",
        "name": "c",
        "extras": {
           "before": " "
        }
    }
}

And when a rewriter replaces a with a newly created node,

{
    "type": "BinaryExpression",
    "operator": "+",
    "left": {
        "type": "BinaryExpression",
        "operator": "+",
        "left": {
            "type": "Literal",
            "value": 42
        },
        "right": {
            "type": "Identifier",
            "name": "b"
            "extras": {
               "after": ") "
            }
        }
    },
    "right": {
        "type": "Identifier",
        "name": "c",
        "extras": {
           "before": " "
        }
    }
}

The codegen will generate 42 + b) + c.

I think this is because the CST doesn't have the structural information that "parentheses should be paired". I'm skeptical that extras can capture all CST information. And this information is not necessary for AST users. That's why I think the AST should be distinguished from the CST.

getify commented 10 years ago

I doubt that we will be able to...

Have you read my proposal and the background on how I got to it? Doubting in nothing but a general sense is not particularly useful. All these things you keep bringing up were things others brought up, and they're all things I've thought of and explored. Rather than trying to shoot down my proposal with generalities and doubt, I'd appreciate it if we could explore it to its fullest extent. I'm not a 100% grammar expert, but I've spent quite a bit of time studying this topic now, and I'm not aware of any cases that I've missed in those previous explorations.

And even if we do find a single specific example that needs to be worked around or improved, that doesn't mean the entire proposal is bust. I think I've demonstrated that it substantially (or completely) covers the vast majority of cases, so that would have to provide a reasonable place to consider working from.

I understand (and appreciate) your reluctance based on personal taste, but I don't think you're providing any "concrete" substantial argument why the proposal is outright insufficient.

To be specific:

My proposal calls for before, after, and inside, which covers the vast majority of cases I studied in the grammar. There were only a few places where the "virtual node" concept was needed to provide a node we could hang extras off of. The clearest example was the empty params list on a function declaration, which might well need a virtual placeholder CST node inside it. But these were pretty few in my grammar explorations. before, after, and inside covered better than 95% of the cases.

FYI: the other major proposal from our previous explorations called for a similar approach (annotating extras in the AST tree), but not quite in the same way I am suggesting. It suggests inventing additional grammar terms besides before, after, and inside, like for instance inEmptyParamList. The main benefit is that this proposal avoids the need for any virtual nodes, but the downside was that we found there might need to be as many as 10 different possible positional names across the various differences in the grammar, which (I think) needlessly complicates producing and consuming extras annotations.

But, regardless, BOTH proposals considered and covered the extent of the grammar, and BOTH are demonstrably capable of representing extras in all the places where it's needed.

If we do find some small niche case(s) where it's not currently capable, I'm pretty sure EITHER proposal is easily extended/modified in a small trivial way to address that.

it's way harder to detect it if you don't have access to the tokens

Who said anything about not having access to the tokens? You have the tokens when you create the tree. It's perfectly unambiguous at tree construction time whether you need a { } pair or not. So you either add the BlockStatement node or you don't.

If the source code has a { } in one of those (rare) positions, but you don't need the BlockStatement node, because the grammar says it's unnecessary, then my suggested algorithm calls for it to instead be represented as an "extra", like "{}" for instance, so as to prevent information loss (by the very definition of a CST, if the source code has it, we don't want to drop it).

If there were ever a rare case where a transformation made a BlockStatement necessary that wasn't before, we still wouldn't have information loss, because the transformation could elevate that "{}" extra to a real BlockStatement node. Vice versa, if a required BlockStatement node became optional, the transformation could easily remove the node and put in a "{}" extra annotation instead.

Moreover, since all current tools are only using ASTs (none of them are passing around both ASTs and token lists), I don't think there's any evidence that we have to start preserving token lists alongside the trees.

getify commented 10 years ago

And when a rewriter replaces a with a newly created node,

You're assuming a rewriter would be "dumb" enough to just discard all extras. I think the safer thing, when replacing a node, would be to preserve the extras from the node you replaced. That's what I'd call an "intelligent" transform process.

I have already written code which does these sorts of things (preserving extras on transformations) and it's producing pretty reasonable (strictly grammar correct) code. There've only been a few places in my transformations where the safe path of preserving extras across transformation produced slightly uglier code, like leaving in an extra space that's now unnecessary, but even those are things that "intelligent" transforms CAN detect and fix.
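
As a sketch, that "preserve the extras" rule is a one-line discipline in a hypothetical replacement helper:

// Hypothetical helper: replace parent[key] with newNode, carrying over the
// old node's extras so surrounding whitespace/comments/parens aren't lost.
function replaceNode(parent, key, newNode) {
    var oldNode = parent[key];
    if (oldNode && oldNode.extras && !newNode.extras) {
        newNode.extras = oldNode.extras;
    }
    parent[key] = newNode;
    return newNode;
}

In the tree above, replacing the a identifier this way would carry its extras (including the "(") onto the new Literal node, so the generated output keeps balanced parens.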

I'm skeptical...

Again, that sort of broad skepticism without actual demonstration is not particularly helpful. I'd say the same thing here as I did in my previous reply: I understand distaste for the approach, that's your own personal opinion. But let's just stick to the facts. And moreover, let's quit re-hashing things which have already been explored and are demonstrably not an issue.

If you can prove via code that the proposal is strictly insufficient, let's see that. But if you just "doubt" that it can be done, I'm pretty sure I can show you code that proves it can be done.

I have a number of tree-transformation tools which I've been working on over these last 6+ months since the discussion first started, and I decided to implement my proposal to see if it really would work, and I haven't run across any cases where it was flat out insufficient.


I'd also like to point out that all the current objections being thrown out (in addition to having already been discussed) apply, in various different ways, to the alternate proposals being advocated as well.

It is a tough problem to preserve concrete syntax across transforms, and that's true no matter what tree or token-list (or both) you're dealing with. The reason this whole project exists is because it's been "tough enough" that no one has pushed through and solidified the solution yet. That's what we're trying to do. And we're all smart people who can figure out how to write smart code to solve tough problems.

I'd appreciate a tone shift in the discussion away from skepticism and doubt to optimism and determination to solve.

getify commented 10 years ago

And this information is not necessary for AST users.

That's not a relevant metric (nor is it even objective) for our discussions. Of course that information isn't necessary for AST users; otherwise no parsers would ever have been able to work before. CSTs are adding extra information that was previously discarded, because a new set of tools needs new information.

It's a failure of this process if people just say "well, if those new tools need more info, they can go get their own data". The spirit of what we're trying to create is a unified IR that serves both use-cases in a reasonable and adoptable manner.

If we don't get that, we fragment the JS tools community more, and make it harder for complex use-cases to weave together various tools. That's what we're trying to avoid. We want to make it simple for all the tools to be used together as necessary for someone's task.

I could have, 6 months ago, just gone and written my own parser from scratch, to solve my problems, and had it keep around the extras. But that would have been a failure for the JS tools community. It's totally unnecessary effort. We can, and I believe we should, find a compromise that serves all the tools and use-cases, not just the existing ones.

Constellation commented 10 years ago

@getify

You're assuming a rewriter would be "dumb" enough to just discard all extras. I think the safer thing, when replacing a node, would be to preserve the extras from the node you replaced. That's what I'd call an "intelligent" transform process.

So a rewriter should recognize that the tree is a CST, and AST rewriters only accept an AST (with extras dropped). Correct?

If you can prove via code that the proposal is strictly insufficient, let's see that. But if you just "doubt" that it can be done, I'm pretty sure I can show you code that proves it can be done.

And I've also shown that my proposal works fine: https://github.com/getify/concrete-syntax-tree/issues/3#issuecomment-38353295. But I haven't received any objections. I should note my proposal's pros / cons.

Pros

Cons

Again, that sort of broad skepticism without actual demonstration is not particularly helpful.

I think this is necessary because we should choose the better format proposal in order to propose a good standardized CST. I believe we need to show each proposal's pros / cons. What do you think?

If "CST is AST with extras property" proposal is already settle, this discussion is not necessary. But I remember that we still don't settle on CST is AST with extras property proposal. Is it right? https://github.com/Constellation/escodegen/wiki/CST-Proposals Since I'm on the latter proposal, I need to say "There's an alternative proposal and there's pros/cons".

millermedeiros commented 10 years ago

when I say I'm skeptical, I mean cases like this:

/* 0 */
function /* 1 */ foo /* 2 */(a, /* b, */ c /*, d */) /* 3 */{ /* 4 */
 /* 5 */
} /* 6 */
/* 7 */
var bar = /* 8 */ (a + (b /* 9 */ * c)) /* 10 */; /* 11 */
/* 12 */

it's hard to reason about the position of the comments and whitespace, and about the structure of the program, if the tokens are not in sequential order. It's also very ambiguous whether comments /* 6 */ and /* 7 */ are after the function declaration or before the variable declaration...

also things like:

while(bar( /* empty arguments, where to attach comment */ )) /* empty body, where to attach this? */;

for a code formatter I can assure you that a tree structure like this would not make the work any simpler; especially since a formatter is supposed to only edit whitespace, line breaks and indentation. Imagine the case where you need to increase/decrease indentation inside () or {}, or where you need to make sure you add a space after ) only if the next token is not a ; or a line break...

In my opinion what you are describing is not a CST; parse trees usually have tokens as the leaf nodes (every single leaf node is a token), and the branches/nodes in between are what describe the structural context (grammar).

If you need to add new tokens during the conversion back into a string, then it is not a CST. An AST, for instance, doesn't say that a CallExpression contains (), or that the arguments are separated by commas, line breaks and/or whitespace, or whether an object contains a trailing comma or not.

I think the goal should be to create a format that is easy to work with - easy to manipulate and to rebuild the code from. Making it backwards compatible with existing AST tools should not be a requirement.

millermedeiros commented 10 years ago

In my mind the CST would treat every leaf node as a token, so for use cases like instrumentation of code you could simply insert a new node like:

{
  type: 'Program',
  child: [
    {
      type: 'Identifier',
      value: 'foo',
    },
    {
      type: 'Punctuator',
      value: '('
    },
    {
      type: 'Punctuator',
      value: ')'
    },
    {
      type: 'Punctuator',
      value: ';',
    },
    {
      type: 'Instrumentation',
      value: ' console.log("executed foo() line!");'
    }
  ]
}

outputs:

foo(); console.log("executed foo() line!");

The CST tools could be dumb about the node types and only care about the value, increasing flexibility and making it easier to rebuild the program.

getify commented 10 years ago

Everyone, please go read my proposed Charter/Mission in #4

joeedh commented 10 years ago

I'm all for CST being an augmented AST type. Why? ES6. I implemented a small class system in ES5.1, which I used in a large project of ~35k lines of code. When I added ES6 classes to my in-house ES6->ES5.1 compiler, I was puzzled at how to transform the existing class code into ES6 classes. Conceptually it's simple: just do the ES6 class -> es5.1 object transformation in reverse. The only problem is, what do I do with comments? My AST tree doesn't track them. I ultimately ended up doing the transformation by hand, but I've been running something like Getify's vision of a CST structure through my head ever since.

edwardgerhold commented 10 years ago

About the BinaryExpression example: I can tell that when evaluating a BinaryExpression whose left node is another BinaryExpression and whose right node is an Identifier, the left node is treated like a ParenthesizedExpression and evaluated first, before being combined with the Identifier on the right. To generally fix the precedence of an existing subtree you can simply rotate the nodes. To record the parens that really exist in the source, I would either go with the extras properties, which I find very good, or mark the containing expression node with .parenthesized = true, which wraps the whole expression and its subexpressions in parens. I implemented an extra ParenthesizedExpression function for my purposes. There is this CoverParenthesizedExpressionAndArrowParameterList production, where I first scan the tokens from ( to ) (checking nested parens with a stack) and then, on encountering a =>, pass it to ArrowParameterList, or otherwise to ParenthesizedExpression. Unfortunately my function returns an Expression node instead of a fresh ParenthesizedExpression node. I have already read that Constellation proposed such a node. Maybe it's the best option; it can't be overlooked. And we should be familiar with wrapping expressions since "ExpressionStatement".

edwardgerhold commented 10 years ago

Let's figure out how to implement it. I lex a "function", then a comment comes. OK, that's all in the token stream; I do not throw them away. When building the CST I am at FunctionDeclaration, because I detected function. Now I trigger next(). A comment appears. What now?

The first idea is to manually check the input each time. That's more work than necessary. The second is better: for example, next() now finds an "extra" token. It should be added to the current node. For that, the current top-level node, the function declaration (not its .id Identifier or other contained symbols), should be on a stack; maybe call it currentScope. The nextToken() function encounters the extra token and calls add_extra_token(). That subroutine takes the extra token and adds it to the extras property of the currentScope or the currentNode.

I notice I am not leading to a full solution here, because for now it's unclear to me where to put the comment. The FunctionDeclaration.extras property is maybe not the perfect place for the comment after the function keyword in function /* comment */ name (), because it is part of the declaration node, which still swallows keywords into its .type abstraction. But in name.extras.before (the identifier node) I could save the content: { type: "FunctionDeclaration", id: { type: "Identifier", name: "name", raw: "name", extras: { before: [ { type: "MultiLineComment", value: " comment ", loc: ... } ] } } }. The MultiLineComment should put exactly the bytes between /* and */ into the value.

Writing the parens and spaces from ( 1+2+3+4 ) is easier. You start at the left binary expression and save the paren as a punctuator token and the whitespace as a whitespace token in the extras.before field of the binary expression. Going right down the tree, I think it builds a line down the right branch, right? At the end you add to extras.after: [ whitespace, punctuator ].

The extras fields should be arrays, to take arbitrary numbers of tokens. You save what the lexer lexes. They come into the extras field as tokens, not as letters or values, I would say.

For catching whitespace I have meanwhile formed an opinion: collect all whitespace of the same code point into one token. " \t\t" would be two whitespace tokens, one with the space and one with the tabs. You have value.length for the number of characters in the whitespace token, and you can be sure that they are of the same category.

Maybe the same goes for LineTerminators, but also collapsing \r\n into one LineTerminator, if that doesn't mean undecidability for \r\r\r\r or \r\n\r\n. I don't know how often \r\r appears, and whether a double check or a windows_lineterminator flag on the line terminator could help. That's a little open question.

Pseudo code for interfacing in the next function, doing it cheaply by checking token.type against one [[Get]] returning true or undefined:

var extra_types = { "WhiteSpace": true, "LineComment": true, "MultiLineComment": true };

function next() {
    token = tokens[++pos];
    if (extra_types[token.type]) {
        add_extra_token();
        return next();
    }
    return token;
}

Uh, I should bring this up here: it's about navigability of the CST. Processors should be able to traverse the syntax tree upward, either by a parent property (which ruins JSON.stringify with its cyclic structure) or by a unique_id field plus a table[unique_id] = node link, where you can access nodes in constant time; a parent field could then contain the unique_id. I would need this, for example, to refer from data-attributes in a syntax highlighter back to the parsed input nodes, so that one could show partial results when moving over the code elements. And I first wished for parent pointers when I wanted to resume a generator with a visitor recording the node where we stopped; going up and going right would be cool for animated code analysis and such — you could just walk where you want to. In the current AST there is only one way along. I'm bringing that up here so I don't need to spend more comments on it.

Edited: I am stupid. The FunctionDeclaration node is the substitute for the "function" keyword itself, so adding function /* comment */ to funcDecl.extras.after would mean after the function keyword. [[Doing the same around the {} would be for the function body's properties, where the arrays currently don't have .types or other properties. It's bad advice to use for-in on arrays anyway; that's why I would give them attributes. The FormalParameterList .formals could save the type, and the parens with their whitespace, after giving the arrays properties.]]

Second: I made another mistake when writing down the parens for ( 1 + 2 + 3 ). The operator should be the middle of the node, so the left and right parens hang on the extras.before of the outer-left number and the extras.after of the outer-right number. Third: if that's too difficult, I could imagine an .outerBefore/.outerAfter property for the binary expression, meaning not before or after the operator, but before/after the complete contained expression.

Fourth (last for today; I can repeat that I am stupid, because I did not realize this): you can add the pieces like parentheses or other punctuators almost manually, at the points where you compare the lookahead against these signs. You know, there and then, which parse_ function you are in, and you see where the sign belongs. I didn't realize that I would sit in front of the screen reading code like if (value === "(") { (or if (lookslike(LPAREN)) or whatever). We just have to change these lines (they are a lot, but not such a lot in reality) to if (value === "(") { add("(", node, before); ... continue; } and the punctuator goes into the right place. I can be sure I know where I put my sign when editing parseNewExpression, or where the parens belong when editing parseParenthesizedExpression or parseCallExpression. The whole AST is heterogeneous and special for each node, and each function has to be edited on its own, so this should be done the same way. The point is, I noticed that we already know where to put the signs when editing the parse function. (Seems like you should write down, when setting up the paper, where you put them in which node. Maybe there are general cases, but writing it up for each node is probably most consistent with writing the parser itself. Or?)

edwardgerhold commented 10 years ago

Reducing complexity a little, and the need for manual editing. Still not perfect — yesterday's idea would blow the parser up — but it leads to a more automatic and better algorithm.

  1. First approach

CST parsing, whitespace:

Tokenizer: while (whitespace of the same code unit) append.

Parser: handled in the next function, mapped to the current object in scope. Idea: a flag switches direction between before and after — switch to the before state, pass the token, switch to the after state.

CST parsing, line terminators:

Tokenizer: while (line terminator of one category) append; special case: \r\n with two characters.

Parser: the direction flag shows where to put them on the current node.

Data structure: where to put before and after? Create before and after only if you need to. You cannot keep a current-node pointer before the node is parsed, but you can keep before and after in single variables; with that, it's also already decoupled for common patterns.

Has a state variable where after === !before, or not — see the next approach.

  2. Second approach

One buffer variable. Put all extras in it. Flush before / after a node (assign, and replace the buffer with a new one). That's it.

Remembering: the before-space of one token is the after-space of its precedent. You can flush the buffer to the after property, or defer it to the next token's before space; there is a barrier to enter between both.

Decorators: a "decorator" could take a before action and an after action. If the buffer contains data and your node configuration says "flush", then do so.

Or the match and next functions could be changed to do the flushing.

The maximum allowed change from a regular parser to trigger the concrete-syntax actions should be an if (withExtras) {} where the code is needed; everything else adds up far too much.
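
A minimal sketch of this buffer idea (names made up; tokens are assumed to carry a type):

// Extras accumulate in one buffer while scanning, and are flushed onto
// whichever node is completed next.
var extraBuffer = [];

function nextToken(tokens, state) {
    var t = tokens[state.pos++];
    while (t && (t.type === "WhiteSpace" ||
                 t.type === "LineComment" ||
                 t.type === "MultiLineComment")) {
        extraBuffer.push(t); // collect extras instead of discarding them
        t = tokens[state.pos++];
    }
    return t;
}

function flushBuffer(node) {
    if (extraBuffer.length) {
        node.extras = { before: extraBuffer }; // pending extras land "before" this node
        extraBuffer = [];
    }
    return node;
}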

getify commented 10 years ago

@edwardgerhold The heuristic I suggested was that the extra should always be added to the earliest node, and highest level node, in the tree, where it was still valid, so it would pick after on the previous node rather than before on the next node. This is sorta a greedy algorithm which swallows the extras as "early" as possible when they are encountered.
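
In code, that greedy rule might look like this (a sketch; pending is the list of extras seen since the last node was finished):

// Prefer "after" on the node just completed; fall back to "before" on the
// upcoming node only when nothing precedes (e.g. a leading comment).
function attachPending(pending, prevNode, nextNode) {
    if (!pending.length) return;
    var target = prevNode || nextNode;
    var slot = prevNode ? "after" : "before";
    target.extras = target.extras || {};
    target.extras[slot] = (target.extras[slot] || []).concat(pending);
    pending.length = 0; // consumed
}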

I think that simplifies things a lot. Do you have any examples from your analysis where that wouldn't work? I hadn't thought of any yet.

edwardgerhold commented 10 years ago

Ah, very good. I was also coming to the conclusion that it's just a flow from left to right, and there is no before and after. So we would first read all prepended extras in "Program", then start with the first node in the list, collect the extras, flush the buffer, come to the next node, flush the buffer, collect the next extras, next node, flush, and so on. So it would do the same linear parsing as before.

I was with my rabbit Pünktchen at the veterinarian all week (he had an operation on his chin), and while sitting in the waiting room today, I fiddled around in my parser for twenty minutes. I wanted to check out this:

/* 1 */ function /* 2 */ named /* 3 */ ( /* 4 */ a /* 5 */ , /* 6 */ b /* 7 */ ) /* 8 */ { /* 9 */ return /* 10 */ ; /* 11 */ } /* 12 */

I wanted to find out what I would have to add to my parser to get this working without making the parser worse.

    /* tokenizer changes, just for the whitespace */
    var withExtras = true; // just capture whitespace and comments.
    function WhiteSpace() {
        var spaces = "";
        var spc;
        if (WhiteSpaces[ch]) { // ch is lookahead(0) and lookahead is lookahead(1) in my ol' baby
            spaces += ch;
            spc = ch;   // + added
            while (lookahead === spc) { // collect same characters only
                next();
                spaces += ch;
            }
            return pushtoken("WhiteSpace", spaces, undefined, undefined, undefined, true); // I am sorry for this. Please overlook these formals :-)
        }
        return false;
    }

    /* parser changes - */
    var withExtras = true;  // enable CST parsing
    var extraBuffer = [];   // here the whitespaces and comments between two nodes will land

    function flushBuffer() { // node.extras = flushBuffer(); empties and creates new
       var b = extraBuffer;
       extraBuffer = [];
       return b;
    }
    function intoBuffer(t) { // interfaces are costly
       extraBuffer.push(t);  // one should inline the .push if it's finalized.
    }

    /* the parser's next function pushes the extras into the buffer */

    var captureExtraTypes = {
        __proto__: null,
        "WhiteSpace": true,
        "MultiLineComment": true,
        "LineComment": true
    };
    var captureExtraValues = {
        __proto__: null,
        "(": true,
        ")": true,
        "[": true,
        "]": true,
        "}": true,
        "{": true,
        ";": true,
        ":": true,  // will capture the colon of the label, but also of the ternary
        "?": true,  // maybe complete it with "?"
        ",": true
    };

    function next() { // updated
        // ...
        t = T.type;
        if (withExtras && captureExtraTypes[t]) intoBuffer(T);
        if (SkipableWhiteSpace[t]) return next();
        v = T.value;
        loc = T.loc;
        if (withExtras && captureExtraValues[v]) intoBuffer(T);
        if (loc && loc.start) {
            line = loc.start.line;
            column = loc.start.column;
        }
        // ...
    }

   /*
    * prototyping CST extras with a decorator
    * better than adding flushBuffer to each function
    * decorators seem to be suitable for parser functions
    */
    function enableExtras () {
        console.log("Enabling CST");
        Object.keys(parser).forEach(function (k) {      
            if (typeof parser[k] === "function" && !parser[k].wrapped) {       
                if (k == "next" || k == "scan" || k == "pass" || k.indexOf("JSON")===0) return; // for my hacky wacky system
                console.log("wrapping "+k)
                var originalFunction = parser[k];
                var parseFunction = function () {                
                    var b = flushBuffer();
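                    // note: .call(this, arguments) passes the arguments object
                    // as a single argument; .apply(this, arguments) is probably
                    // what was intended here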
                    var node = originalFunction.call(this, arguments);
                    if (node) node.extras = b;
                    return node;
                };
                parseFunction.wrapped = originalFunction;
                parser[k] = parseFunction;
            }
        });
    }
    function disableExtras () {
        Object.keys(parser).forEach(function (k) {            
            if (typeof parser[k] === "function" && parser[k].wrapped) 
            parser[k] = parser[k].wrapped;
        });    
    }    
   enableExtras();

I can't spot the mistake: the parser runs into an endless loop when firing up the shell after decorating all the functions. So I can't prove yet that the extras are actually collected. But the small changes needed to make it possible look close to working.

Currently I know nothing more, but it seems like fun to find out.

getify commented 10 years ago

@edwardgerhold

I was also coming to the conclusion that it's just a flow from left to right, and there is no before and after.

I'm not sure I'd go so far as to say there would be no distinction between before and after (where you always prefer after on a previous node), because there are cases where there is no preceding node to hang an after onto, so you must use before on the present node itself.

For example:

// foo
a;

produces this AST:

{
    "type": "Program",
    "body": [
        {
            "type": "ExpressionStatement",
            "expression": {
                "type": "Identifier",
                "name": "a"
            }
        }
    ]
}

So, where would we encode the // foo comment? The only sensible place is a before attribute on the ExpressionStatement node (I don't think my heuristic above would include attaching before or after annotations to the Program node, it's kinda special -- nothing comes before or after the program).
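
Under that heuristic, the CST might look something like this (shape assumed from the earlier examples):

{
    "type": "Program",
    "body": [
        {
            "type": "ExpressionStatement",
            "expression": {
                "type": "Identifier",
                "name": "a"
            },
            "extras": {
                "before": "// foo\n"
            }
        }
    ]
}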

Another example:

var a = [/*one*/1/*two*/];

The /*one*/ comment needs to be a before annotation on the literal 1 node, as it can't go after some other node (there is no node for the [ itself).

millermedeiros commented 10 years ago

@getify what about empty arrays and empty objects? var a = [ /* am I lost? */ ]

getify commented 10 years ago

@millermedeiros That was the primary reason for inventing inside, where you would attach the annotation to the parent array literal node.
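
For your example, that might look something like this (again, shape assumed):

{
    "type": "ArrayExpression",
    "elements": [],
    "extras": {
        "inside": " /* am I lost? */ "
    }
}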

In my explorations, before and after covered about 90% of the grammar, another 9% was handled by inside, and the final 1% was covered by a virtual node like EmptyExpression. Note: these are just rough guesses at %'s, obviously. :)

edwardgerhold commented 10 years ago

@millermedeiros and @getify good to read it again. I couldn't pick up your and their points before, because I could not imagine what you were talking about. I think this little experiment helps a lot to see the machine doing its job. I see, the ArrayLiteral has no [ — that is what the node itself stands for, the ] inclusive, I would say. We could say, before and after relate to the [, the same for the object literal's {, and ] gets a special second property: beforeBegin, afterBegin, beforeEnd, afterEnd. Haven't I heard this from HTML? I already thought innerHTML and outerHTML had some similarities, but I haven't reached a conclusion. Oh, good — just in this moment I keep forgetting that I captured the captureValues, with [ and ] inclusive, in my extraBuffer, and that I flush the buffer before the node. That means the ArrayExpression receives [ /* comment */ and the next node after the array would catch the ] punctuator. I think that is predictable. So on specialNodes[type] I could flush the buffer again, so that I can flush the buffer with /* two */ ] back onto the ArrayExpression.

  1. Another possibility is determining the position by the name of the extras property. You have NO fixed set of extra properties; they only appear when they are there. So you can put inner, outer, left, right (before, after in its words) there, and the processor knows where to put them by property name. I think a drawing of the nodes on a piece of paper, with some arrows and characters flowing here and there, could form a picture. I shouldn't forget the // foo comment. I don't know if I said this already — no, I just said something about the flow from left to right, but had no capture order yet. The algorithm should be like zero/one or many: first you get foo, then the node, flush; then the next extras, the next node, flush; and so on. OK, that won't work so easily: it just came to my mind that there are places where I took an identifier and placed its string value onto the AST instead of the node (as required by the AST), and there the automatic storing of extras probably will not work. And as you say, you can cover 90% with before and after, 9% with inner, and you have the empty statement. Hmm. The EmptyExpression has only a string value ";" but could have something before or after. I didn't mention: modern IDEs have these full-control configurations over code formatting. I didn't orient myself by that; the first idea I had in mind was using numbers 1..n for setting the margin/padding width (margin and padding also seem to be good words to describe the captured whitespace in terms of a code formatter, best set with integer values in # of tabs, # of spaces or # of blank lines — in my pseudo code currently determinable by token.value.length, I would say).

getify commented 10 years ago

@edwardgerhold

We could say, before and after relate to the [

We could, but I don't think we need to. At least in my exploratory code, I was able to represent these cases as inside for an empty array literal, or before and after of any child nodes inside a non-empty array literal (same with object literals). What you suggest seems more complex than we need. Maybe I'm missing some nuance, but I never saw the need for all that complexity.

The EmptyExpression has only a string value ";" but could have something before or after.

EmptyStatement is represented as ;, but my virtual EmptyExpression (or whatever it might be called) would literally be represented as nothing (empty).

There are just a few rare cases where we'd need something like that, mostly in places where the standard AST has only something like an array for some thing, but the array is empty (has no nodes in it), and we need to represent some "extras" as being inside that thing. Since an array in the AST can't syntactically have a property on it (think JSON arrays), one option is to wrap that array in some object; the other option is to stick some EmptyExpression virtual node inside it. Either way we get an object to hang extras off.

edwardgerhold commented 10 years ago

Yesterday I couldn't continue, but meanwhile I've fixed my tokenizer's rd properties and some parser and runtime bugs. I copied my jscodegen.js to cstcodegen.js so that I can put out what I've parsed in. I hope to show you soon.

I disagree with the arrays: they can have properties. I tested it a while ago, and it works fine to this day. I call node.formals "FormalParameterList" and program.body "StatementList" (in ES6, ScriptItemList or ModuleItemList) and so on. It works. And JSON.stringify shows these properties as well as the arrays' contents. If I enter a function, the function's internal object is printed; the formals are printed with the [[FormalParameters]] property of the object, like [ { arg1node }, { arg2node }, type: "FormalParameterList" ]. The only catch is that it ruins for-in behaviour without a test for the index, but that should not be a problem, because for-in on an array should be considered bad practice anyway. In the paper, or for using the arrays of the AST, one should write: "Arrays in the AST also have named properties. Please don't use for-in/hasOwnProperty; use for (;;) or Array.prototype.forEach to iterate, or test for the index." I disagree because the arrays could be easily modified, and that would be great: they could be decoupled from their parent container without losing knowledge of what they are. Just try it out. You could easily add a lot of missing information to these "lists" (in the spec they are rather linked lists) without messing up the numbered array. If I could upgrade the AST, I would definitely add properties to the arrays to identify them 1:1 with what the spec says. A quick sketch follows below.
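As a minimal sketch of that idea (the types and names are the ones I use, not a spec):

```js
// a plain array carrying a descriptive .type property next to its entries
var body = [ { type: "ExpressionStatement" }, { type: "EmptyStatement" } ];
body.type = "StatementList";   // identifies the list 1:1 with the spec wording

// the numbered entries are unaffected
body.length;                   // 2

// iterate by index (or forEach), never with for-in,
// so the named property does not leak into the loop
for (var i = 0; i < body.length; i++) {
    console.log(body[i].type);
}
```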

I didn't get where you get an empty expression from. (I'll find one in the next paragraph.) The EmptyStatement returns NormalCompletion(empty) [empty is a dummy object used as a unique reference] and results in 10 when typing var x = 10; x;;;;;;;; But I have no empty expression. The empty string just returns a program with an empty body.

To see for myself, I just tried it: () returns the same empty program for me; (a) returns an expression statement with a (new, from here on) ParenthesizedExpression node containing an identifier. I should fix () to return a parenthesized expression with no .expression ... and when I see the .expression field set to null, I know how you came to an empty expression :-)

A list of what to put where would be fine now. :-) Did I miss some link that can be seen as the precedent of this upcoming specification? Or some point in the comments that I can treat as such?

Whatever, I'll walk in circles if I write more, so that's it for today. Hope to see you all upgrading your esprimas, acorns and other parsers to handle spaces, line terminators and comments soon, hehe.

getify commented 10 years ago

@edwardgerhold

I disagree with the arrays: they can have properties. I tested it a while ago, and it works fine to this day.

Of course, by saying that arrays can't have properties, I mean that their JSON stringification can't show them, not that they can't physically hold properties (they're objects, they clearly can).

And JSON.stringify shows these properties as well as the arrays' contents.

I'm not sure what I'm missing, but I don't see how JSON stringification of an array results in showing both the indexed values as well as any named properties?

```js
var a = [1,2,3];
a.foobar = "bazbam";
JSON.stringify(a); // "[1,2,3]"
```

As you can see, foobar is not stringified. So I'm not sure how you are getting around that in your examples?

I didn´t get, where do you get an empty expression

I made up EmptyExpression out of thin air, as a way to have a node of some (made-up) type that could sit in a location where I needed some node to hang extras off of.

The main example was for function call expressions (not, as previously discussed, function declaration statements), like this:

/*1*/ foo /*2*/ ( /*3*/ ) /*4*/

This parses to the following AST:

```js
{
    "type": "Program",
    "body": [
        {
            "type": "ExpressionStatement",
            "expression": {
                "type": "CallExpression",
                "callee": {
                    "type": "Identifier",
                    "name": "foo"
                },
                "arguments": []
            }
        }
    ]
}
```

Now, if we wanted to represent both /*2*/ and /*3*/ above, we have several options. One is to represent /*2*/ as an after extra on the callee node, and /*3*/ as an inner extra on the CallExpression node. Another option was to insert an EmptyExpression node into arguments which could hold /*3*/ and /*2*/ could be the assumed inner extra for the CallExpression.
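Annotated along the lines of that first option, the tree above might look like this (the extras token shapes are illustrative only, and /*4*/ could arguably live on the enclosing ExpressionStatement instead):

```js
{
    "type": "CallExpression",
    "callee": {
        "type": "Identifier",
        "name": "foo",
        "extras": {
            "before": [ { "type": "MultiLineComment", "value": "1" } ],
            "after":  [ { "type": "MultiLineComment", "value": "2" } ]
        }
    },
    "arguments": [],
    "extras": {
        "inner": [ { "type": "MultiLineComment", "value": "3" } ],
        "after": [ { "type": "MultiLineComment", "value": "4" } ]
    }
}
```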

These (and other) options were discussed at length in this long thread. I don't think there was a definitive conclusion as to the "right" way. But there were tricky cases like the new /* 0 */ ( /* 1 */ a( /* 2 */ ) /* 3 */) /* 4 */ ( /* 5 */ ) case where it wasn't clear that inner alone would suffice, and an extra dummy node like EmptyExpression could be a solution.

It remains to be finalized exactly in what cases (very rare) the EmptyExpression would be necessary, if any.

edwardgerhold commented 10 years ago

Man, I am stupid. JSON.stringify really doesn't show the .type property on an array. That means data exchange is hurt. But there is the possibility to go with .body and .bodyExtras, .sequence and .sequenceExtras, .formals and .formalsExtras, or however you wish to do it. Node.js prints it out. I look forward to implementing something of my own for display, because Node shows a lot of lines instead of a nicely formatted object describing the internals, but that's another topic for elsewhere. Well, my idea would be to use the arrays with properties internally, save the extras off before stringifying, and restore them while loading. And if that's too complex to do, I would just stick with .___Extras and put them next to the list.
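A minimal sketch of that save-and-restore idea (the helper names and the programNode variable are mine, purely illustrative, and only .body is handled for brevity):

```js
// move the named property off the array into a sibling "...Extras" slot
// before stringifying, and put it back after parsing
function saveArrayType(node) {
    if (Array.isArray(node.body) && node.body.type) {
        node.bodyExtras = { type: node.body.type }; // survives JSON.stringify
        delete node.body.type;
    }
    return node;
}

function restoreArrayType(node) {
    if (Array.isArray(node.body) && node.bodyExtras) {
        node.body.type = node.bodyExtras.type;      // back onto the array
        delete node.bodyExtras;
    }
    return node;
}

var json = JSON.stringify(saveArrayType(programNode));
var restored = restoreArrayType(JSON.parse(json));
```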

Added: the other point is the dummy object or empty expression Kyle already suggested. I noticed it as the comment arrived while I was revisiting for this comment: it can sit inside the array. Let's define it exactly, on list[0] and list[list.length-1]. And it gets its own type, "CSTExtras" for example, for lack of a better name. Defining such objects would be very wise. An interpreter implementing the AST can skip the CST extras easily, because the writer should know your paper if he comes across that object anyway.

edwardgerhold commented 10 years ago

/*1*/ - callExpr.callee.extras.before (callee is foo, the identifier)
foo - Identifier (it is callExpr.callee itself)
/*2*/ - callExpr.callee.extras.after
"( " - convention: this is where callExpr.extras.before starts
/*3*/ - callExpr.extras.before = ["( ", "/* 3 */"]
") " - then this is callExpr.extras.after = [") ", ...], with or without the ")"
/*4*/ - it's [")", "/* 4 */"] in callExpr.extras.after and not deferred to the next production. With a \n? [")", "/* 4 */", "\n"]

So callExpr's callee stores what is before and after the callee (the identifier), and callExpr.extras stores what follows the parens; the parens can be recorded here too, but then again I could leave them off, because the whitespace is perfectly covered by the callee's extras and by what follows in before. Got my first one :)

Trying a little one, possibly wrong: /* 1 */ foo /* 2 */ /* 2 1/2 */ . /* 3 */ bar /* 4 */

memberExpr.object.extras.before = ["/* 1 */", " "]
memberExpr.object.extras.after = [" ", "/* 2 */", " ", "/* 2 1/2 */", " "]
memberExpr.extras.before = [" ", "/* 3 */", " "]
memberExpr.extras.after = [" ", "/* 4 */"]

I don't think that memberExpr.property should carry data here. But now I've hit something.

foo[ /*3*/ "bar" /*4*/ ] /* 5 */

memberExpr.extras would start at [ with before and at ] with after, taking the inner expression on .computed:
memberExpr.extras.before - won't exist.
memberExpr.extras.after - will exist.
The "bar" StringLiteral's before and after carry the data.

Is this behaviour inconsistent?

Is the .property of the MemberExpression an Identifier node? (I would have to look.) If so, the solution is to let the property node carry the extras in both cases, the identifier as well as the computed expression, which is consistent.

So again:

memberExpr.object.extras.before = ["/* 1 */", " "]
memberExpr.object.extras.after = [" ", "/* 2 */", " ", "/* 2 1/2 */", " "]
memberExpr.property.extras.before = [" ", "/* 3 */", " "]
memberExpr.property.extras.after = [" ", "/* 4 */"]

How easy would it be to let each processor do its own thing, given a little specification of these names? That one could put the extras here and another there, and only the reading, the position where they start, is a convention, so both memberExpr.extras and memberExpr.property.extras could refer to the same place? Good idea? Wouldn't we get the same results on each machine, just with the data stored in one or the other? When writing the output, it is exactly the same stream of bytes; only internally could it be the one or the other. (Um, concerning this, I would answer myself: no, it is easier to define one behaviour for all; there are not that many nodes.)

edwardgerhold commented 10 years ago

new (a())() is a NewExpression; newExpression.callee is a ParenthesizedExpression (the proposed new node); newExpression.callee.expression is the CallExpression. Those are the nodes which should come out of the parser.

newExpression: "new" just new (...before ) ... after parenthesized expression: (if node is not available parse with parens into extras of in say in (a) the a) ( ... = before ) ... = after callexpression: calllee got before and after for the identifier ( ... before ) ... after

new - is automatically written or parsed. /* 0 */ - wrong: newExpression.extras.new??? Just add a third property for it.

The idea of adding a third property for new is easily understandable. The ParenthesizedExpression node is now described together with an example, and with how it could be done without one. I see there are double cases where an extra could end up here or there, so I should repeat my question: can you define it so that an extra may sit in either of the two extras, however the processor reads it, as long as the result is exactly the same? I think that is what will practically happen over the whole parsing process, that one comment could be attached to this node or to that one. Hey, this is fun.

edwardgerhold commented 10 years ago

Maybe for the CST you should specify that all nodes which carry identifiers as plain strings must change the string to a real Identifier node.

( edit: no, that would break the parsers if you forced them to change that. I have to work around the string nodes where they are and probably exchange them again. In the function foo example the identifier is just a string ... then there is still the good old .idExtras (around foo) idea, which I had scratched ;-) without using the function keyword as before and redefining what is behind it, or just capturing them with the parens. OK, I see, I will come to some new clues and force myself to test all nodes with some examples somehow. )

What is a "between" in a StatementList, in the array? If i enable .id Identifier Nodes, i have function /* 3 */ foo /* 4 */ with the possibility to need nothing before behind the function keyword. but if i have /* 1 */ function it could be in the prg.body [] as [ { extras }, { functiondeclaration }, { extras }, { functiondeclaration } ]

This would be completely the same as your possibly introduced node for inside empty arrays, and could be used for both purposes. A program which claims to use the CST knows how to handle it, and my code for the AST would know how to skip them, or turn them off in interpreter mode or when compiling the code, or just use the AST, which means essentially the same as the diagram says.

{ type: "Extras", extras: [ { type: "MultiLineComment", value: " foo " }, {type: "LineTerminator", value: "\n" }, { "WhiteSpace", value: "\t\t\t\t" } ], loc: { start, end }

Draft, changing my parse functions: if (extraBuffer.length) callee.extras.before = exchangeBuffer(), which could be done in one step as exchangeBuffer(/*dest*/callee.extras, /*target*/"before"), which does: if (extraBuffer.length) dest[target] = extraBuffer; extraBuffer = [];. Maybe I ask if (withExtras) before, if I have a parser doing all these jobs, to save the function call, which costs more in a parser than a bool does. (But a parser with decorators to change behaviour still holds my interest; it could also be wrapped with a compiler, or be wrapped with extras at the cost of the call itself plus the code.)

Something to strike from the list are the .type properties for the arrays. One should allow them, in one clause, for descriptive or whatever reasons, but write in the same clause that JSON.stringify and JSON.parse will not support them, so they get lost and the system MUST NOT depend on them. Other things to scratch off? I will never need bodyExtras or formalsExtras if I parse the way I tried to figure out above.

( edit: later I came to the conclusion that I could need them exactly there, where Identifier nodes are replaced with strings and I don't want to break the existing parser. )

But again, something to put on the list: the concrete { type: "Extras", extras: [] } object for inside arrays. This idea seems very good, if the "Extras" node can be placed anywhere in the arrays, between two nodes. That covers the cases before keywords, for example. (See the sketch below.)
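For illustration, a sketch of such an "Extras" node sitting between entries of a body array (the shape follows the draft above; the statement nodes are abbreviated):

```js
// /* foo */ foo(); bar();  -- hypothetical body with an Extras entry
var body = [
    {
        type: "Extras",
        extras: [
            { type: "MultiLineComment", value: " foo " },
            { type: "LineTerminator",   value: "\n" }
        ]
    },
    { type: "ExpressionStatement" /* foo() */ },
    { type: "ExpressionStatement" /* bar() */ }
];

// an AST-only consumer can simply skip them:
var astOnly = body.filter(function (node) {
    return node.type !== "Extras";
});
```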

A good thing to figure out is how to do the comment wrap for every node, no? Then most of the cases are covered (no, all of them, except for the rule which must be defined for extras that can be interpreted as belonging to either this extras or that one). They will simply be nested (which is where I suspect the layering will happen).

edwardgerhold commented 10 years ago
Hi,
I've written this up to recap what I've done so far. Not much, considering that I started working on my tool again but spent more time on other things than on implementing the CST. I had forgotten what I'd left behind during the lectures and walked into a pile of leftover work.

```js

/*
    Tokenizer
*/

// from the beginning
function lexWhiteSpace() {
    var spc = lookahead;        // the whitespace character to collect
    var spaces = "";
    while (lookahead === spc) { // collect the whole run of this character
        spaces += spc;
        next();
    }
    return { type: "WhiteSpace", value: spaces };
}

/*
    Parser
*/

// a real function, already shown
function next() {       
    // count up, fetch next token, analyse
    if (withExtras && captureExtrasByType[t]) extraBuffer.push(token);
    // ..
    if (withExtras && captureExtrasByValue[v]) extraBuffer.push(token); // i disabled this line
    // skip ws and comments for the parser
    if (but_regular_skipable[t]) return next();
    // return token
    return token;
}

// many things are done when standing on a specific token;
// maybe it helps to automate some of it with match().
// there can be boolean/string state plus a currentNode pointer
// updated with each step. i haven't tried this out yet. (hypothetical)
function match(expected) {
    if (v === expected) {
        if (withExtras && extraBuffer.length) {
            // requires a pointer to currentNode on the stack
            // requires a flag set to before/after or the field name
            dumpExtras2(currentNode, flag); // assigns extraBuffer to node.extras[flag]
        }
        next();
        return;
    }
    throw new SyntaxError("expected " + expected + ", got " + v);
}

// the lazy creation of the extras object is possibly not the preferable way;
// in a hand-written update of a parser, maybe each function should declare var extras;

function dumpExtras(node, prop, dir) {   // dumpExtras(id, "id", "before");     
    if (!node || !extraBuffer.length) return;
    if (!node.extras) node.extras = {};
    if (!node.extras[prop]) node.extras[prop] = {};
    node.extras[prop][dir] = extraBuffer;
    extraBuffer = [];
}

function dumpExtras2(node, prop) {   // dumpExtras2(id, "before");        
    if (!node || !extraBuffer.length) return;
    if (!node.extras) node.extras = {};
    node.extras[prop] = extraBuffer;
    extraBuffer = [];
}

// now a real parser function; note that getting and storing the loc
// information could be refactored, too. it all comes together here.

parser.EmptyStatement = EmptyStatement;
function EmptyStatement() {    
    var node;
    if (v === ";") { // ot: explement this to there where emptystatement is called and test cheaper 
        node = Node("EmptyStatement");  
        node.loc = makeLoc(loc && loc.start, loc && loc.end);
        if (withExtras) dumpExtras2(node, "before"); // has everything prepending the ; 
        pass(";");  
        if (withExtras) dumpExtras2(node, "after");  // has everything between ; and the next() token (which could take the buffer if it´s not dumped here)
        if (compile) return builder.emptyStatement(loc);
        return node;
    }
    return null;
}

// i am better at parser writing today because of the lectures, but essentially it's much the same

parser.ReturnStatement = ReturnStatement;
function ReturnStatement() {
    var node, l1, l2;
    if (v === "return") {
        l1 = loc && loc.start;
        node = Node("ReturnStatement");

        if (withExtras) dumpExtras2(node, "before"); 
        pass("return");
        if (withExtras) dumpExtras2(node, "after");

        if (v !== ";") {
            if (ltNext) {
                l2 = loc && loc.end;
                node.loc = makeLoc(l1, l2);
                return node;
            }
            node.argument = this.Expression();
        }
        skip(";"); 
        l2 = loc && loc.end;
        node.loc = makeLoc(l1, l2);
        return node;
    }
    return null;
}
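// e.g. for "/*a*/ return /*b*/ x;" the function above would put the /*a*/
// comment token into node.extras.before and the /*b*/ token into
// node.extras.after (whitespace tokens ride along in the same buffers)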

// because i am not done yet, i can not prove a left-to-right flow with
// just one extras property per token.

/*
    CodeGenerator
*/

// this function handles the translation of the builder arguments for all types
// all following are real functions

function callBuilder(node) {
    // if no node, return the empty string; we may be concatenating
    if (!node) return "";
    // if it's an array, concat the results of all entries
    if (Array.isArray(node)) {
        var src = "";
        for (var i = 0, j = node.length; i < j; i++) {
            src += callBuilder(node[i]);
        }
        return src;
    }
    // look at the type to create the arguments array
    var type = node.type;
    var args = []; // will be passed when calling the builder interface
    switch (type) {
        // ..
        case "WhiteSpace":        // these return their token's value and come from the extras buffer
        case "LineComment":
        case "MultiLineComment":
        case "LineTerminator":
            args = [node.value, node.loc]; // they are from the tokenizer stage and have no extras prop
            break;
        // ..
        default:
            return "";
    }
    // adjust the type name to the builder interface ("WhiteSpace" -> "whiteSpace");
    // the difference can also be mapped by a hashtable { "Statement": "statement" }
    var name = type[0].toLowerCase() + type.substr(1);
    return builder[name].apply(builder, args);
}

builder.whiteSpace = function (value, loc) { // these take no extras argument
    return value;
};
builder.lineComment = function(value, loc) {
    return value;
};
builder.multiLineComment = function (value, loc) {
    return value;
};
builder.lineTerminator = function (value, loc) {
    return value;
};

// I started to modify the builder operations
builder.emptyStatement = function emptyStatement(loc, extras) { // these all take a trailing extras argument
    var src = "";
    if (extras && extras.before) src += callBuilder(extras.before);
    src += ";";
    if (extras && extras.after) src += callBuilder(extras.after);
    return src;
};
builder.withStatement = function withStatement(obj, body, loc, extras) {
    var src = "";
    if (extras && extras.with) src += callBuilder(extras.with.before);
    src += "with";
    if (extras && extras.with) src += callBuilder(extras.with.after);
    src += " (" + callBuilder(obj) + ") " + callBuilder(body); // has to be continued here
    return src;
};
builder.debuggerStatement = function debuggerStatement(loc, extras) {
    var src = "";
    if (extras && extras.before) src += callBuilder(extras.before);
    src += "debugger";
    if (extras && extras.after) src += callBuilder(extras.after);
    return src;
};

// i see, i will manually have to replace all the nicely formatted " (" and ") "
// with calls to the builder to put the original back together.
```

Let's show the results of this implementation:

``` js
linux-dww5:~ # es6
es6> function x() { /* 1 */ return /* 2 */ 10; }
undefined
es6> x.toString()
function x () {
       /* 1 */ return /* 2 */  10;
}
es6>
```

So far. I didn't manage to do it the way I wanted, because I walked into a pile of work when opening the project again. The parser needs more work before it can be used.
edwardgerhold commented 10 years ago

I am sitting at an outdoor subway station waiting for my girlfriend. Over the week I have worked on the tokenizer and parser to remove the critical bugs which locked it up. Today I got far enough that the proposed decorator functions work. The thing is, the concept isn't developed very far yet; it's just a good starting point. I guess I will take another week to catch up before I can report.