getify commented 10 years ago

I am going to try to articulate in detail what my initial plans and goals are for this group. These things are of course open to discussion, but I want to stress one thing at the beginning:

This is not design by committee -- those types of efforts usually fail miserably. This is not democratic -- those things usually just divide constituency and never lead to consensus. I am going to lead this group and effort, because it needs to happen and because no one has done it before, so I stepped up. I am going to do the best I can to balance what input people give with the overall big picture and goals.

And I am going to actively seek your feedback to make sure we do the best we can.

Mission Statement

To develop a CST (concrete syntax tree) format which preserves/represents concrete syntax elements (whitespace, comments, etc) in a data structure (tree) "alongside" the AST (abstract syntax tree) elements (VariableDeclaration, BinaryExpression, etc), and to evangelize this new format to all tools in the JS tooling ecosystem (parsers, analyzers, transformers, code generators) to gain substantial/complete adoption as a new IR (intermediate representation) standard.

Charter

This ad hoc group of volunteer members seeks to develop a single, new standard (which we're currently calling "CST" -- Concrete Syntax Tree) for the IR (intermediate representation) of JS source code as it passes through various tools in the JS tooling ecosystem, from parsing to analysis to transformation to code (re)generation.

This new standard will replace/augment the existing standard (AST -- Abstract Syntax Tree as standardized by the SpiderMonkey Parser API). Note: That does not mean ASTs won't exist, but it means the preferred tree format for exchange will be the CST, and that AST will become a reduction of the CST for certain use-cases that only need/care about AST information.

The goal of the new standard tree format is to provide a standard and reliable way to represent all "concrete syntax" elements which are normally discarded and not kept by the AST, such as "unnecessary" whitespace, comments, grammar-redundant ( ) or { } pairs, etc. These elements are needed for a variety of use-cases which cannot tolerate "information loss" in the round-trip from source-code to IR back to source-code.

This group seeks to discuss several existing proposals for a CST format, hammer out any problems with them, and find one that can gain the most support. We will publish a detailed specification/documentation for this new format, and evangelize and work with all the JS tooling ecosystem members to gain adoption and implementation as quickly as possible.

It is a success condition of this group if many or all JS tools agree, even in principle, to eventually move to this new standard IR format, even if they only can agree to support both CST and AST rather than replace AST. Furthermore, there must be at least one parser that produces CSTs from source code, and at least one code generator which takes a CST and produces code output, as this round-trip is inherent to nearly all use-cases this group concerns itself with.
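
As a rough illustration of the round-trip requirement, here is a minimal sketch of a code generator that reproduces the original text from a hand-built tree. The `extras` property, its `before`/`after` slots, and the `Whitespace`/`MultiLineComment` entry types are assumptions made for this example only, not a format this thread has settled on, and only three node types are handled.

```javascript
// Sketch of the "CST -> source" half of the round trip.
// ASSUMPTION: concrete syntax is stored in a hypothetical `extras` property
// with `before`/`after` arrays of { type, raw } entries.

const source = "a  + /* add */ 1";

// Hand-built tree standing in for what a CST-producing parser might emit.
const cst = {
  type: "BinaryExpression",
  operator: "+",
  left: {
    type: "Identifier",
    name: "a",
    extras: { after: [{ type: "Whitespace", raw: "  " }] },
  },
  right: {
    type: "Literal",
    value: 1,
    raw: "1",
    extras: {
      before: [
        { type: "Whitespace", raw: " " },
        { type: "MultiLineComment", raw: "/* add */" },
        { type: "Whitespace", raw: " " },
      ],
    },
  },
};

// Concatenate the raw text of a list of extras (empty string if absent).
function rawOf(list) {
  return (list || []).map((entry) => entry.raw).join("");
}

// Emit code for a node, wrapping it in whatever extras were preserved.
function generate(node) {
  const extras = node.extras || {};
  let body;
  switch (node.type) {
    case "Identifier":
      body = node.name;
      break;
    case "Literal":
      body = node.raw;
      break;
    case "BinaryExpression":
      body = generate(node.left) + node.operator + generate(node.right);
      break;
    default:
      throw new Error("Unsupported node type: " + node.type);
  }
  return rawOf(extras.before) + body + rawOf(extras.after);
}

console.log(generate(cst) === source); // true
```

With the `extras` removed, the same generator would emit `a+1`: still valid code, but no longer the text the author wrote.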

Assumptions/Observations

AST is a lossy format, in that concrete syntax information is lost when a parser takes a program and outputs an AST. This lossy format has served many common use-cases well, but it has not served at all the use-cases which need to retain (and/or use!) this information.

As such, the new CST format that will retain this information must be seen as the primary format, as you can always strip out information from a CST representation to get only an AST, but you cannot go the other direction and restore information which was lost. Note: Some use-cases do call for adding in new "concrete syntax information", such as default whitespace, etc, but that's different than preserving (while parsing) the original information.
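
The one-way reduction described above can be sketched in a few lines. This assumes, purely for illustration, that all concrete-syntax information hangs off a property named `extras`; the node shapes follow the SpiderMonkey Parser API style, and the extras entries are hypothetical.

```javascript
// Sketch: reduce a CST to a plain AST by dropping concrete-syntax annotations.
// ASSUMPTION: concrete syntax lives under a hypothetical `extras` property.

function stripExtras(node) {
  if (Array.isArray(node)) return node.map(stripExtras);
  // Primitives, null, and RegExp literal values are returned untouched.
  if (node === null || typeof node !== "object" || node instanceof RegExp) {
    return node;
  }
  const out = {};
  for (const [key, value] of Object.entries(node)) {
    if (key === "extras") continue; // the lossy step: concrete syntax is discarded
    out[key] = stripExtras(value);
  }
  return out;
}

// Hand-built CST for: var answer = 42; // the answer
const cstNode = {
  type: "VariableDeclaration",
  kind: "var",
  declarations: [
    {
      type: "VariableDeclarator",
      id: {
        type: "Identifier",
        name: "answer",
        extras: { before: [{ type: "Whitespace", raw: " " }] },
      },
      init: {
        type: "Literal",
        value: 42,
        raw: "42",
        extras: { before: [{ type: "Whitespace", raw: " " }] },
      },
    },
  ],
  extras: {
    after: [
      { type: "Whitespace", raw: " " },
      { type: "SingleLineComment", raw: "// the answer" },
    ],
  },
};

// A new tree is returned; the CST itself is left unmodified.
const astNode = stripExtras(cstNode);
console.log(JSON.stringify(astNode, null, 2));
```

Nothing in `astNode` records where the comment or spacing was, which is why the reverse direction cannot be implemented without the original source.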

We will pay close attention to the tradeoffs in complexity/performance that this implies, and be sensitive to that in what we propose. We will not grossly degrade the performance of existing tools by forcing them to do things like tracking concrete-syntax which they have no concern with, except as it is minimally required to support the rest of the JS tooling ecosystem and use-cases.

This group is not an open-ended, unstructured exploration. It will be guided and informed by prior work, and seek to keep to the narrowest scope and process as necessary to get to a proposed solution with widest adoption/consensus.

There have been extensive discussions about various approaches to CST tracking over the past 6+ months in various places. Two main proposals surfaced in that discussion, as detailed in the main README of this repo. It is my goal that we first validate both of these proposals, or invalidate them (with proof, not opinion).

We will entertain (and indeed, seek out!) discussion about concrete deficiencies in these proposals, but we will not entertain bikeshedding on opinions of taste on any proposal. If there are unresolvable deficiencies in current proposals, we will entertain alternate proposals, but again we will not get mired in bikeshedding, but rather seek to solve whatever problems exist in any given attempted solution.

Since AST is the current standard IR format for these tools, whatever the CST settles as must cause the least possible friction for existing tools when augmenting or replacing their current AST handling.

Throwing out the entire AST format and producing a new CST format that is wholly unlike the current AST is likely to produce a lot of friction to implementation with existing tools, even if it can be demonstrated that it would be superior (for some definition of "superior").

As such, a CST that augments AST in some way is generally preferable, as it would generally lead to less friction to implementation (fewer changes to existing tools' code). We should prefer incremental evolution of the current standard rather than reinvention.

The form that the CST takes (and how it co-exists with AST elements) matters, because it directly affects how easily the IR of code can pass through multiple tools in a chain of processing. A single tree (that can be textualized as JSON) is the current norm, and it is preferred (again, to minimize friction) that the process not become significantly harder, such as by creating multiple different streams of data to pass around.
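
To make the single-tree hand-off concrete, here is a small sketch of two toy "tools" exchanging one JSON string. Both functions and the `extras` shape are hypothetical stand-ins invented for this example, not real packages or an agreed format.

```javascript
// Sketch: one JSON-serializable tree handed from tool to tool.
// ASSUMPTION: `extras` and its entries are hypothetical; both tools are toys.

// Tool 1: renames identifiers. It has no knowledge of `extras`, yet they
// survive because they travel inside the same tree.
function renameTool(json, from, to) {
  const tree = JSON.parse(json);
  (function visit(node) {
    if (Array.isArray(node)) return node.forEach(visit);
    if (node === null || typeof node !== "object") return;
    if (node.type === "Identifier" && node.name === from) node.name = to;
    Object.values(node).forEach(visit);
  })(tree);
  return JSON.stringify(tree);
}

// Tool 2: counts preserved comments, relying on the extras that tool 1
// passed through untouched.
function countComments(json) {
  let count = 0;
  (function visit(node) {
    if (Array.isArray(node)) return node.forEach(visit);
    if (node === null || typeof node !== "object") return;
    if (typeof node.type === "string" && node.type.endsWith("Comment")) count++;
    Object.values(node).forEach(visit);
  })(JSON.parse(json));
  return count;
}

const input = JSON.stringify({
  type: "ExpressionStatement",
  expression: {
    type: "Identifier",
    name: "a",
    extras: {
      after: [
        { type: "Whitespace", raw: " " },
        { type: "MultiLineComment", raw: "/* keep me */" },
      ],
    },
  },
});

const afterRename = renameTool(input, "a", "total");
const comments = countComments(afterRename);
console.log(afterRename, comments);
```

Had the comments lived in a second, separate stream, every tool in the chain would need to accept, forward, and keep that stream in sync with the tree.
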
There have already been a lot of ad hoc explorations by various tools into tracking whitespace and/or comments, but each tool has done it differently, and none of them have handled all concrete-syntax preservation. All these different previous attempts inform our current attempts, but they are explicitly considered insufficient as the mission is to preserve all concrete-syntax in a standard and agreed-upon way across most/all tools.

As such, the CST effort seeks to replace any of those previous non-standard approaches, even eventually if not immediately. We want to solve problems, not create more problems for future users by having multiple different competing ways to do things and no consensus on how to do it properly.

millermedeiros commented 10 years ago

I think you are "throwing the baby out with the bath water.." with all the assumptions/observations. This should be more a description of the problems we are trying to solve than a list of preconceptions about what should be the correct solution. Each problem requires a different approach and by locking down to the current AST tree structure you might be making things harder than they should.

If that is the approach for the development of the "standard", I'm not interested anymore since it won't make my work any easier and I don't have that much free time anyway...

this issue even made me want to create another "standard" that would be better suited for code formatters, code instrumentation and linters...

[image: standards]

PS: if you guys really end up augmenting the AST you should call it AST++, it is not a CST.

millermedeiros commented 10 years ago

The main concern shouldn't be on how hard it is to convert between CST & AST.. the focus should really be: how hard is it to manipulate the data? how hard/expensive is it to rebuild the program?

getify commented 10 years ago

@millermedeiros

I'm sorry you feel so strongly against how I'm approaching the process. I do hope you'll reconsider and help us make something good.

with all the assumptions/observations

I think the discussions and experiments that have gone before this project should inform and guide it. I'm trying to keep the scope narrow for now. I would rather someone prove that the approach is insufficient than simply express distaste and go away. If we can prove the scope should be more broad, or that other approaches are necessary, we will adjust and expand.

But I'm not going to start with "ok, everybody, let's throw every thing we can imagine against the wall and see what sticks." That's asking for this process to die. I sincerely want it to succeed.

you might be making things harder than they should. the focus should really be: how hard is it to manipulate the data? how hard/expensive is it to rebuild the program?

As I've said several times, I have been writing transformations and code formatting logic against my proposal (a CST that's a super-set of the AST), and I've found it substantially easier to do so that way than other approaches. So I guess it's subjective what's "easier" and "harder". But I am certainly trying to make something easier.

I don't think it's fair to ascribe mal-intent or ignorance to the process that's just getting going.

Then again, I think I'm arguing from the perspective of quickest path to greatest interop among a lot of tools, and it feels like you're arguing that your project and way of doing things won't benefit from that proposed approach.

Surely you can see the needs of all tools and interop is greater than the needs of any one tool? Again, I hope you'll reconsider from the perspective that the easiest/quickest to implement standard (which includes modifying quite a few existing tools dealing with ASTs) will be the best option to widest interop.

Am I certain the current proposals under discussion are the right option? Absolutely not. I strongly suspect they're pretty close. I've heard quite a few "I don't like that" or "I'm skeptical" or "Have you thought about this...?" (which I have), but as of yet we haven't seen anything that demonstrates that we're totally off base.

We should give some honest and earnest and open-minded effort to what we have before we throw it away and move on to something else.

getify commented 10 years ago

create another "standard" that would be better suited for code formatters

Since I started my exploration of this topic almost a year ago specifically because I'm trying to build a configurable code formatter, and since then that's exactly what my code experiments have been towards, it again seems rather one-sided to think that we're so off base on that use-case that you'd need to hastily rush off to divert attention and efforts.

I know you have an opinion that I'm off base and misguided and mistaken, but I think your influence and experience will help us more if you stay than if you go.

ericelliott commented 10 years ago

Can we link to some reference implementations of existing CSTs?

joeedh commented 10 years ago

I like this mission statement. I look forward to the proposals, since comment/WS tracking is something I need to add to my compiler.

I do think the section on the AST format should be clearer. I interpreted it as being somewhat flexible; e.g. each app could have its own tweaks to the AST, but would communicate with each other via some standard form (ES5.1?). I could be totally wrong, but that interpretation seems rather different than millermedeiros's interpretation.

getify commented 10 years ago

@joeedh the spirit of the project (regardless of the proposal we settle on) is that there would be one standard for the CST. That is, internally, tools can do whatever they want, but the externally available CST tree used for input/output would need to adhere to the one standard. If you have any suggestions on how I can make that clearer, happy to clarify. :)

jsoverson commented 10 years ago

This is not design by committee -- those types of efforts usually fail miserably. This is not democratic -- those things usually just divide constituency and never lead to consensus.

These are assumptions. Plenty of committees don't fail miserably, and democracies don't need to work on consensus, that's part of the point.

I am going to lead this group and effort, because it needs to happen and because no one has done it before, so I stepped up. I am going to do the best I can to balance what input people give with the overall big picture and goals.

This is also unnecessary and off-putting. Whoever leads most effectively should end up being the leader. You are currently trying to do that, but self-congratulation + an edict makes this an unpleasant start.

I'm not ready to ignore this, but it's easy to side with the initial reaction from @millermedeiros.

getify commented 10 years ago

Plenty of committees don't fail miserably

I didn't say it's not a committee. I said it's not design-by-committee, which is a distinct thing. Look at how TC39 governs itself. They have said they don't "design by committee", but instead have champions of each topic who lead the effort, present their results, and work towards consensus. While I'm not on TC39, I'm trying to abide by a similar spirit.

I'm also trying to avoid the nightmares of other design-by-committee efforts I have been part of, where the crushing weight of bikeshedding and "hey, I've got a whole completely different idea just for the sake of noise" prevent anything from actually ever happening.

This is also unnecessary and off-putting

I'm sorry this is coming off as off-putting or self-congratulating. I certainly have absolutely no such intention.

I've seen far too many "committees" not make such guiding principles clear, and I've seen that lack of clear expectation spiral quickly out of control. I don't want that. I want us to actually succeed, and I want that to happen sometime in the practical near future.

joeedh commented 10 years ago

@jsoverson design by committee may not always fail, but it doesn't always succeed, either (the aborted ES4 spec is one example, as was the Gimp project a few years back). I think @getify has it right here.

@getify, that makes sense. Would the CST be based on es5.1 or es6? If es6 is going to take a while to fully roll out, it might make sense to go with es5.1.

getify commented 10 years ago

@joeedh

Would the CST be based on es5.1 or es6

I'm not sure that the ES5.1 parts are going to change in the ES6 AST (doubt it, anyway), so I'm not sure that it terribly matters which one we pick. Certainly, we have to be open to future changes to the AST and sort of automatically (or as easily as possible) manage the CST format accordingly so that we keep updated with future JS.

One thing I know is that some of ES6 has filtered into the AST produced by esprima and others, so we can already start to evaluate those. And we have to be future thinking to make sure the new ES6 stuff isn't going to present deal breakers that our CST idea cannot handle.

Bottom line? I think the answer is: both. :)

ljharb commented 10 years ago

Confining support to ES6 will be incredibly limiting for the (I assume) majority of web developers out there who are still supporting ES3 browsers (IE 8 here); it will be a long time before we even have ES6 browsers, let alone are able to support them exclusively.

I'm all for supporting new language features but I'd rather start with a tool that works with ES5 and not with ES6, than the reverse, and I'd find that way more useful.

getify commented 10 years ago

@ljharb If we go after a CST like my proposal, which is basically adding attributes as a layer of extras on top of whatever AST nodes are in the tree, I'm not sure how it could be exclusive in either direction? Am I missing something?

ljharb commented 10 years ago

Nope, just replying to the implicit suggestion in @joeedh's comment that we'd pick ES6 if there was a choice. I don't think there has to be a choice, or should be one - I think CST as a strict superset of AST will solve tons of problems and work for everything :-)

joeedh commented 10 years ago

You have a point about supporting ES3 browsers. But like @getify says, we should be able to support all three, given that ES5.1 (if I'm remembering right) doesn't add any new grammar, while ES6 could be supported by allowing the use of new grammar productions, while ignoring the ones ES6 removes (I can't remember, was it just the with statement? Did they even end up removing with?).

I feel silly now that this didn't occur to me before getify mentioned it, heh (and I wrote the question).

On Wed, Mar 26, 2014 at 12:02 PM, Jordan Harband notifications@github.com wrote:

Nope, just replying to the implicit suggestion in @joeedh's comment that we'd pick ES6 if there was a choice. I don't think there has to be a choice, or should be one - I think CST as a strict superset of AST will solve tons of problems and work for everything :-)


getify commented 10 years ago

I would personally want to write the CST standard so that it does not detail which nodes are or are not in the tree, but acts more like a thin veneer layer over the tree, where we talk about extras and only mention specific node types as examples of how extras attach. That way, the CST spec is "whatever the underlying AST tree spec is" + "extras", which means as a spec, it survives longer (and goes back further).
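
One way to picture the "thin veneer" idea is a traversal that finds extras without naming any node type. The sketch below assumes a hypothetical `extras` property whose slots (`before`, `after`) hold `{ type, raw }` entries; none of this is fixed by the thread, and the input tree is hand-built.

```javascript
// Sketch: gather concrete-syntax extras from any tree, regardless of grammar.
// ASSUMPTION: extras hang off a hypothetical `extras` property on each node.

function collectExtras(node, found = []) {
  if (Array.isArray(node)) {
    for (const child of node) collectExtras(child, found);
    return found;
  }
  if (node === null || typeof node !== "object") return found;
  for (const [key, value] of Object.entries(node)) {
    if (key === "extras") {
      // Record each extra along with the node type and slot that own it.
      for (const [position, items] of Object.entries(value)) {
        for (const item of items) {
          found.push({ owner: node.type, position, raw: item.raw });
        }
      }
    } else {
      collectExtras(value, found);
    }
  }
  return found;
}

// The walker has no special case for ArrowFunctionExpression (an ES6 node),
// yet extras attached beneath it are still found.
const tree = {
  type: "ArrowFunctionExpression",
  params: [
    {
      type: "Identifier",
      name: "x",
      extras: { after: [{ type: "Whitespace", raw: " " }] },
    },
  ],
  body: {
    type: "Identifier",
    name: "x",
    extras: {
      before: [
        { type: "Whitespace", raw: " " },
        { type: "MultiLineComment", raw: "/* identity */" },
      ],
    },
  },
};

const found = collectExtras(tree);
console.log(found);
```

Because the function keys only on the `extras` property, new node types from later language editions need no changes here, which is the longevity argument being made for the layered approach.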

Surely, there will be some (hopefully limited) cases in the future where we need to update the CST spec to give some specific guidance on some new form that shows up in ES7 or whatever, but again, I hope the CST isn't a re-hashing of all the AST stuff, but rather a superset augmentation of it.

FWIW, the other proposals which seem to create separate data structures for CST (rather than layering on top of it) seem much more brittle in this sense, because they probably have to be a lot more explicit about all the nodes and node types, which means they're making lots of explicit references to some particular grammar. I think that path will make the CST spec harder to maintain going forward.

getify / concrete-syntax-tree

Charter, group mission, observations/assumptions #4
