bablr-lang / bablr-vm

A VM for enforcing language rules on agAST trees
MIT License

Feature: Comment attachment #3

Closed conartist6 closed 2 years ago

conartist6 commented 2 years ago

Right now I have problems with comment attachment caused by my interchangeable references. Let's say you have this code:

import {
  // one
  one,
  // two,
  two
} from 'numbers';

And you write a codemod:

node => node.specifiers.unshift(t.ImportSpecifier(t.Identifier('zero'), t.Identifier('zero')));

The problem is that now when you go to reprint you'll end up with:

import {
  // one
  zero,
  // two,
  one,
  two
} from 'numbers';

That's essentially a design feature of our references, but here it hurts us. They are too simple!

conartist6 commented 2 years ago

This can also happen with whitespace:

{
  statement1;

  WHOA();

  statement2;
}

might become:

{
  statement0;

  statement1;

  WHOA();
  statement2;
}
conartist6 commented 2 years ago

Another nasty formulation:

{
  fix1();
  fix2();
  // Everything is fixed! 
}

And perhaps now only one fix is required, but removing the second call invalidates the part of the tokens array that holds the second comment.

{
  fix1();
}
conartist6 commented 2 years ago

It should be noted that we already provide the best remedy possible to any of these issues: the individual writing the codemod can tell us what they meant by updating the tokens array themselves. We could trivially offer helpers for tasks like appending or prepending into an array-valued property's array while also appending or prepending a new reference into tokens to ensure that none of the off-by-ones occur. We'd probably need to fork some common helpers like add-module-imports to do this. In the long run I suspect the approach of building an ecosystem of formatting-aware tools is the only way to create really good outcomes. The essential choices users of those tools make will be reduced to configuration as opposed to manual tokens array manipulation.
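To make the idea concrete, here is a minimal sketch of one such helper (the name `unshiftWithReference` and the token shapes are illustrative, not a real API): it prepends a child into an array-valued property while also splicing a matching Reference token into the tokens array, so no off-by-one can occur.

```javascript
// Hypothetical helper: prepend `child` into node[name] while keeping
// the tokens array in sync by inserting a matching Reference token.
function unshiftWithReference(node, name, child) {
  node[name].unshift(child);

  const refToken = { type: 'Reference', value: name };
  // Insert just before the first existing reference to this property,
  // so any trivia before that point stays where it was. A smarter
  // helper might let the caller pick the insertion point to control
  // comment attachment.
  const idx = node.tokens.findIndex(
    (tok) => tok.type === 'Reference' && tok.value === name,
  );
  if (idx === -1) {
    node.tokens.push(refToken);
  } else {
    node.tokens.splice(idx, 0, refToken);
  }
  return refToken;
}
```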

conartist6 commented 2 years ago

I am most concerned about shifting comments around. People are going to want their comments to stay attached to things. They will say that the output isn't just weird when the comments shift, it's WRONG.

...and I don't want to destroy one of the best characteristics of my solution and of recast -- that it allows you to manipulate the AST without worrying about formatting except where strictly necessary.

I think the only possible "magical" solution is to have the reference tokens have some way of retaining information about exactly which node they were referencing. Then you could go through and basically do a diff in order to figure out where references were added, removed, or reordered. You could then use that information to avoid entering fallback even when the tokens array doesn't quite match the current state of the AST.

I'm against including the reference itself in the reference token object in such a way that it would be accessible to users. I think the best way to encode the data would be as WeakMap of nodes keyed on reference token. Then your basic logic would be something like:

diff(
  node.tokens
    .filter(token => token.type === 'Reference' && token.value === 'specifiers')
    .map(ref => nodesByReferenceToken.get(ref)),
  node.specifiers
)
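Fleshing that fragment out a little (helper name hypothetical, token shapes assumed): recover the original child order from the reference tokens via the WeakMap, then compare it against the current array to classify what changed.

```javascript
// Sketch: classify changes to an array-valued property by comparing
// the original child order (recovered from Reference tokens) with the
// current contents of node[name].
function diffReferences(node, name, nodesByReferenceToken) {
  const original = node.tokens
    .filter((tok) => tok.type === 'Reference' && tok.value === name)
    .map((ref) => nodesByReferenceToken.get(ref));
  const current = node[name];

  const added = current.filter((child) => !original.includes(child));
  const removed = original.filter((child) => !current.includes(child));

  // The surviving nodes were reordered if their relative order changed.
  const kept = original.filter((child) => current.includes(child));
  const keptInCurrentOrder = current.filter((child) => kept.includes(child));
  const reordered = kept.some((child, i) => child !== keptInCurrentOrder[i]);

  return { added, removed, reordered };
}
```

With that classification in hand, pure insertions and deletions can be reprinted without entering fallback, since every surviving reference token still points at the node it originally referenced.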
conartist6 commented 2 years ago

I think this is pretty necessary for real users, adds significant value, and is an ideal kind of logic to design as part of the prototyping phase. Let's see if I can do it!

conartist6 commented 2 years ago

Thinking about this some more. What if I do something different? What if I always push optional comments and whitespace downwards instead of upwards? For example when the code is [/*a*/a,/*b*/b] I use node.range to determine which parts of the text belong to child nodes. Parsers will tell you that /*a*/ is a token of the array expression, but a is a token of an identifier child. I think it would make sense to renormalize the data such that /*a*/ is a token associated with the identifier not the array expression. Then that identifier could be reordered or moved any which way and its comment would follow it.

BUT. We want to be careful. If all whitespace belongs to children, this will also not be good. For example when a transform swaps a and b in [a, b] the result would be [ b,a] instead of [b, a]. Maybe this isn't so bad though! Any pretty-formatter is capable of taking care of fixing it up, and usually rules about such spacing are defined project wide.
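The renormalization described above might look something like this sketch (token model and the `getChild` lookup are assumptions, not real API). Comments immediately preceding a child's reference token are moved into that child's own token list so they travel with it; whitespace is deliberately left in the parent, per the caveat above.

```javascript
// Sketch of comment pushdown: a Comment token directly preceding a
// Reference is moved to the front of the referenced child's tokens.
// `getChild(node, refToken)` resolves a Reference token to its child.
function pushLeadingCommentsDown(node, getChild) {
  const kept = [];
  let pending = [];
  for (const tok of node.tokens) {
    if (tok.type === 'Comment') {
      pending.push(tok); // hold: it may belong to the next child
    } else if (tok.type === 'Reference') {
      const child = getChild(node, tok);
      child.tokens.unshift(...pending); // attach held comments to child
      pending = [];
      kept.push(tok);
    } else {
      // Anything else (including whitespace, in this simplified
      // sketch) breaks the attachment and stays in the parent.
      kept.push(...pending, tok);
      pending = [];
    }
  }
  kept.push(...pending);
  node.tokens = kept;
  return node;
}
```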

conartist6 commented 2 years ago

Ah, but I have been thinking specifically about prettier and its object formatting. I'd essentially never want to split up any of these patterns:

conartist6 commented 2 years ago

I think if I can do that -- push everything but those patterns into children when possible -- the result would be excellent in most common cases without much additional work.

conartist6 commented 2 years ago

A case to consider:

o = {
  foo, // foo
  bar, // bar
}

This is unpleasant to deal with because those comments can't possibly be tokens for the foo identifier, but they are also expected to stay attached to foo.

conartist6 commented 2 years ago

I could store the illegal structure in my CST. For reasons that should be obvious you can't work with this directly as text, but you can store it and use it as an intermediate step:

o = {
  foo // foo,
  bar // bar, 
}
conartist6 commented 2 years ago

I don't think I can endorse that. The data structure needs to be interoperable. With luck it could even become part of the ESTree spec. Storing tokens out of order like that and demanding that everyone else handle it would seriously damage the value of the whole structure.

conartist6 commented 2 years ago

We could ensure that those line comment tokens are not pushed down, which leaves us to build both pushdown and diffing. Could be better, could be worse I guess.

conartist6 commented 2 years ago

We could just convert them from line to block comments on the fly...

o = {
  foo /* foo*/,
  bar /* bar*/,
}

I don't think people would want to use the tool unless we somehow knew to convert them back before printing.

gibson042 commented 2 years ago

Another similar but less-common case:

// Comment about `if (cond1) { … }`.
if (cond1) {
    …
// Comment about `if (cond2) { … }`.
} else if (cond2) {
    …
// Comment about `else { … }`.
} else {
    …
}

I think representing such relationships is technically out-of-scope for a syntax tree, but would support a convention in which they appear as extension data, ideally with JSON interchange compatibility (e.g., "tokens": […], "tokenAttachments": [{ "type": "RelativeReference", "selector": "~[1:3]" }][^1]). But it would be the responsibility of manipulators to respect such references and keep them synchronized.

[^1]: Where ~ is an array of all tokens following the context node and [1:3] is a slice of its elements from index 1 inclusive to 3 exclusive—corresponding, for foo, // foo, to the Whitespace and Comment nodes following the immediate-successor comma Punctuator node at index 0. Unfortunately, referencing siblings and/or parents seems to require inventing syntax, because JSON Pointer can't do either while JMESPath, JSONPath, and even CSS are too powerful (and even so, the latter two still lack parent references).

conartist6 commented 2 years ago

I think instead of extension data I'm more likely to offer some CST-aware APIs for common operations like removeNode. If you are removing a node that is an alternate it can check for any trailing comments in the parent's consequent and remove them.

The good news is this means I can make the calls configurable, like removeNode(node, { noRemoveComments: true }).

conartist6 commented 2 years ago

OK I've given this more thought.

First. A line comment on its own line as part of a statement list is ambiguous. It could be about the next line as with:

// describe conditionA
if (conditionA) {
// describe conditionB
} else if (conditionB) { 
}

or it could be about the state of the program at some moment in time as with

transitionA(state);
transitionB(state);
// state is now C

So there's definitely nothing I can do with that, and the best thing is to think of it like it's its own statement, like an assert. There will be some weird results sometimes when comments that should've been attached are treated as independent, but I'm not creating that problem, I just can't solve it here. It may still be possible to provide excellent tools to identify locations in the diffs generated by codemods which could point out places where statements next to line comments were removed.

conartist6 commented 2 years ago

For initialization contexts like import specifiers, parameter definitions, and object and array literals the fundamental ambiguity is whether a line comment is about one line, or a group of lines that follows. E.g.

o = {
  // a
  a,
  // b
  b,
}

vs

o = {
  // a's
  a1,
  a2,
  // b's
  b1,
  b2,
}

Again I think the only sane thing to do is to consider the line comment to be part of the parent. We risk leaving comments which describe something that no longer exists, but I still think it's fine to say that that's just an issue that would have to be caught in code review.

conartist6 commented 2 years ago

Also I think I was wrong to say that we can have the transforming code handle it. For there to be a useful ecosystem of transforms it will need to be possible for writers of transforms to describe changes to codebases whose formatting they will not know anything about.

conartist6 commented 2 years ago

When line comments are on the same line as a statement, I think it is reasonable to consider those comments to be attached to that statement, even if there is an intervening comma. I guess I'm leaning towards just pushing them into the tree out of order, even though I said I thought it would be bad.

The fact is, this tree is going to be messy. It's going to contain things that may be invalid or that don't quite make sense. For example it will not be uncommon for an altered tree to contain line comment tokens which are not followed by any newline token. Printing the tree is never going to be as simple as concatenating the tokens, so maybe it's best to embrace that.

If I attach line comment tokens to the wrong side of the comma, I can work it out at print time. Then the primary difficulty is just in offering some APIs that help you query a "normalized" tree structure when what you actually have is a non-normal tree (one with inconsistencies).

conartist6 commented 2 years ago

I think it's time to design the traversal API, which is the only API that will be able to answer queries about lines, since it will have context about the parent and thus about sibling tokens. Its path object can offer the APIs for on-the-fly tree normalization.

conartist6 commented 2 years ago

Maybe I just have control inverted. I've discussed two main strategies, diffing and pushdown. The library definitely needs to implement diffing. It's crucial to get right in order to support the most basic and common operations.

Pushdown though... Comments are just pretty fundamentally ambiguous. It's pretty clear that sometimes it's important for certain comments to remain stuck to certain nodes, but it's not always clear what those cases are. The best examples I can think of are JSDoc comments. I'm pretty sure that in almost every case it makes sense for JSDoc comments to be pushed down into the nodes they describe, and yet I don't really want to design my library such that it knows about JSDoc.

No, I think the thing to do is to make pushdown one of the fundamental types of operations this library supports doing on the tree. Since our intention is to treat trees as mutable, we can just return a CST, then the holder of that tree can call back in triggering the pushdown rules they wish to trigger, e.g.:

import { rebuildTokens, pushTokensDown, print } from 'cst-tokens';
import JSDocPlugin from '@cst-tokens/plugin-pushdown-jsdoc';

const cst = /* ... */;

rebuildTokens(cst);

pushTokensDown(cst, [JSDocPlugin]);

transform(cst);

This ensures that we don't churn the library core on formatting rules, it allows each project to specify the pushdowns that actually make sense for the way their code is written, any configuration is possible, and the transforms themselves don't need to know. If a transform needs to remove a node, it just removes it. If the environment had pushed comment tokens into that node, then those tokens are gone! Everyone gets what they want, and I especially get what I want in not having all of it be my problem!
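The plugin interface isn't designed yet, so this is purely a guess at its shape: a pushdown plugin could be as small as a named predicate deciding, per comment token, whether it should be pushed down into the node it precedes.

```javascript
// Hypothetical pushdown plugin shape (not a real cst-tokens API):
// a predicate the pushdown pass consults for each comment token.
const JSDocPlugin = {
  name: 'pushdown-jsdoc',
  // Push a block comment down into the following node when it looks
  // like a JSDoc comment (starts with `/**`).
  shouldPushDown(commentToken) {
    return (
      commentToken.type === 'Comment' &&
      commentToken.value.startsWith('/**')
    );
  },
};
```

A project that documents functions with JSDoc would enable this plugin; one that uses doc comments differently simply wouldn't, and the library core never needs to know what JSDoc is.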

conartist6 commented 2 years ago

All in all I am hoping it would be possible to write something like this:

import { readFile, writeFile } from 'fs/promises';
import { rebuildTokens, pushTokensDown, print } from 'cst-tokens';
import JSDocPlugin from '@cst-tokens/plugin-pushdown-jsdoc';
import TypeCommentPlugin from '@cst-tokens/plugin-pushdown-type-comment';
import PureCommentPlugin from '@cst-tokens/plugin-pushdown-pure-comment';
import prettier from 'prettier/cst'; // wishful thinking
import eslint from 'eslint/cst'; // more wishful thinking

async function codemod(path, transform) {
  const cst = someParser.parse(await readFile(path, 'utf-8'));

  rebuildTokens(cst);

  pushTokensDown(cst, [JSDocPlugin, TypeCommentPlugin, PureCommentPlugin]);

  transform(cst);

  eslint(cst, { fix: true, quiet: true });

  prettier(cst);

  await writeFile(path, print(cst));
}
conartist6 commented 2 years ago

There's some interesting discussion of the issues here: https://github.com/rome/tools/issues/2768

conartist6 commented 2 years ago

OK OK OK I can do it! Pushdown comment attachment will still be the only way to keep comments attached during arbitrary moves. In the specific instance of adding, removing, and reordering items in an array however I should be able to keep comments and whitespace attached without needing any extra configuration, which is ideal.

I can do this because the current internal return from traverse is a match tree, currently represented as [rootMatchNode, matchNodesByRef]. These two pieces of information constitute a tree. This is because matchNodesByRef is a WeakMap whose keys are ref tokens (matchNode.cstTokens.filter(tok => tok.type === 'Reference')) and whose values are matchNodes. Each matchNode also includes the original node it pointed to in matchNode.node.

This is perfect since it captures the original contents of the array in the match tree so that they can be retrieved like this:

matchNode.cstTokens
  .filter(tok => tok.type === 'Reference' && tok.value === name)
  .map(ref => matchNodesByRef.get(ref).node)

This works due to our fundamental assumption that arrays of elements are in source order -- an assumption that already exists due to the reference-counting nature of RefResolver, and one that seems logical for parser output.

Once we have the current array stored in node[name] along with the reconstructed copy of its contents from the initial parsing, we can make a map of the nodes and determine the order in which to emit the tokens we have so that whitespace and comments (which the parser always bubbles upward to preserve their ambiguity) stay with the nodes they originally described. I'm glossing over some significant complexity there though, because we'll have to know which tokens belong with which references. That means we'll have to know which tokens delimit list items. JS uses commas, semicolons, and newlines.
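The delimiter problem above can be sketched as a grouping pass (helper name and token shapes are illustrative): split the token stream for an array-valued property into per-item segments at separator punctuators, so that trivia in a segment stays with that segment's reference when items move.

```javascript
// Sketch: group a token stream into per-item segments, splitting after
// each separator punctuator (commas here; semicolons and newlines
// would be handled the same way for other JS list contexts).
function groupTokensByItem(tokens, separators = [',']) {
  const groups = [[]];
  for (const tok of tokens) {
    groups[groups.length - 1].push(tok);
    if (tok.type === 'Punctuator' && separators.includes(tok.value)) {
      groups.push([]); // a separator closes the current item's segment
    }
  }
  if (groups[groups.length - 1].length === 0) groups.pop();
  return groups;
}
```

Note this naive version attaches a trailing same-line comment to the *next* item's segment, which is exactly the wonky case prettier handles by weighting line breaks over separators.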

The thing is, sometimes the rules are a bit wonky. Prettier for example changes

call(
  arg1, /* comment */
  arg2,
)

into

call(
  arg1 /* comment */,
  arg2,
)

This implies that to avoid breaking that behavior we'll want a prettier pushdown plugin which applies its logic about line breaks being more important than element separators in comment attachment.

conartist6 commented 2 years ago

matchNode.cstTokens === matchNode.node.cstTokens after an updateTokens operation, so this should also work fine in conjunction with CST-aware tools. Those tools would update both node[name] and node.cstTokens when executing a reorder of their own, which would make that move invisible to the diffing system.
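As a sketch of what such a CST-aware reorder might look like (helper name hypothetical): swap two children while also swapping their Reference token objects in cstTokens, so the token identities the WeakMap is keyed on move with the children and the diffing system sees nothing.

```javascript
// Hypothetical CST-aware reorder: swap children i and j of an
// array-valued property and swap the corresponding Reference token
// objects in cstTokens, keeping the tree self-consistent.
function swapChildren(node, name, i, j) {
  const children = node[name];
  [children[i], children[j]] = [children[j], children[i]];

  // Find the positions of this property's Reference tokens in order.
  const refIdxs = [];
  node.cstTokens.forEach((tok, idx) => {
    if (tok.type === 'Reference' && tok.value === name) refIdxs.push(idx);
  });

  const [ti, tj] = [refIdxs[i], refIdxs[j]];
  [node.cstTokens[ti], node.cstTokens[tj]] =
    [node.cstTokens[tj], node.cstTokens[ti]];
  return node;
}
```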

conartist6 commented 2 years ago

I've changed tack once again. The more I got into the mechanics, the clearer it became that diffing was going to be insane. There are just too many situations in which it's not clear what should happen to comments and whitespace. Since my goal is to eliminate insanity, I've eliminated diffing.

Instead I am changing the core premise of the library. Instead of working with malformed CSTs, the library now always validates that CSTs are well-formed.

This means that the only comment attachment mechanism I need is pushdown which, conveniently, requires relatively little core architecture.