benjamn / recast

JavaScript syntax tree transformer, nondestructive pretty-printer, and automatic source map generator
MIT License
4.98k stars 348 forks

Recast needs additional maintainers #1086

Closed conartist6 closed 2 years ago

conartist6 commented 2 years ago

@benjamn I am trying every way that might be possible to get your attention, because if I can't I'm forking the library.

conartist6 commented 2 years ago

I think I'm just going to start work on the fork. It's macrome-js/macrome that I'm building which fundamentally relies on recast, so perhaps macrome-js/recast and @macrome/recast would be appropriate.

conartist6 commented 2 years ago

Currently my organizations consist solely of me, but I've created them in the explicit hope that I can bring others in and form communities of trust and shared responsibility.

conartist6 commented 2 years ago

I'm going to give this one day. Let's see how far out ahead I can get my fork in just one day of work.

eventualbuddha commented 2 years ago

@conartist6 I suggest talking with @benjamn about helping maintain recast. If you have a stake in it, then you're clearly more motivated.

conartist6 commented 2 years ago

I've been trying rather aggressively, but no luck. I think the most helpful thing I can do is make my fork and show him what my direction would be if I had control over the package. I'm working on that right now.

I'm getting my fork ready to be a 1.0.0 release. I'm dropping support for everything that isn't absolutely essential.

I'm going to kick all the parsers out of the repo as they really need to be their own packages in order to configure deps correctly. I'll recommend babel-plugin-recast for babel users. No more one-size-fits-all parser options either.

I'm stripping away everything that isn't what the library is about to reduce the work involved documenting, maintaining, and versioning. No more run or runFile or runString. That is for other tools and libraries to build (and document). I'm not even exporting ast-types anymore because I don't want to merge those API surfaces and have to worry about which version is exported.

conartist6 commented 2 years ago

Or maybe I should just change parsers to a different signature, something like parser => parser. I don't think recast should have the responsibility of actually requiring the parser, it just needs a way to let users specify parser options they want while recast dictates certain parser options that it needs.
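A minimal sketch of what that could look like. Nothing here is a real recast API: `wrapParser`, `fakeParse`, and the option names are all made up for illustration. The point is just that recast never requires the parser itself; the user supplies a parse function, and recast layers the options it needs over whatever the user specified.

```javascript
// Hypothetical: recast wraps the user's parse function, dictating the
// parser options it requires while passing the user's options through.
function wrapParser(userParse) {
  const requiredOptions = { tokens: true }; // options recast would dictate
  return (source, userOptions = {}) =>
    userParse(source, { ...userOptions, ...requiredOptions });
}

// Usage with a stand-in parse function:
const fakeParse = (source, options) => ({ type: 'Program', source, options });
const parse = wrapParser(fakeParse);
const ast = parse('x;', { sourceType: 'module' });
// ast.options contains both sourceType: 'module' and tokens: true
```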

conartist6 commented 2 years ago

I've also ripped out the dependency on acorn as a default parser. I just error now if your parser doesn't define tokens. I think most people will want to use @babel/parser, but I don't think recast should have dependencies on any particular parser at all. That probably means that recast shouldn't have a parse method at all but rather a wrapParse method. Again, stripping away the magic makes the library that much easier to document and maintain.

conartist6 commented 2 years ago

Here's how I imagine usage looking roughly:

import { parse } from '@babel/parser';
import { wrapParseWithFormatting, printFormatted } from 'recast';

const parseFormatted = wrapParseWithFormatting(parse, 'babel');
const ast = parseFormatted(source, options);
// ast mutations
return printFormatted(ast);

conartist6 commented 2 years ago

Replacing delete ... with ... = undefined as delete is a known perf-killer.
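For illustration, the replacement looks like this. The perf claim is about engines that use hidden classes (e.g. V8), where delete changes an object's shape and deoptimizes property access; note the small semantic difference that the key remains present after assigning undefined.

```javascript
// Before: `delete` changes the object's shape, which can deoptimize
// property access on hidden-class engines.
const withDelete = { a: 1, b: 2 };
delete withDelete.b;

// After: assigning undefined keeps the shape stable. Semantic difference:
// the key still exists, so `'b' in withUndefined` remains true.
const withUndefined = { a: 1, b: 2 };
withUndefined.b = undefined;
```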

conartist6 commented 2 years ago

Making the tests JavaScript instead of TypeScript. // @ts-check ensures they still get some basic benefits from TypeScript checking while devs get a few fewer things to worry about.
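For anyone unfamiliar with the pragma, a plain .js file checked this way looks like the following; JSDoc annotations stand in for TypeScript type syntax, and the file still runs as ordinary JavaScript.

```javascript
// @ts-check
// The pragma above opts this .js file into TypeScript's checker
// (via `tsc --checkJs` or an editor's language service).

/** @param {string} name @returns {string} */
function greet(name) {
  return `hello, ${name}`;
}

// greet(42) would be flagged by the checker, but at runtime this is plain JS.
const greeting = greet('world');
```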

conartist6 commented 2 years ago

TypeScript doesn't understand self-references in .js files (bug filed), and I'd prefer having those to having TypeScript. I think cleaning up the tests is going to be the biggest chunk of this work. The test cases need to be reorganized so that as many test cases as possible are expressed in a manner that is agnostic to the specific parser and AST structure being used.

conartist6 commented 2 years ago

Actually I'm going to do the minimum amount of work necessary to get the test suite to pass then push, merge and push my fix for the parens issues, and continue working. I don't want to fall into the trap of verifying behavior which is wrong.

conartist6 commented 2 years ago

Another thing I'm thinking about is that there's just very little left that recast is needed for at all. Recast has a lot of pretty-printing logic baked into it, and these days all that really belongs in prettier. All that recast really does in an environment with pretty-printing is ensure that certain line breaks remain untouched, because prettier makes certain decisions about retaining original formatting by comparing AST node locations to the source string.

I think the best thing to do for the long run would be to help prettier avoid dependence on the original source text so that the workflow would be:

And then we get to deprecate recast! I've created an issue to discuss the changes to prettier that would be necessary.

eventualbuddha commented 2 years ago

I can't say I disagree with most of your thoughts here. I also wonder about the future of recast in a world where many (most?) use Prettier.

That said, I decided to spend a little time getting things in slightly better shape and released v0.21.1: https://github.com/benjamn/recast/releases/tag/v0.21.1. If there's any specific PR you want me to prioritize, let me know and I'll hopefully get to it this week.

conartist6 commented 2 years ago

Thank you!! Are you interested in my putting together PRs for the API changes I mentioned as being a potential 1.0.0? Even if the long term is to move away from recast, I think there's a strong argument to shipping a 1.x branch first. Don't wanna end up on the 0ver list!

conartist6 commented 2 years ago

Also would it be possible to get at least some basic perms on this repo, like triage? If I had that I could take a proper pass through the open bugs and see what's legit.

gnprice commented 2 years ago

I just wanted to chime in that I really appreciate the work that goes into maintaining recast :heart:

Even in a possible future where the Prettier maintainers respond enthusiastically to prettier/prettier#12709 and Prettier grows the ability to support writing codemods, there will be a lot of codebases out there that don't use Prettier. I appreciate being able to transform that code without completely reformatting it, meeting the code where it is. Recast is a focused tool that does the job of enabling transforms without making a lot of assumptions, and I think there's a lot of value in that.

gnprice commented 2 years ago

On the subject of specific PRs, I would appreciate a look at #1089 and #1090 :-). Those are new -- I just sent them this week.

@conartist6 to your thought at https://github.com/benjamn/recast/pull/1068#issuecomment-1112209541 about test cases, I quite like the structure of test/flow.ts after #1089 (and it's mostly not my doing, I only made small changes there.) An individual test case there takes just a line or two to express, and I think that helps a lot in being able to write a thorough set of test cases. So you might find that part interesting to take a look at.

One thing that might be productive for someone to do would be to put some of the other tests into the same style, and then just churn out a few dozen more one-line test cases for areas that seem to be tricky, like extra/missing parens. (Hmm, in fact test/parens.ts is the other file that's closest to that structure already -- most of it is basically exactly like the existing version of test/flow.ts, which is already quite good for adding more cases. So that's convenient.)

conartist6 commented 2 years ago

One thing's for sure, both recast and prettier are super high-maintenance projects which need continuous fixes and improvements from a community of active maintainers to keep them healthy.

My impression is that both are fundamentally a bit unstable due to exceedingly high costs in two major dimensions:

Prettier is the more stable project just because it has so many more users and it is run so much more often.

The language changing is pretty much a constant, so what I think could use the most attention is some way to curb the cost of supporting so many different parsers. Does anyone else have ideas about this?

For now I'm going to reopen this issue as I think it's still relevant and there's productive discussion to be had about direction and combining efforts that have started independently.

conartist6 commented 2 years ago

One way to reduce maintenance costs would be to try to extract as much common functionality as possible between prettier and recast and share it. needsParens is a great example of something that would best be shared between the two libraries, if they could standardize on the arguments to the function and what its responsibilities are or are not. I'm sure there are more.

gnprice commented 2 years ago

I agree that there's fundamentally a bit of ongoing instability in that the language changes.

But I think the volume of those changes is not actually all that high. And when a feature is new, or especially when it's an early proposal, there's not a lot of code out there that uses it yet -- so for the great majority of users, it's perfectly fine if the tool isn't quick to gain support for that feature, because none of their code uses it. If the feature becomes widespread, someone will be motivated to add it.

One thing I've appreciated about Recast is that it's not a lot of code. My experience with #1089 and #1090, where I add or fix support for some less-widespread (because they're specific to Flow) language features, was that it wasn't a lot of work to do so.

(FWIW I think fixing something like where parens do or don't get added is inherently trickier, because it's logic that cuts across wide swathes of the language. But it's also not something where the language really changes over time. The existing code doesn't need a fix because the language changed, but rather because it wasn't quite right in the first place. A really spot-on solution for that, if/when we nail one down, shouldn't inherently have any ongoing need for maintenance.)

gnprice commented 2 years ago

I am curious about the possibility of simplifying by supporting fewer parsers. It looks to me (from looking in parsers/) like there are three different parsers -- Babel, Acorn, and Esprima. (Then three different choices of what Babel plugins to use, called babel, babel-ts, and typescript; plus babel has two aliases, flow and babylon. But I imagine those are all pretty well aligned with each other in what output they give, for the language features they have in common.)

What do the differences look like that add complexity to the Recast code?

I also don't know enough about the context of the various parsers to understand what value some people might be getting from different ones. Personally I believe I wouldn't miss anything if both the non-Babel parsers were dropped; but I don't really know the history of the different parsers, or what use cases people might have had in the past, and might still have, that the others serve and Babel didn't or doesn't.

conartist6 commented 2 years ago

These are excellent points. I'm not so worried about the language itself either. I am a bit more concerned about things like JSX, TypeScript, and Flow, which are not the language (but whose syntax you must handle in tools). These even complicate the experience of writing a codemod when using the babel AST, for example, because you may be writing a mod for a plain javascript codebase and the type checker will be telling you that you need to handle nodes for type annotations and JSX.

I'm going to dig more into what the differences are and whether there are better ways of abstracting them away. Prettier has a postprocess step that does a lot of mangling to try to clean up some of the main differences. That's actually another bit of code that prettier could conceivably share with recast.

conartist6 commented 2 years ago

Also, after having read some more of prettier's logic, I think the biggest thing to abstract would be a representation of the token stream. Recast demands that it be given tokens in order to work (though I haven't yet found the locations where they are used). Prettier already essentially relies on being able to treat the source text like a token stream, looking around for particular tokens like { or ( or =>. Prettier also doesn't trust the parser's representation of comments at all, instead preferring to grab the complete comment text from the source by slicing it using the start/end of the comment node.

Since I'm always thinking in iterables, I think it might be possible to write a token generator which can generate tokens by traversing an AST. That way it would still be possible to work with an AST but easily write rules that involve source ordering. Also, one of the benefits of writing a generator is that without the need to store the data you can parameterize the way you generate it, e.g. as function* nextTokens(node, includeWhitespace = false) { /*...*/ }. I think such a method could power most of the checks that prettier currently needs to make against the source text.
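A toy sketch of that generator idea, using made-up node shapes rather than any real parser's AST, just to show how traversal order becomes token order:

```javascript
// Yields tokens for a tiny AST in source order; whitespace emission is
// parameterized rather than stored. Node and token shapes are illustrative.
function* nextTokens(node, includeWhitespace = false) {
  switch (node.type) {
    case 'BinaryExpression':
      yield* nextTokens(node.left, includeWhitespace);
      if (includeWhitespace) yield { type: 'Whitespace', value: ' ' };
      yield { type: 'Punctuator', value: node.operator };
      if (includeWhitespace) yield { type: 'Whitespace', value: ' ' };
      yield* nextTokens(node.right, includeWhitespace);
      break;
    case 'Identifier':
      yield { type: 'Identifier', value: node.name };
      break;
  }
}

const binExpr = {
  type: 'BinaryExpression',
  operator: '+',
  left: { type: 'Identifier', name: 'a' },
  right: { type: 'Identifier', name: 'b' },
};
const text = [...nextTokens(binExpr, true)].map((t) => t.value).join('');
// text === 'a + b'
```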

conartist6 commented 2 years ago

Oh wasm is another thing that I learned (from reading prettier sources) introduces some new syntaxes that some parsers handle and some do not, e.g. import {"string" as name} from "source".

conartist6 commented 2 years ago

OK, I think I feel pretty confident though that prettier should be my starting point. Prettier's starting point was recast, and they've come a very long way since then. I want to adopt their solutions to the problems that recast still struggles with. I'm finding it very interesting to look at some of the oldest history of prettier. Here is what is essentially its initial commit. Some of the content of that readme documents its mechanism, documentation that does not exist for current versions even though the mechanism remains the same!!

benjamn commented 2 years ago

@conartist6 I’m here. You have my attention. I’m sorry I haven’t been able to prioritize a proper response until now.

Would you mind summarizing what you need from Recast in order to use it without having to fork it (also fine/allowed, just hopefully unnecessary)?

conartist6 commented 2 years ago

Oh excellent, hello! I'm not ready to say yet. I want to see what I find as I continue my deep dive into prettier. It makes a lot of sense to me to try to cut down on the duplicated logic between the two libraries so that recast's functionality actually benefits from the maintenance efforts that go into prettier.

I'm "the iterable guy" so I'm currently thinking that it might be a powerful abstraction for both libraries to define printing as defining an iterable of tokens. The great thing about that abstraction is that it's easy to define transformations over the stream. For example, formatted printing is just the process of inserting whitespace between the tokens. Both recast and prettier essentially consist of algorithms for determining how much whitespace goes where, so I think it would be very powerful to have an abstraction that captured that.

EDIT: ok prettier's data structure is a tree. But that doesn't mean recast couldn't use it.

conartist6 commented 2 years ago

I'm curious to hear from you if you think sourcemaps are still a major part of recast's value proposition... I'm personally focused on making it easier and more reliable to write codemods, so I couldn't care less if the solution I come up with supports sourcemapping.

eventualbuddha commented 2 years ago

@conartist6 as the author of https://github.com/codemod-js/codemod, I'm also interested in making codemods easier. I'm curious to hear your thoughts on it, though I've been thinking about it less from an "under the hood" perspective and more the API for end users to write the codemod itself.

conartist6 commented 2 years ago

@eventualbuddha Oh cool, taking a look. I'm on the same page with you about babel's APIs being the ones that you really need to be able to do serious codemodding work. You might be interested in what I was working on immediately before this.

My main project that's directly relevant is macrome, a build system that allows you to keep generated output continuously up to date even as input and the transforms themselves change. Its aim is to provide robust clean and watch operations that "just work", and to help users bake all the magic out of their code so that they have the complete logic of their application expressed in plain, static javascript (maybe even checked into VCS). No more complicated bundler configs to parse all that wacky fake syntax!

Imagine being able to use CSS modules, except that instead of just being magic syntax, it actually generates the file containing the mappings between importable selectors and their secret hashes.

Or imagine being able to use macros to inject a stable ID into your source code. You could write code like this:

import uid from '@macrome/uid.macro';
translation(uid``, 'description');
// and the build system would change it to:
translation(uid`e9af0`, 'description');

This solves so many problems around knowing what is and isn't the same translation across multiple versions of the code without offering any particular opinion about how you build your translation system. It would also be very relevant for generating stable error codes that could be used to link to a knowledgebase!

conartist6 commented 2 years ago

Long story short, I don't want to end up maintaining recast. In order to be able to execute on such an ambitious vision, I need modifying and printing code to also "just work" so that I can build on top of it without spending a lot of my time on maintenance of this library.

conartist6 commented 2 years ago

OK, I think I've got my recommendation. It is: embrace concrete syntax trees by adding node.tokens to every node. To allow nesting without duplicating the existing tree structure, intersperse the existing token types with a Reference token type whose value is the name of another property of the node. This ensures a single source of truth, and will allow me to do a lot of work for everyone in handling edge cases.

Let's say you had a source like this: (astexplorer)

if (true) {
  consequent;
}

In my CST representation it would look like:

{ // = node
  "type": "IfStatement",
  "test": {
    "type": "Literal",
    "value": true,
    "tokens": [
      { "type": "Boolean", "value": "true" }
    ]
  },
  "consequent": {
    "type": "BlockStatement",
    "body": {/*...*/},
    "tokens": [
      { "type": "Punctuator", "value": "{" },
      { "type": "Whitespace", "value": "\n  " },
      { "type": "Reference", "value": "body" }, // interpret as node.body.tokens
      { "type": "Whitespace", "value": "\n" },
      { "type": "Punctuator", "value": "}" }
    ]
  },
  "tokens": [
    { "type": "Keyword", "value": "if" },
    { "type": "Whitespace", "value": " " },
    { "type": "Punctuator", "value": "(" },
    { "type": "Reference", "value": "test" },
    { "type": "Punctuator", "value": ")" },
    { "type": "Whitespace", "value": " " },
    { "type": "Reference", "value": "consequent" }
  ]
}

I think I can provide the equivalent of @babel/traverse over this structure in such a way that I can make the AST the real source of truth if ever it should disagree with the tokens array. Thus the basic add, move, and remove operations would be covered. If you add a node, the tokens iterator would generate a minimal set of correct tokens from the AST. If you move a node, its tokens move with it. If you remove a node, its tokens are removed with it. If you get fancy with the inside of a node such that the tokens don't make any sense anymore, the default printer would ignore the garbage and generate new tokens. I can also provide operations to validate, rebuild, or strictly print the token data.
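To make the Reference mechanism concrete, here is a tiny sketch of a printer over this CST shape: walk node.tokens, and when a Reference token is found, recurse into the named property of the node. The node shapes mirror the IfStatement example above; nothing here is a real API of the proposed library.

```javascript
// Resolve Reference tokens by recursing into node[reference.value];
// all other tokens contribute their value verbatim.
function printCst(node) {
  let out = '';
  for (const token of node.tokens) {
    out += token.type === 'Reference' ? printCst(node[token.value]) : token.value;
  }
  return out;
}

const ifStatement = {
  type: 'IfStatement',
  test: {
    type: 'Literal',
    value: true,
    tokens: [{ type: 'Boolean', value: 'true' }],
  },
  consequent: {
    type: 'BlockStatement',
    body: {
      tokens: [
        { type: 'Identifier', value: 'consequent' },
        { type: 'Punctuator', value: ';' },
      ],
    },
    tokens: [
      { type: 'Punctuator', value: '{' },
      { type: 'Whitespace', value: '\n  ' },
      { type: 'Reference', value: 'body' },
      { type: 'Whitespace', value: '\n' },
      { type: 'Punctuator', value: '}' },
    ],
  },
  tokens: [
    { type: 'Keyword', value: 'if' },
    { type: 'Whitespace', value: ' ' },
    { type: 'Punctuator', value: '(' },
    { type: 'Reference', value: 'test' },
    { type: 'Punctuator', value: ')' },
    { type: 'Whitespace', value: ' ' },
    { type: 'Reference', value: 'consequent' },
  ],
};
// printCst(ifStatement) reproduces the original source text exactly
```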

The idea would be to provide a rich API, document it, and then see if I could get enough momentum to perhaps merge with @babel/traverse and slog through the process of refactoring prettier.

At the end though you'd have exactly what I want: a way that things can interoperate and "just work" without the explicit need to maintain a library like recast.

I'm going to work on building the library here. It's gonna be a bit of a bear because it actually needs what is essentially a custom parser over a token stream so that it understands node-by-node both how to generate tokens and how to verify that the existing tokens constitute legal syntax. Fortunately I've been experimenting with various kinds of syntax matching architectures...

conartist6 commented 2 years ago

Here's a "little" piece of code that shows how I think a token management engine would look and work:

EDIT: I have moved the prototype to a gist since I keep changing it.

conartist6 commented 2 years ago

Hmm, and maybe I could have one more kind of builder, SourceBuilder which would use loc or range information to construct tokens from the source text during initial parsing before any AST modifications have begun (much like prettier assumes it is safe to do). That way this approach would be compatible with any existing parser, as recast is.
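A sketch of that SourceBuilder idea, assuming only that the parser provides range info (as ESTree-style parsers do); proper token classification is omitted, since the point is just that token values can be recovered from the source text before any mutation happens:

```javascript
// Hypothetical: build a node's initial token content by slicing the
// original source at the node's range, prettier-style.
function tokensFromSource(source, node) {
  const [start, end] = node.range;
  return [{ type: 'SourceSlice', value: source.slice(start, end) }];
}

const source = 'if (true) {}';
const literalNode = { type: 'Literal', range: [4, 8] };
// tokensFromSource(source, literalNode)[0].value === 'true'
```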

conartist6 commented 2 years ago

Well, that's it. I don't want to burden this thread with all my dev work. I talk a lot. I think this shows pretty clearly a concrete proposal -- essentially a rearchitecture of recast. I could call it something else, or I would also be perfectly happy if there was interest in publishing this under the recast name. I'd sure love to have help building it. My hope would be that building this would make it possible for prettier to eventually depend on recast again, as this structure supports all the kinds of queries prettier needs to make. If this did happen, I don't think recast would be at a loss for maintainers anymore.

conartist6 commented 2 years ago

I really like it because it'd be an implementation of recast with absolutely no logic about pretty-printing, the incomplete implementation of which is one of recast's current maintenance burdens.

Also in addition to making a best effort to preserve formatting, it permits easy creation of formatting-aware transforms: those which also manipulate the tokens array in such a way that it remains valid for the given AST.

conartist6 commented 2 years ago

I now have a complete implementation for a small grammar in the repo. I know some of my examples contained magical thinking, but this doesn't. It works!

benjamn commented 2 years ago

This is a lot to digest 😅, but I want you to know I am digesting it @conartist6. In particular, I agree it would be convenient if Recast could output tokens rather than (or in addition to) a string of code.

Some initial thoughts, from a first glance:

Instead of providing a tokens property on every output AST node, which might overlap significantly with the tokens arrays of descendant nodes, I think it might make more sense to have one global list of output tokens, with each AST node indexing into that list with a property like node.tokenRange which holds numeric [start, end] pairs. That seems like it would be a lot cheaper (less redundant) overall, and still relatively easy to serialize to JSON (in case that matters)?
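For comparison, the global-list alternative could look roughly like this (illustrative names only, not a committed design):

```javascript
// One flat list of output tokens for the whole file...
const fileTokens = [
  { type: 'Keyword', value: 'if' },
  { type: 'Punctuator', value: '(' },
  { type: 'Boolean', value: 'true' },
  { type: 'Punctuator', value: ')' },
];

// ...and each node indexes into it with a numeric [start, end) pair,
// so no token data is duplicated between a node and its descendants.
const literalNode = { type: 'Literal', value: true, tokenRange: [2, 3] };

const nodeText = fileTokens
  .slice(...literalNode.tokenRange)
  .map((t) => t.value)
  .join('');
// nodeText === 'true'; the trade-off is that inserting a token shifts
// every tokenRange that starts after the insertion point.
```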

I really like it because it'd be an implementation of recast with absolutely no logic about pretty-printing, the incomplete implementation of which is one of recast's current maintenance burdens.

I agree keeping up with AST and parsing/printing changes is an ongoing chore, potentially never finished, and I know I've fallen behind. But I'm not sure I'm following how the CST output tokens idea would save Recast from needing a pretty-printer. Generating a stream of concrete output tokens (including whitespace and comments) seems pretty much equivalent to full pretty-printing, with all the choices that need to be made about indentation and parenthesization, etc. Some of the token ranges could be copied directly from the original source token stream, but the ones that Recast generates will need to be formatted by some sort of pretty-printing-like process, right? (Not a rhetorical question—please let me know if I'm misunderstanding something!)

In more general terms, I think it would be valuable if Recast could be configured to delegate pretty-printing to another library (as it can already delegate parsing), even if it still needs to maintain its own default pretty-printer. The challenge is that pretty-printing is only part of the conservative reprinting process, so the other printing library needs to have a certain kind of structure that allows alternating recursively between pretty-printing and just copying source code. Since Prettier was originally a fork of the Recast pretty-printer, it's probably the closest to having that structure already (passing a print callback to the genericPrintNoParens function, IIRC).

benjamn commented 2 years ago

In case this isn't already shared understanding: I strongly believe Recast users should not have to worry about manipulating tokens while doing their AST transformations. If we make Recast capable of generating output tokens, it needs to be an automatic process based entirely on inferred AST changes, to avoid increasing the implementation burdens of AST transforms.

One more question for async discussion: can we enumerate the benefits of having Recast generate tokens directly, rather than just tokenizing the output string that it already prints? I imagine it might be faster to skip the re-tokenization, though tokenization is fairly quick compared to full parsing. I ask because there might be a quick and dirty implementation here where Recast reuses its parser somehow to tokenize the output string. A little slower, perhaps, but it might be enough to unblock downstream experimentation, if output tokens are ultimately what you need?

conartist6 commented 2 years ago

@benjamn Thanks for taking the time to look through this all and start thinking about it! I know I have a tendency to race off ahead in my own direction, but it's the only way to find out in a concrete way what is possible and what problems would be encountered.

conartist6 commented 2 years ago

I really think it makes a lot of sense to have the tokens arrays on the nodes. You mention the possibility of tokenRanges, but that is essentially not much different than what we have now -- ranges into sourceText. In particular it means you'll be unable to do anything like inserting tokens because you'd upset all those indexes referencing into the array.

What I've built ensures that code transforms are able to add or modify tokens if they want to, but also ensures that they need not do so: a reasonable default set of tokens will be provided.

conartist6 commented 2 years ago

About deciding which printing algorithm to apply to different parts of the AST, have you seen this part of my code?

conartist6 commented 2 years ago

As for tokenizing an output string, I'm not really thinking in that direction. In the world I'm imagining there would be no need for a pure string representation until all the transformation work was done. That's why I also want prettier to offer a compatible formatCst(cst) method.

conartist6 commented 2 years ago

Something cool about the way my code works: it falls back from one printer to another. Suppose you have:

{
  key1: 'value1',
}

And you write a codemod whose purpose is to change that to

{
  key1: 'value1',
  key2: 'value2'
}

You do this by pushing to the ObjectExpression.properties array. Suppose you don't update the tokens attached to the ObjectExpression (they are: ['{', '\n ', 'ref:properties', ',\n', '}']). Now when you go to print the updated AST you're going to fail reprinting the existing tokens -- you'll find '}' where you need to see a second 'ref:properties'. This will force you into fallback, which generates only the necessary tokens, and the immediate result will be the ugly and unintentional:

{
  key1: 'value1',
  key2: 'value2'}

I agree that that looks bad, but it's actually great for feeding into a pretty printer (prettier, or one built into recast) because it retains the most crucial formatting decision: whether the object should be printed on one line or more than one, represented as whether '{' is followed by '\n'.

eventualbuddha commented 2 years ago

Where does this fallback decision happen? How does your code tell that the new ObjectProperty means the tokens array is outdated?

conartist6 commented 2 years ago

Fallback is in the builders: https://github.com/conartist6/cst-tokens/blob/248c6dd60934840c13cabb5019ef050c68ae96b4/lib/builders/tokens.js#L23

conartist6 commented 2 years ago

The tokens array only becomes outdated if the number of properties changes. That's because I use the simple { type: 'reference', value: 'properties' } token as a placeholder for both an AST node and its tokens. I don't do anything fancy like value: 'properties[0]'.

I can tell if the token stream is outdated using my grammar, which defines which combinations of tokens are valid. Notice that whitespace and comments are allowed, but for an object property or specifier, references, commas, and curly braces are ensured (optional = false), which is to say that a fallback would occur when the ensured reference token for the second property was not found (or the comma if there had been no trailing comma).
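An illustrative check in the spirit described above (a deliberately dumbed-down stand-in for the grammar-driven validation, not the real cst-tokens code): the tokens array is considered outdated when the number of reference placeholder tokens no longer matches the number of child nodes they stand for.

```javascript
// Count 'reference' placeholders for a given key and compare against the
// actual number of children; a mismatch forces fallback token generation.
function tokensOutdated(node, key) {
  const refCount = node.tokens.filter(
    (t) => t.type === 'reference' && t.value === key,
  ).length;
  return refCount !== node[key].length;
}

const objExpr = {
  type: 'ObjectExpression',
  properties: [{ type: 'Property' }, { type: 'Property' }], // second one just pushed
  tokens: [
    { type: 'Punctuator', value: '{' },
    { type: 'reference', value: 'properties' }, // still only one placeholder
    { type: 'Punctuator', value: '}' },
  ],
};
// tokensOutdated(objExpr, 'properties') === true → fall back to generation
```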

conartist6 commented 2 years ago

I've made up some basic architecture documentation to try to get the basic principles of the design across quickly.

I'm also going to start experimenting with a mechanism for diffing references that would allow me to handle cases where it would be appropriate to go into fallback generation and then come back out of it within a single node.

That would allow me to handle some tricky cases like transforming:

import { /* a */ a, /* c */ c } from 'source';

// after inserting b currently you'd get the c comment attached to b:
import { /* a */ a, /* c */ b, c } from 'source';

// but b should use fallback tokens while c gets the ones that belong to it
import { /* a */ a,  b, /* c */ c } from 'source';

conartist6 commented 2 years ago

I've done more of my due diligence on prettier now too. I'm sure I haven't found every surprise, but I think I have the basics.

Essentially I have two related questions that I'm trying to understand the answers to:

  1. Can prettier use a CST to answer its queries about what the existing syntax is in a way that would allow it to work with altered trees? My impression is that the answer is yes. I would need to build a syntax-querying API that was equivalent to the APIs prettier currently uses, and then convince prettier to adopt it.
  2. Can prettier represent its output as mutations to a CST? I think the answer is a qualified yes, but first I need to convince them that they want to. They are faced with a problem currently: prettier mutates its input AST, most notably to turn (a && b) && c into a && b && c. They consider this mutation of the input AST to be a bug, because it means that any formatAst(ast) => string would have a mixture of pure (string) and impure (modified AST) output. I can offer them a solution to this problem they weren't previously considering: embracing impure output instead of pure output. They would have a formatCst(cst) => undefined method whose entire point would be to mutate the input tree. I can offer them the ability to use the tools I write to take the out array they build (containing string token values) and use it as a source by which to rebuild the node.tokens arrays in the input cst (which also has the ast modifications in it). This way the output is expressed entirely as impure mutations, which is unambiguous. It's unclear whether they will really come around to my way of thinking on this as it would require them to adopt the CST format that I am proposing.
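A toy illustration of the impure-output shape being proposed in point 2. Everything here is hypothetical, including formatCst itself; the formatting rule (collapse whitespace tokens) is a stand-in for prettier's real algorithm.

```javascript
// Hypothetical formatCst(cst) => undefined: returns nothing and instead
// rewrites node.tokens in place, so all output is delivered as mutations.
function formatCst(cst) {
  for (const token of cst.tokens) {
    if (token.type === 'Whitespace') token.value = ' '; // illustrative rule
  }
}

const cst = {
  type: 'Literal',
  tokens: [
    { type: 'Whitespace', value: '\n  ' },
    { type: 'Boolean', value: 'true' },
  ],
};
formatCst(cst);
// cst.tokens[0].value is now ' ' -- no string output, only mutations
```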