adams85 / acornima

Acornima is a standard-compliant JavaScript parser for .NET. It is a fork of Esprima.NET combined with the .NET port of the acornjs parser.
BSD 3-Clause "New" or "Revised" License

Preserving trivia #11

Open ChrML opened 2 months ago

ChrML commented 2 months ago

Original issue: https://github.com/sebastienros/esprima-dotnet/issues/391

Now that we're in a fork, I just wanted to ask if this is still the status quo. ;)

The reason I'm asking is that it would be super useful for refactoring tools, where you'd want to rewrite an expression but keep the original trivia around it.

adams85 commented 2 months ago

This parser is used in my web asset bundler, and probably Jint will adopt it sooner or later too. Currently I consider these the main consumers. They don't need the extra info; that is, they don't really want to pay the performance penalty that comes with it. So yep, the situation hasn't changed. :)

However, it would be pretty cool to support the use cases of refactoring tools too, as that would be a pretty unique feature (most JS parsers just implement plain ESTree, which doesn't include the necessary info). So I'm not against it - if we can come up with a solution that doesn't hurt the main use cases.

Without taking the perf. penalty, I can think of one viable approach: conditional compilation. We'd need two separate builds, i.e. two flavors of the parser + AST. Something like this:

[VisitableNode(ChildProperties = new[] { nameof(Id), nameof(Params), nameof(Body)})]
public sealed partial class FunctionDeclaration : Declaration, IFunction
{
    private readonly NodeList<Node> _params;

#if EXTENDED_AST
    public FunctionDeclaration(
        Token? asyncKeyword,
        Token functionKeyword,
        Token? generatorStarToken,
        Identifier? id,
        Token paramsOpenParenToken,
        in NodeList<Node> parameters,
        Token paramsCloseParenToken,
        FunctionBody body)
        : base(NodeType.FunctionDeclaration)
    {
        AsyncKeyword = asyncKeyword;
        FunctionKeyword = functionKeyword;
        GeneratorStarToken = generatorStarToken;
        Id = id;
        ParamsOpenParenToken = paramsOpenParenToken;
        _params = parameters;
        ParamsCloseParenToken = paramsCloseParenToken;
        Body = body;
    }

    public Token? AsyncKeyword { get; }
    public Token FunctionKeyword { get; }
    public Token? GeneratorStarToken { get; }
    public Token ParamsOpenParenToken { get; }
    public Token ParamsCloseParenToken { get; }

    public bool Generator => GeneratorStarToken is not null;
    public bool Async => AsyncKeyword is not null;
#else
    public FunctionDeclaration(
        Identifier? id,
        in NodeList<Node> parameters,
        FunctionBody body,
        bool generator,
        bool async)
        : base(NodeType.FunctionDeclaration)
    {
        Id = id;
        _params = parameters;
        Body = body;
        Generator = generator;
        Async = async;
    }

    public bool Generator { [MethodImpl(MethodImplOptions.AggressiveInlining)] get; }
    public bool Async { [MethodImpl(MethodImplOptions.AggressiveInlining)] get; }
#endif

    /// <remarks>
    /// Diverging from the ESTree specification, <see langword="null"/> is used to indicate an anonymous default exported function (instead of introducing <see langword="AnonymousDefaultExportedFunctionDeclaration"/>).
    /// </remarks>
    public Identifier? Id { [MethodImpl(MethodImplOptions.AggressiveInlining)] get; }
    /// <remarks>
    /// { <see cref="Identifier"/> | <see cref="ArrayPattern"/> | <see cref="ObjectPattern"/> | <see cref="AssignmentPattern"/> | <see cref="RestElement"/> }
    /// </remarks>
    public ref readonly NodeList<Node> Params { [MethodImpl(MethodImplOptions.AggressiveInlining)] get => ref _params; }

    public FunctionBody Body { [MethodImpl(MethodImplOptions.AggressiveInlining)] get; }
    StatementOrExpression IFunction.Body => Body;

    bool IFunction.Expression => false;

    private FunctionDeclaration Rewrite(Identifier? id, in NodeList<Node> @params, FunctionBody body)
    {
        // Note: this uses the plain (non-extended) constructor, so the EXTENDED_AST build
        // would need its own overload that also carries over the token data.
        return new FunctionDeclaration(id, @params, body, Generator, Async);
    }
}

Obviously, conditional parts need to be added to the parser as well (plus the source generator needs to be adjusted for conditional compilation), which can become pretty ugly, but in theory this is all doable.

I, however, have no incentive to implement this at the moment. But contributions are always welcome. ;)

ChrML commented 2 months ago

I see. I'm quite sure it's technically possible, though, without much performance penalty, as long as it can be opted into. E.g. the way Roslyn builds its data, which is very fast.

But I see the argument about the lack of incentive to do it. I'll see if I can suggest something later once I start using it.

adams85 commented 2 months ago

I'm quite sure it's technically possible, though, without much performance penalty

It's definitely possible; the key question is how big that performance penalty will be. Because of consumers like Jint, it must be close to zero. That applies not only to execution time but also to the memory consumption of the parser and the memory footprint of the resulting AST.

I can see the following possible ways:

  1. Compile-time solution: two separate builds of the library via conditional compilation, as mentioned in my previous comment. That can be done with zero penalty, in exchange for littering the code with preprocessor directives.
  2. Run-time solution: introducing a parser option and checking for it in the parser code. This would mean a lot of if branching, i.e. certainly a greater-than-zero penalty. Plus, the code would be muddied up nearly as much as in the case of Solution 1. We'd also need to think about the memory footprint of the AST, though that could be worked around using the internal data structure we already have in place for storing extra data.
  3. Another hypothetical run-time solution (though I'm not sure it's a viable one): using the ParserOptions.OnToken and ParserOptions.OnComment callbacks, it's possible to collect the token and comment data. Using all this + the AST, it might be possible to calculate the trivia info on demand, after parsing. For example, if you want to get the token data for the function keyword of an async function, you can do the following: the function node's location info tells you at what position the function starts. You look up the first token whose start position is >= the function's start position. That must be the async keyword token, since we're talking about an async function. So you need to get the following token, which must be the function keyword. (Assuming that it's not a generator function - but you can easily learn that by looking at the function node's Generator property.) Once you have the location info of the function keyword, you can calculate the surrounding trivia. For the trailing trivia, you get the next token from the token list. Between the end position of the function keyword and the start position of the next token there can only be whitespace and comments. From the comment list you can look up the comments; the whitespace trivia will be the parts outside the comments.
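
To make that lookup a bit more concrete, here is a minimal, self-contained sketch of the idea. Note that TokenInfo and CommentInfo are just placeholder records standing in for whatever data the OnToken/OnComment callbacks would collect; they are not the parser's actual types:

using System.Collections.Generic;

// Placeholder records; in reality these would be filled from the ParserOptions.OnToken
// and ParserOptions.OnComment callbacks during parsing.
public readonly record struct TokenInfo(string Kind, int Start, int End);
public readonly record struct CommentInfo(int Start, int End);

public static class TriviaLookup
{
    // Returns the index of the first token whose start position is >= `position`.
    // Assumes `tokens` is sorted by start position (tokens are emitted in source order).
    public static int FirstTokenAtOrAfter(IReadOnlyList<TokenInfo> tokens, int position)
    {
        int lo = 0, hi = tokens.Count;
        while (lo < hi)
        {
            int mid = (lo + hi) >> 1;
            if (tokens[mid].Start < position) lo = mid + 1;
            else hi = mid;
        }
        return lo; // tokens.Count means "no such token"
    }

    // The trailing trivia of the token at `index` spans from its end to the start of the
    // next token (or the end of the source); within that span there can only be
    // whitespace and the comments collected via OnComment.
    public static (int Start, int End) TrailingTriviaSpan(
        IReadOnlyList<TokenInfo> tokens, int index, int sourceLength)
    {
        int start = tokens[index].End;
        int end = index + 1 < tokens.Count ? tokens[index + 1].Start : sourceLength;
        return (start, end);
    }
}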

To sum it up, I'd suggest checking the viability of Solution 3 first. It would require some post-parsing computation, but it would have a big advantage: there would be no need to alter the AST or the code of the parser. What I'm not sure about is whether it's doable for all possible syntax constructs. Determining that would need some research.

If this approach is not viable or you don't like it, I'd still discourage chasing Solution 2. Chances are that it would lead to a dead end. In that case, the only viable way I can see is Solution 1. Unless, of course, you can come up with another one that meets the criteria laid down above. Taking a look at Roslyn would definitely be useful; it might give us ideas I can't see now.

Anyway, if you decide to take a stab at this nice problem, just let me know if you need any guidance.

CharlieEriksen commented 1 month ago

Also just adding a vote here. For me, I really wish that comments were part of the tree. I have a use case where I'm taking minified code and prettifying it, but in the process of going AST-to-string, the prettified code loses the comments.

I did look into it back in 2022, but was not convinced that this was gonna be easy to solve: https://github.com/sebastienros/esprima-dotnet/issues/39

At some point, I'll probably be desperate enough to look into this in more depth. When considering it, what sticks out for me is:

  1. There's a question about "correctness". How close to correct is acceptable?
  2. Are there any currently known shortcomings that will be inherent with option 3?
  3. Any known pitfalls for option 3, like paths that may seem appealing at first but will prove impossible due to non-obvious reasons?
  4. I am not super familiar with the AST-to-string part of things. If the data gets attached to the node, are there any obvious pitfalls here to consider?

adams85 commented 4 weeks ago

Hey @CharlieEriksen,

For sure, this will be a tough nut to crack, no matter which path we go down.

This is my take on your points:

  1. I don't think I would include any solution in this repo other than one that provides full fidelity.
  2. Extra data can't be included in the AST nodes directly (or at least, it can only be done via the UserData property) because this whole thing should go into the Extras package. This also implies a second heap allocation for each node, which obviously means a performance penalty (increased execution time and memory consumption).
  3. None at the moment. But it's very hard to foresee issues without deeper exploration. When I find some time, I'll make an attempt at laying down the foundation of option 3, which will hopefully help with identifying potential blocker issues.
  4. Parsing the code with full fidelity is one thing, but generating code from such an AST is another... What I can say at the moment is that I don't consider the generator part that performance-sensitive, so it'd probably be OK to bake the extra logic into it directly. The good news is that the writer component is more or less already prepared for dealing with trivia. Nevertheless, unforeseeable difficulties are possible here as well.

CharlieEriksen commented 3 weeks ago

I had a bit of an interesting thought: How does prettier do this? https://github.com/prettier/prettier/tree/main/src/language-js

It's interesting to notice that prettier actually has the option of using acorn as its parser. And for my specific use-case, I'm quite OK with the output format being opinionated.

So once I have more time to look into how prettier does things, I have a feeling that it might be a good starting point to investigate from.

ChrML commented 1 week ago

  1. Extra data can't be included in the AST nodes directly (or at least, it can only be done via the UserData property) because this whole thing should go into the Extras package. This also implies a second heap allocation for each node, which obviously means a performance penalty (increased execution time and memory consumption).

About this, to avoid introducing extra processing, one could have a separate data-store for storing trivia information. Something like (not exactly, but to give an idea):

public interface ITriviaStore
{
    void SetTriviaBefore(AstNode node, ReadOnlySpan<char> trivia);
    void SetTriviaAfter(AstNode node, ReadOnlySpan<char> trivia);
}

This way, no extra data is added to the nodes, but you could have an interface that is 'null' when you don't need this data, and the parser would behave exactly like before. And you could have a predefined implementation that stores this data in a map if you do need it. A code writer could also take an optional ITriviaStore and use it to fetch the trivia in order to preserve it.

It's a bit more expensive than storing the trivia directly in the nodes when you do need the data, but the extra cost is next to nothing when you don't.
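
To illustrate, a minimal dictionary-backed sketch of what I mean (TryGetTrivia is a hypothetical read-side counterpart for a code writer; AstNode is the node type from the interface above, and the spans are copied to strings because a ReadOnlySpan<char> can't be stored in a field):

using System;
using System.Collections.Generic;

public sealed class DictionaryTriviaStore : ITriviaStore
{
    // Keyed by node reference; the value is a plain value type, so there's no extra
    // heap allocation per entry beyond the dictionary's own storage.
    private readonly Dictionary<AstNode, (string? Before, string? After)> _trivia = new();

    public void SetTriviaBefore(AstNode node, ReadOnlySpan<char> trivia)
    {
        _trivia.TryGetValue(node, out var entry);
        _trivia[node] = (trivia.ToString(), entry.After);
    }

    public void SetTriviaAfter(AstNode node, ReadOnlySpan<char> trivia)
    {
        _trivia.TryGetValue(node, out var entry);
        _trivia[node] = (entry.Before, trivia.ToString());
    }

    // Hypothetical read side, e.g. for a code writer that wants to re-emit the trivia.
    public bool TryGetTrivia(AstNode node, out string? before, out string? after)
    {
        bool found = _trivia.TryGetValue(node, out var entry);
        (before, after) = entry;
        return found;
    }
}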

adams85 commented 1 week ago

I agree that we'll need some kind of abstraction for this feature to encapsulate and hide the actual data structure that stores the required data.

My comment, however, wasn't about the abstraction but about storage. Somehow we need to associate the AST nodes with the extra data. I can't see any other options than these:

  1. Dictionary<Node, ExtraData>
  2. ConditionalWeakTable<Node, ExtraData>
  3. Node.UserData
  4. Adding a dedicated field to Node. (Listed only for completeness. I've hesitated to introduce a dedicated field for storing the parent node reference, so probably I'll be even less inclined to add a field for storing a reference to trivia data.)

The only option of these that doesn't require a heap allocation per AST node is the dictionary, because the dictionary value can be a value type as well. Nowadays you can even access dictionary values by reference, without copying (e.g. via CollectionsMarshal.GetValueRefOrNullRef). So, from a performance perspective, Dictionary (or maybe FrozenDictionary) looks to be the best option.
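
A small sketch of what that could look like (ExtraData is a hypothetical value-type payload, TNode stands in for the AST node type, and the by-reference lookup uses CollectionsMarshal, available since .NET 6):

using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

// Hypothetical value-type payload associated with a node; no per-node heap allocation.
public readonly struct ExtraData
{
    public ExtraData(int leadingTriviaStart, int leadingTriviaEnd)
        => (LeadingTriviaStart, LeadingTriviaEnd) = (leadingTriviaStart, leadingTriviaEnd);

    public int LeadingTriviaStart { get; }
    public int LeadingTriviaEnd { get; }
}

public static class ExtraDataLookup
{
    // Looks up the entry by reference so the struct value isn't copied on access.
    public static bool TryGetLeadingTriviaSpan<TNode>(
        Dictionary<TNode, ExtraData> extras, TNode node, out int start, out int end)
        where TNode : notnull
    {
        ref var data = ref CollectionsMarshal.GetValueRefOrNullRef(extras, node);
        if (Unsafe.IsNullRef(ref data))
        {
            start = end = 0;
            return false;
        }
        (start, end) = (data.LeadingTriviaStart, data.LeadingTriviaEnd);
        return true;
    }
}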

About the abstraction:

As discussed above, to compute whitespace trivia, we'll certainly need the token and comment data besides the AST nodes. The parser provides these on demand, via the OnToken and OnComment callbacks.

So, I can think of some extension methods in the Extras package like this:

public static class ParserExtensions
{
    public static FullSyntaxTree<Script> ParseScriptFull(this IParser parser, string input, string? sourceFile = null, bool strict = false);
    public static FullSyntaxTree<Module> ParseModuleFull(this IParser parser, string input, string? sourceFile = null);
    /* etc. */
}

public abstract class FullSyntaxTree
{
    public Node Root => GetRoot();
    protected abstract Node GetRoot();

    // We'd store these as arrays internally so we can return readonly references to the items.
    public IReadOnlyList<Token> Tokens { get; }
    public IReadOnlyList<Comment> Comments { get; }
}

public sealed class FullSyntaxTree<TRoot> : FullSyntaxTree where TRoot : Node
{
    public new TRoot Root { get; }
    protected override Node GetRoot() => Root;
}

Now, this class can implement something like ITriviaStore.
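
Purely as a usage sketch, assuming the hypothetical ParseScriptFull extension above and the regular Parser type:

var parser = new Acornima.Parser();

// ParseScriptFull would run the regular parser with token/comment collection enabled
// and bundle the root node, the token list and the comment list into one object.
var tree = parser.ParseScriptFull("async function f(a, b) { /* body */ }");

var root = tree.Root;                    // the Script node
var tokenCount = tree.Tokens.Count;      // collected via OnToken
var commentCount = tree.Comments.Count;  // collected via OnComment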

Alternatively, I can imagine extension methods for the AST nodes as well. For example:

public static class AstExtensions
{
    public static TriviaList LeadingTrivia(this Node node, FullSyntaxTree syntaxTree);
    public static TriviaList TrailingTrivia(this Node node, FullSyntaxTree syntaxTree);

    public static ref readonly Token AsyncKeyword(this FunctionDeclaration node, FullSyntaxTree syntaxTree);
}

This would allow us to write code like this:

var asyncKeyword = functionDeclaration.AsyncKeyword(syntaxTree);
var trivia = functionDeclaration.LeadingTrivia(syntaxTree);

What's your opinion on this? (Just to be clear, I'm talking about option 3: "another hypothetical run-time solution" mentioned in this comment of mine, which I'm still not sure is possible to do in practice.)

adams85 commented 1 week ago

Now I'm getting creative:


public abstract class FullSyntaxTree
{
    /* ... */

    public FunctionDeclarationTokens TokensOf(FunctionDeclaration node) => new FunctionDeclarationTokens(this, node);

    public TriviaList LeadingTriviaOf(in Token token);
    public TriviaList TrailingTriviaOf(in Token token);
}

public readonly struct FunctionDeclarationTokens
{
    private readonly FullSyntaxTree _syntaxTree;
    private readonly FunctionDeclaration _node;

    public FunctionDeclarationTokens(FullSyntaxTree syntaxTree, FunctionDeclaration node)
        => (_syntaxTree, _node) = (syntaxTree, node);

    public ref readonly Token AsyncKeyword { get; }
    public ref readonly Token FunctionKeyword { get; }
    public ref readonly Token GeneratorStarToken { get; }
    public ref readonly Token ParamsOpenParenToken { get; }
    public ref readonly Token ParamsCloseParenToken { get; }
    public ref readonly Token ParamSeparatorCommaToken(int index);
    /* etc. */
}

which allows

ref readonly var asyncKeyword = ref syntaxTree.TokensOf(functionDeclaration).AsyncKeyword;
var trivia = syntaxTree.LeadingTriviaOf(asyncKeyword);

In the meantime I realized that it doesn't make much sense to query trivia info for AST nodes. It does for specific tokens (just think about cases like async /* comment */ function() { ... } ).

Another remark: by returning ref readonly we can't really communicate whether the token is optional (plus the consumer would need to do ugly Unsafe.IsNullRef calls on the returned references), so probably returning something like this would be a better choice.
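
Purely for illustration, one possible shape for such a wrapper (Token and FullSyntaxTree refer to the sketches above; the index-based lookup into the token list is just an assumption about the internal storage):

using System;

// Makes the "token may be absent" case explicit, so consumers don't have to
// check Unsafe.IsNullRef on a returned reference.
public readonly struct OptionalToken
{
    private readonly FullSyntaxTree? _syntaxTree;
    private readonly int _tokenIndex;

    public OptionalToken(FullSyntaxTree syntaxTree, int tokenIndex)
        => (_syntaxTree, _tokenIndex) = (syntaxTree, tokenIndex);

    public bool HasValue => _syntaxTree is not null;

    public Token Value => _syntaxTree is not null
        ? _syntaxTree.Tokens[_tokenIndex]
        : throw new InvalidOperationException("The token is not present.");
}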

We should also keep an eye on the upcoming "extension everything" C# feature. It might come in handy to reduce the boilerplate.

But I don't think it's the API design that's going to be the hard part here.