dlang-community / libdparse

Library for lexing and parsing D source code
https://libdparse.dlang.io
Boost Software License 1.0

`Token.sizeof` has exploded #410

Closed · ghost closed this issue 1 year ago

ghost commented 4 years ago

:boom:

WebFreak001 commented 4 years ago

Could you please elaborate? It's 112 bytes on 64-bit Linux right now. The last change to Token (adding trivia) added two arrays, i.e. 32 bytes.

It uses 32 bytes for two cache strings, which previously were just the comment/trailingComment fields; now they are a trade-off that avoids a lot of GC allocation on repeated comment/trailingComment calls.
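For reference, the size is easy to check on your own platform; a minimal sketch, assuming libdparse is available as a dependency:

```d
// Minimal check of the Token size on the current platform.
import dparse.lexer : Token;
import std.stdio : writeln;

void main()
{
    // Reported as 112 bytes on 64-bit Linux in the comment above.
    writeln(Token.sizeof);
}
```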

When using the getTokensForParser data, it allocates the two trivia arrays with a lot of whitespace and comment tokens, which are currently forced on. This could probably be changed to actually use the LexerConfig parameter it already gets from the application, but if you have one place where you store your tokens (some cache) in an IDE-like application, then you will probably keep them enabled anyway to allow all kinds of source operations.
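As an illustration, here is a hedged sketch of how the trivia ends up on the tokens; the field names `leadingTrivia`/`trailingTrivia` are an assumption about the current Token layout, and the getTokensForParser call follows the dparse.lexer API as I understand it:

```d
// Sketch: lex a snippet with getTokensForParser and inspect the trivia
// arrays attached to each token (trivia field names assumed, see above).
import dparse.lexer;
import std.stdio : writefln;

void main()
{
    auto source = cast(ubyte[]) "/* doc */ int x; // trailing".dup;
    LexerConfig config;
    auto cache = StringCache(StringCache.defaultBucketCount);

    // Whitespace and comments are folded into each token's trivia
    // rather than appearing as standalone tokens in this stream.
    auto tokens = getTokensForParser(source, config, &cache);
    foreach (t; tokens)
        writefln("%-10s leading=%s trailing=%s",
                 str(t.type), t.leadingTrivia.length, t.trailingTrivia.length);
}
```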

ghost commented 4 years ago

There are several solutions; here are three, sorted by personal preference.

  1. Move the trivia to an external registry (some might rather call this a "store") and add a size_t index to Token that points to the registry entry (see the sketch after this comment).
  2. In the AST, don't use Token but Token*.
  3. Use linked lists instead of arrays.

The problem is that many nodes store a Token, and most of the time the associated comments are of no use.
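A minimal sketch of what option 1 could look like; all names here (Trivia, TriviaRegistry, SlimToken, triviaIndex) are hypothetical and do not exist in libdparse:

```d
// Hypothetical illustration of option 1: trivia lives in a registry,
// tokens keep only an index into it.
struct Trivia
{
    string comment;
    string trailingComment;
}

struct TriviaRegistry
{
    Trivia[] entries;

    // Store a trivia record and hand back the index the token keeps.
    size_t add(Trivia t)
    {
        entries ~= t;
        return entries.length - 1;
    }
}

struct SlimToken
{
    // ... type, text, line, column ...
    size_t triviaIndex; // 8 bytes instead of two 16-byte arrays
}
```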

WebFreak001 commented 4 years ago

In the AST, when using the common `Token[] tokens;` slice, it's basically a pointer. For the AST this is optimal because it just keeps slices of the input data and doesn't copy it, which is 16 bytes per AST node, a cost that can be weighed against all the functionality it allows.

There are still legacy fields where the token was manually assigned; for those, it might now be smarter to just add property functions (returning ref Token) that read from the tokens slice. The tokens slice is fairly recent, so I don't think any classes inside ast.d utilize it yet; it's more for application users.
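A hedged sketch of that idea; ExampleNode and the dot-lookup logic are hypothetical and only show the shape of such a property:

```d
// Hypothetical: instead of storing a Token field, a node exposes a
// property that returns a reference into its existing tokens slice.
import dparse.lexer : Token, tok;

final class ExampleNode
{
    /** slice of the input token stream covering this node */
    const(Token)[] tokens;

    /// Return the first `.` token inside this node, by reference.
    ref const(Token) dot() @property
    {
        foreach (ref t; tokens)
            if (t.type == tok!".")
                return t;
        assert(false, "node has no dot token");
    }
}
```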

Could you please describe the actual problem you are running into instead?

ghost commented 4 years ago

No, I mean where a node has a Token field, like these in PrimaryExpression:

```d
final class PrimaryExpression : ExpressionNode
{
    override void accept(ASTVisitor visitor) const
    {
        mixin(visitIfNotNull!(basicType, typeConstructor, type, primary,
                typeofExpression, typeidExpression, arrayLiteral, assocArrayLiteral,
                expression, dot, identifierOrTemplateInstance, isExpression,
                functionLiteralExpression, traitsExpression, mixinExpression,
                importExpression, vector, arguments));
    }
    /** HERE */ Token dot;
    /** HERE */ Token primary;
    /** */ IdentifierOrTemplateInstance identifierOrTemplateInstance;
    /** HERE */ Token basicType;
    /** */ TypeofExpression typeofExpression;
    /** */ TypeidExpression typeidExpression;
    /** */ ArrayLiteral arrayLiteral;
    /** */ AssocArrayLiteral assocArrayLiteral;
    /** */ Expression expression;
    /** */ IsExpression isExpression;
    /** */ FunctionLiteralExpression functionLiteralExpression;
    /** */ TraitsExpression traitsExpression;
    /** */ MixinExpression mixinExpression;
    /** */ ImportExpression importExpression;
    /** */ Vector vector;
    /** */ Type type;
    /** HERE */ Token typeConstructor;
    /** */ Arguments arguments;
    mixin OpEquals;
}
```

Associating the comments the way it was done was an error, in my opinion.

WebFreak001 commented 1 year ago

More tools based on libdparse have started using the trivia for its intended purposes: recreating code from modified input tokens, modifying tokens without changing whitespace or comments, documentation parsers accessing comments of unexposed parts, and code looking at tokens outside the AST node itself by going past the bounds of the node's tokens slice (which is valid as long as you stay within the original token array).
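For instance, a hedged sketch of the "recreating code" use case; it assumes the trivia arrays are named `leadingTrivia`/`trailingTrivia` and contain tokens with a `text` field, so adjust to the actual API of your libdparse version:

```d
// Sketch: re-emit source text from a token stream, using each token's
// trivia to preserve the original whitespace and comments.
import dparse.lexer : Token, str;
import std.array : appender;

string rebuild(const(Token)[] tokens)
{
    auto app = appender!string;
    foreach (t; tokens)
    {
        foreach (tr; t.leadingTrivia)
            app.put(tr.text);
        // Keywords and operators have no stored text; fall back to the
        // canonical spelling of the token type.
        app.put(t.text.length ? t.text : str(t.type));
        foreach (tr; t.trailingTrivia)
            app.put(tr.text);
    }
    return app.data;
}
```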

So I think the API change was a success: it unlocks many nice additional features in third-party apps without requiring them to go through long PR processes within libdparse. The memory usage is being tackled mainly by improving the AST, which takes up much more memory than the tokens.

So closing this as wont-fix.