antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License

[TypeScript]: emitToken(token Token) is not pushing new tokens to the TokensArray in CommonTokenStream #4024

Open SuraiyaBegumK opened 5 months ago

SuraiyaBegumK commented 5 months ago

I created a parser in TypeScript for a YAML-like language, though my language is a simpler one. I want to handle indents and dedents when a newline token occurs.

Issue 1: I observed that every time I try to get the next token using super.nextToken(), it directly calls emitToken() and pushes a token into the tokens array. Because of this, the token is pushed before my checks can run (for example, I added a condition to skip whitespace using skip(), but the token has already been pushed before execution reaches that line).

Issue 2: When I create an instance of the Lexer, I can see all the tokens that got pushed (even the unnecessary ones from Issue 1), but after converting them into a token stream using CommonTokenStream, I can no longer see the tokens I pushed into the tokens array. I noticed that the tokens exist inside tokenSource, but I am unable to access them.
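For context, CommonTokenStream never reads a lexer's internal arrays: it repeatedly calls nextToken() on its token source and buffers whatever that call returns. The usual workaround is therefore to queue custom tokens and drain the queue from an overridden nextToken(). A minimal self-contained sketch of that queue pattern (plain TypeScript; Tok, QueueingSource, and the token values are illustrative, not part of the antlr4 API):

```typescript
// Minimal sketch of the pending-token queue pattern (all names here are
// illustrative, not part of the antlr4 API). A consumer such as
// CommonTokenStream only ever sees what nextToken() returns, so custom
// tokens must pass through the queue rather than a side array.
interface Tok { type: number; text: string; }

class QueueingSource {
    private pending: Tok[] = []; // tokens waiting to be handed to the stream
    private pos = 0;

    constructor(private input: Tok[]) {}

    // Stand-in for super.nextToken(): next raw token from the input.
    private rawNextToken(): Tok {
        return this.input[this.pos++] ?? { type: -1, text: "<EOF>" };
    }

    // What the token stream calls. Refill the queue, then drain it.
    nextToken(): Tok {
        if (this.pending.length === 0) {
            const t = this.rawNextToken();
            this.pending.push(t);
            if (t.text === "\n") {
                // inject a synthetic token right after the newline
                this.pending.push({ type: 99, text: "INDENT" });
            }
        }
        return this.pending.shift()!;
    }
}
```

This is the same shape RobEin's answer later in this thread uses: a checkNextToken() step fills pendingTokens, and nextToken() drains it, so injected tokens reach the token stream through the normal channel.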


After passing Lexer to CommonTokenStream


Lexer Class in ANTLR4

```typescript
export declare class Lexer extends Recognizer {

    static DEFAULT_MODE: number;

    _input: CharStream;
    _interp: LexerATNSimulator;
    text: string;
    line: number;
    column: number;
    _tokenStartCharIndex: number;
    _tokenStartLine: number;
    _tokenStartColumn: number;
    _type: number;

    constructor(input: CharStream);
    reset(): void;
    nextToken(): Token;
    skip(): void;
    more(): void;
    more(m: number): void;
    pushMode(m: number): void;
    popMode(): number;
    emitToken(token: Token): void;
    emit(): Token;
    emitEOF(): Token;
    getAllTokens(): Token[];
}
```

Logic implemented to push INDENT and DEDENT tokens into the token array (snippet truncated):

```typescript
import { CharStream, Token, CommonToken, Lexer } from "antlr4";
import MyParser from "./MyParser";

export default class MyLexerBase extends Lexer { /**
```

kaby76 commented 5 months ago

> for YAML like language

Which grammar in this repo, grammars-v4, does this bug apply to? We don't have a grammar for YAML.

I recommend that you try the antlr4ng tool and runtime. There is a TypeScript target for ANTLR 4.13.1, but it is unlikely that any fixes will be made to that code. You can then raise a GitHub issue over there if you see a problem with the runtime. You will need to use the antlr4ng-cli tool to generate the updated code for your parser.

SuraiyaBegumK commented 5 months ago

I have my own grammar; I referred to the Python sample in this repo for the indent-handling logic:

```typescript
// Override emit method to customize token emission if necessary
emitToken(token: Token) {
    super.emitToken(token);
    this.tokens.push(token);
}
```

This is where I am emitting my custom INDENT and DEDENT tokens, but they are not getting emitted.

RobEin commented 5 months ago

I don't know if it helps, but I use something like this to insert INDENT and DEDENT tokens for Python lexers with the TypeScript target. Note that it does not use emitToken() at all.

```typescript
import { CharStream, Token, CommonToken, Lexer } from "antlr4";
import PythonLexer from "./PythonLexer";
import * as Collections from "typescript-collections";

export default abstract class PythonLexerBase extends Lexer {
    // A stack that keeps track of the indentation lengths
    private indentLengthStack!: Collections.Stack<number>;
    // A list where tokens are waiting to be loaded into the token stream
    private pendingTokens!: Array<Token>;

    // last pending token types
    private previousPendingTokenType!: number;
    private lastPendingTokenTypeFromDefaultChannel!: number;

    private curToken: CommonToken | undefined; // current (under processing) token
    private ffgToken: Token | undefined;       // following (look ahead) token

    protected constructor(input: CharStream) {
        super(input);
        this.init();
    }

    private init(): void {
        this.indentLengthStack = new Collections.Stack<number>();
        this.pendingTokens = [];
        this.previousPendingTokenType = 0;
        this.lastPendingTokenTypeFromDefaultChannel = 0;
        this.curToken = undefined;
        this.ffgToken = undefined;
    }

    public nextToken(): Token { // reading the input stream until a return EOF
        this.checkNextToken();
        return this.pendingTokens.shift()!; // add the queued token to the token stream
    }

    private checkNextToken(): void {
        if (this.previousPendingTokenType !== PythonLexer.EOF) {
            this.setCurrentAndFollowingTokens();
            if (this.indentLengthStack.isEmpty()) { // We're at the first token
                this.handleStartOfInput();
            }

            switch (this.curToken!.type) {
                case PythonLexer.NEWLINE:
                    this.handleNEWLINEtoken();
                    break;
                case PythonLexer.EOF:
                    this.handleEOFtoken();
                    break;
                default:
                    this.addPendingToken(this.curToken!);
            }
        }
    }

    private setCurrentAndFollowingTokens() {
        this.curToken = this.ffgToken === undefined
            ? this.getCommonTokenByToken(super.nextToken())
            : this.getCommonTokenByToken(this.ffgToken);

        this.ffgToken = this.curToken.type === PythonLexer.EOF
            ? this.curToken
            : this.getCommonTokenByToken(super.nextToken());
    }

    private handleStartOfInput() {
        // initialize the stack with a default 0 indentation length
        this.indentLengthStack.push(0); // this will never be popped off
    }

    private handleNEWLINEtoken() {
        const nlToken = this.curToken!; // save the current NEWLINE token
        const isLookingAhead = this.ffgToken!.type === PythonLexer.WS;
        if (isLookingAhead) {
            this.setCurrentAndFollowingTokens(); // set the next two tokens
        }

        switch (this.ffgToken!.type) {
            case PythonLexer.NEWLINE: // We're before a blank line
            case PythonLexer.COMMENT: // We're before a comment
                this.hideAndAddPendingToken(nlToken);
                if (isLookingAhead) {
                    this.addPendingToken(this.curToken!);  // WS token
                }
                break;
            default:
                this.addPendingToken(nlToken);
                if (isLookingAhead) { // We're on whitespace(s) followed by a statement
                    const indentationLength = this.ffgToken!.type === PythonLexer.EOF ?
                        0 :
                        this.getIndentationLength(this.curToken!.text);

                    this.addPendingToken(this.curToken!); // WS token
                    this.insertIndentOrDedentToken(indentationLength); // may insert INDENT token or DEDENT token(s)
                } else { // We're at a newline followed by a statement (there is no whitespace before the statement)
                    this.insertIndentOrDedentToken(0); // may insert DEDENT token(s)
                }
        }
    }

    private insertIndentOrDedentToken(curIndentLength: number) {
        let prevIndentLength: number = this.indentLengthStack.peek()!;
        if (curIndentLength > prevIndentLength) {
            this.createAndAddPendingToken(PythonLexer.INDENT, Token.DEFAULT_CHANNEL, "INDENT", this.ffgToken!);
            this.indentLengthStack.push(curIndentLength);
        } else {
            while (curIndentLength < prevIndentLength) { // more than 1 DEDENT token may be inserted to the token stream
                this.indentLengthStack.pop();
                prevIndentLength = this.indentLengthStack.peek()!;
                if (curIndentLength <= prevIndentLength) {
                    this.createAndAddPendingToken(PythonLexer.DEDENT, Token.DEFAULT_CHANNEL, "DEDENT", this.ffgToken!);
                } else {
                    // this.reportError("inconsistent dedent");
                }
            }
        }
    }

    private insertTrailingTokens() {
        switch (this.lastPendingTokenTypeFromDefaultChannel) {
            case PythonLexer.NEWLINE:
            case PythonLexer.DEDENT:
                break; // no trailing NEWLINE token is needed
            default:
                // insert an extra trailing NEWLINE token that serves as the end of the last statement
                this.createAndAddPendingToken(PythonLexer.NEWLINE, Token.DEFAULT_CHANNEL, "NEWLINE", this.ffgToken!); // ffgToken is EOF
        }
        this.insertIndentOrDedentToken(0); // Now insert as much trailing DEDENT tokens as needed
    }

    private handleEOFtoken() {
        if (this.lastPendingTokenTypeFromDefaultChannel > 0) {
            // there was a statement in the input (leading NEWLINE tokens are hidden)
            this.insertTrailingTokens();
        }
        this.addPendingToken(this.curToken!);
    }

    private hideAndAddPendingToken(cToken: CommonToken) {
        cToken.channel = Token.HIDDEN_CHANNEL;
        this.addPendingToken(cToken);
    }

    private createAndAddPendingToken(type: number, channel: number, text: string, baseToken: Token) {
        const cToken: CommonToken = this.getCommonTokenByToken(baseToken);
        cToken.type = type;
        cToken.channel = channel;
        cToken.stop = baseToken.start - 1;
        cToken.text = text;
        this.addPendingToken(cToken);
    }

    private addPendingToken(token: Token) {
        // save the last pending token type because the pendingTokens list can be empty by the nextToken()
        this.previousPendingTokenType = token.type;
        if (token.channel === Token.DEFAULT_CHANNEL) {
            this.lastPendingTokenTypeFromDefaultChannel = this.previousPendingTokenType;
        }
        this.pendingTokens.push(token);
    }

    private getCommonTokenByToken(oldToken: Token): CommonToken {
        const cToken = new CommonToken([this, oldToken.getInputStream()], oldToken.type, oldToken.channel, oldToken.start, oldToken.stop);
        cToken.tokenIndex = oldToken.tokenIndex;
        cToken.line = oldToken.line;
        cToken.column = oldToken.column;
        cToken.text = oldToken.text;
        return cToken;
    }

    private getIndentationLength(textWS: string): number {
        const TAB_LENGTH = 8; // the standard number of spaces to replace a tab to spaces
        let length = 0;

        for (let ch of textWS) {
            switch (ch) {
                case " ":
                    length += 1;
                    break;
                case "\t":
                    length += TAB_LENGTH - (length % TAB_LENGTH);
                    break;
            }
        }
        return length;
    }

    public reset() {
        this.init();
        super.reset();
    }
}
```
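As a side note on the tab handling: getIndentationLength() does not simply count a tab as 8 spaces; it rounds the running length up to the next multiple of TAB_LENGTH, so "  \t" and "\t" both measure 8. The arithmetic, extracted into a standalone function for illustration (indentationLength is a hypothetical helper, not part of the class above):

```typescript
// Standalone version of the tab arithmetic in getIndentationLength above:
// a space adds 1, and a tab advances the length to the next multiple of
// tabLength (8 by default, as in the class above).
function indentationLength(textWS: string, tabLength: number = 8): number {
    let length = 0;
    for (const ch of textWS) {
        if (ch === " ") {
            length += 1;
        } else if (ch === "\t") {
            length += tabLength - (length % tabLength);
        }
    }
    return length;
}
```

For example, four spaces measure 4, a lone tab measures 8, two spaces plus a tab still measure 8, and a tab followed by a space measures 9.
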
SuraiyaBegumK commented 5 months ago

Hello @RobEin,

Thank you for sharing the detailed logic.

However, I don't see DEFAULT_CHANNEL/HIDDEN_CHANNEL in Token.d.ts. Could you please help me out here, @RobEin?


Token.d.ts

```typescript
import {CharStream} from "./CharStream";

export declare class Token {

    static EOF: number;

    tokenIndex: number;
    line: number;
    column: number;
    channel: number;
    text: string;
    type: number;
    start: number;
    stop: number;

    clone(): Token;
    cloneWithType(type: number): Token;
    getInputStream(): CharStream;
}
```

RobEin commented 5 months ago

The constants have already been added to Token.d.ts. In principle, they will be included in the next ANTLR release (4.13.2). Until then, feel free to use this instead of the current one.

You can rebuild it with:

```shell
cd .\node_modules\antlr4
npm run build
```

Although a rebuild may not be necessary if you only insert the two constants (DEFAULT_CHANNEL/HIDDEN_CHANNEL).
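For reference, these constants carry the values 0 (DEFAULT_CHANNEL) and 1 (HIDDEN_CHANNEL) in the ANTLR runtimes, and CommonTokenStream only surfaces tokens whose channel matches the one it was constructed with (the default channel unless specified). A self-contained sketch of that filtering behavior (Tok and onChannel are illustrative stand-ins, not the real antlr4 API):

```typescript
// Channel values as used by the ANTLR runtimes.
const DEFAULT_CHANNEL = 0;
const HIDDEN_CHANNEL = 1;

interface Tok { text: string; channel: number; }

// CommonTokenStream-style view: only tokens on the requested channel are
// visible to the parser; hidden-channel tokens (whitespace, comments) are not.
function onChannel(tokens: Tok[], channel: number = DEFAULT_CHANNEL): Tok[] {
    return tokens.filter(t => t.channel === channel);
}
```

This is why hideAndAddPendingToken() in the lexer base class above sets channel = Token.HIDDEN_CHANNEL: the token stays in the stream but the parser never sees it.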

SuraiyaBegumK commented 5 months ago


I have added the constants for DEFAULT_CHANNEL/HIDDEN_CHANNEL, and it worked. Thank you, @RobEin!