antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License

[TypeScript]: emitToken(token Token) is not pushing new tokens to the TokensArray in CommonTokenStream #4024

Open SuraiyaBegumK opened 5 months ago

SuraiyaBegumK commented 5 months ago

I created a parser in TypeScript for a YAML-like language, though my language is a simpler one. I want to handle indents and dedents when a newline token occurs.

Issue 1: I observed that every time I try to get the next token using super.nextToken(), it directly calls emitToken() and pushes a token into the tokens array. Because of this, the token is pushed before my checks can run (for example, I added a condition to skip whitespace using skip(), but the token has already been pushed before execution reaches that line).

Issue 2: When I create an instance of the Lexer, I can see all the tokens that got pushed (even the unnecessary ones from Issue 1), but after converting them into a token stream using CommonTokenStream, I can no longer see the tokens I pushed into the tokens array. I noticed that the tokens exist inside tokenSource, but I am unable to access them.
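For context, CommonTokenStream never reads a lexer's internal arrays: it repeatedly calls nextToken() on its token source and buffers whatever that call returns. The usual workaround is therefore to queue custom tokens and drain the queue from an overridden nextToken(). A minimal self-contained sketch of that queue pattern (plain TypeScript; Tok, QueueingSource, and the token values are illustrative, not part of the antlr4 API):

```typescript
// Minimal sketch of the pending-token queue pattern (all names here are
// illustrative, not part of the antlr4 API). A consumer such as
// CommonTokenStream only ever sees what nextToken() returns, so custom
// tokens must pass through the queue rather than a side array.
interface Tok { type: number; text: string; }

class QueueingSource {
    private pending: Tok[] = []; // tokens waiting to be handed to the stream
    private pos = 0;

    constructor(private input: Tok[]) {}

    // Stand-in for super.nextToken(): next raw token from the input.
    private rawNextToken(): Tok {
        return this.input[this.pos++] ?? { type: -1, text: "<EOF>" };
    }

    // What the token stream calls. Refill the queue, then drain it.
    nextToken(): Tok {
        if (this.pending.length === 0) {
            const t = this.rawNextToken();
            this.pending.push(t);
            if (t.text === "\n") {
                // inject a synthetic token right after the newline
                this.pending.push({ type: 99, text: "INDENT" });
            }
        }
        return this.pending.shift()!;
    }
}
```

This is the same shape RobEin's answer later in this thread uses: a checkNextToken() step fills pendingTokens, and nextToken() drains it, so injected tokens reach the token stream through the normal channel.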


After passing Lexer to CommonTokenStream


Lexer Class in ANTLR4

```typescript
export declare class Lexer extends Recognizer {

    static DEFAULT_MODE: number;

    _input: CharStream;
    _interp: LexerATNSimulator;
    text: string;
    line: number;
    column: number;
    _tokenStartCharIndex: number;
    _tokenStartLine: number;
    _tokenStartColumn: number;
    _type: number;

    constructor(input: CharStream);
    reset(): void;
    nextToken(): Token;
    skip(): void;
    more(): void;
    more(m: number): void;
    pushMode(m: number): void;
    popMode(): number;
    emitToken(token: Token): void;
    emit(): Token;
    emitEOF(): Token;
    getAllTokens(): Token[];
}
```

Logic implemented to push INDENT and DEDENT tokens into the token array (snippet truncated):

```typescript
import { CharStream, Token, CommonToken, Lexer } from "antlr4";
import MyParser from "./MyParser";

export default class MyLexerBase extends Lexer { /**
```

kaby76 commented 5 months ago

> for YAML like language

Which grammar in this repo, grammars-v4, does this bug apply to? We don't have a grammar for YAML.

I recommend that you try the antlr4ng tool and runtime. There is a TypeScript target for ANTLR 4.13.1, but it is unlikely that any fixes will be made to that code. You can then raise a GitHub issue over there if you see a problem with the runtime. You will need to use the antlr4ng-cli tool to generate the updated code for your parser.

SuraiyaBegumK commented 5 months ago

I have my own grammar; I referred to the Python sample in this repo for the indent-handling logic:

```typescript
// Override emit method to customize token emission if necessary
emitToken(token: Token) {
    super.emitToken(token);
    this.tokens.push(token);
}
```

This is where I am emitting my custom INDENT and DEDENT tokens, but they are not getting emitted.

RobEin commented 5 months ago

I don't know if it helps, but I use something like this to insert INDENT and DEDENT tokens for Python lexers with the TypeScript target. Note that it does not use emitToken() at all.

```typescript
import { CharStream, Token, CommonToken, Lexer } from "antlr4";
import PythonLexer from "./PythonLexer";
import * as Collections from "typescript-collections";

export default abstract class PythonLexerBase extends Lexer {
    // A stack that keeps track of the indentation lengths
    private indentLengthStack!: Collections.Stack<number>;
    // A list where tokens are waiting to be loaded into the token stream
    private pendingTokens!: Array<Token>;

    // last pending token types
    private previousPendingTokenType!: number;
    private lastPendingTokenTypeFromDefaultChannel!: number;

    private curToken: CommonToken | undefined; // current (under processing) token
    private ffgToken: Token | undefined;       // following (look ahead) token

    protected constructor(input: CharStream) {
        super(input);
        this.init();
    }

    private init(): void {
        this.indentLengthStack = new Collections.Stack<number>();
        this.pendingTokens = [];
        this.previousPendingTokenType = 0;
        this.lastPendingTokenTypeFromDefaultChannel = 0;
        this.curToken = undefined;
        this.ffgToken = undefined;
    }

    public nextToken(): Token { // reading the input stream until a return EOF
        this.checkNextToken();
        return this.pendingTokens.shift()!; // add the queued token to the token stream
    }

    private checkNextToken(): void {
        if (this.previousPendingTokenType !== PythonLexer.EOF) {
            this.setCurrentAndFollowingTokens();
            if (this.indentLengthStack.isEmpty()) { // We're at the first token
                this.handleStartOfInput();
            }

            switch (this.curToken!.type) {
                case PythonLexer.NEWLINE:
                    this.handleNEWLINEtoken();
                    break;
                case PythonLexer.EOF:
                    this.handleEOFtoken();
                    break;
                default:
                    this.addPendingToken(this.curToken!);
            }
        }
    }

    private setCurrentAndFollowingTokens() {
        this.curToken = this.ffgToken === undefined
            ? this.getCommonTokenByToken(super.nextToken())
            : this.getCommonTokenByToken(this.ffgToken);

        this.ffgToken = this.curToken.type === PythonLexer.EOF
            ? this.curToken
            : this.getCommonTokenByToken(super.nextToken());
    }

    private handleStartOfInput() {
        // initialize the stack with a default 0 indentation length
        this.indentLengthStack.push(0); // this will never be popped off
    }

    private handleNEWLINEtoken() {
        const nlToken = this.curToken!; // save the current NEWLINE token
        const isLookingAhead = this.ffgToken!.type === PythonLexer.WS;
        if (isLookingAhead) {
            this.setCurrentAndFollowingTokens(); // set the next two tokens
        }

        switch (this.ffgToken!.type) {
            case PythonLexer.NEWLINE: // We're before a blank line
            case PythonLexer.COMMENT: // We're before a comment
                this.hideAndAddPendingToken(nlToken);
                if (isLookingAhead) {
                    this.addPendingToken(this.curToken!);  // WS token
                }
                break;
            default:
                this.addPendingToken(nlToken);
                if (isLookingAhead) { // We're on whitespace(s) followed by a statement
                    const indentationLength = this.ffgToken!.type === PythonLexer.EOF ?
                        0 :
                        this.getIndentationLength(this.curToken!.text);

                    this.addPendingToken(this.curToken!); // WS token
                    this.insertIndentOrDedentToken(indentationLength); // may insert INDENT token or DEDENT token(s)
                } else { // We're at a newline followed by a statement (there is no whitespace before the statement)
                    this.insertIndentOrDedentToken(0); // may insert DEDENT token(s)
                }
        }
    }

    private insertIndentOrDedentToken(curIndentLength: number) {
        let prevIndentLength: number = this.indentLengthStack.peek()!;
        if (curIndentLength > prevIndentLength) {
            this.createAndAddPendingToken(PythonLexer.INDENT, Token.DEFAULT_CHANNEL, "INDENT", this.ffgToken!);
            this.indentLengthStack.push(curIndentLength);
        } else {
            while (curIndentLength < prevIndentLength) { // more than 1 DEDENT token may be inserted to the token stream
                this.indentLengthStack.pop();
                prevIndentLength = this.indentLengthStack.peek()!;
                if (curIndentLength <= prevIndentLength) {
                    this.createAndAddPendingToken(PythonLexer.DEDENT, Token.DEFAULT_CHANNEL, "DEDENT", this.ffgToken!);
                } else {
                    // this.reportError("inconsistent dedent");
                }
            }
        }
    }

    private insertTrailingTokens() {
        switch (this.lastPendingTokenTypeFromDefaultChannel) {
            case PythonLexer.NEWLINE:
            case PythonLexer.DEDENT:
                break; // no trailing NEWLINE token is needed
            default:
                // insert an extra trailing NEWLINE token that serves as the end of the last statement
                this.createAndAddPendingToken(PythonLexer.NEWLINE, Token.DEFAULT_CHANNEL, "NEWLINE", this.ffgToken!); // ffgToken is EOF
        }
        this.insertIndentOrDedentToken(0); // Now insert as much trailing DEDENT tokens as needed
    }

    private handleEOFtoken() {
        if (this.lastPendingTokenTypeFromDefaultChannel > 0) {
            // there was a statement in the input (leading NEWLINE tokens are hidden)
            this.insertTrailingTokens();
        }
        this.addPendingToken(this.curToken!);
    }

    private hideAndAddPendingToken(cToken: CommonToken) {
        cToken.channel = Token.HIDDEN_CHANNEL;
        this.addPendingToken(cToken);
    }

    private createAndAddPendingToken(type: number, channel: number, text: string, baseToken: Token) {
        const cToken: CommonToken = this.getCommonTokenByToken(baseToken);
        cToken.type = type;
        cToken.channel = channel;
        cToken.stop = baseToken.start - 1;
        cToken.text = text;
        this.addPendingToken(cToken);
    }

    private addPendingToken(token: Token) {
        // save the last pending token type because the pendingTokens list can be empty by the nextToken()
        this.previousPendingTokenType = token.type;
        if (token.channel === Token.DEFAULT_CHANNEL) {
            this.lastPendingTokenTypeFromDefaultChannel = this.previousPendingTokenType;
        }
        this.pendingTokens.push(token);
    }

    private getCommonTokenByToken(oldToken: Token): CommonToken {
        const cToken = new CommonToken([this, oldToken.getInputStream()], oldToken.type, oldToken.channel, oldToken.start, oldToken.stop);
        cToken.tokenIndex = oldToken.tokenIndex;
        cToken.line = oldToken.line;
        cToken.column = oldToken.column;
        cToken.text = oldToken.text;
        return cToken;
    }

    private getIndentationLength(textWS: string): number {
        const TAB_LENGTH = 8; // the standard number of spaces to replace a tab to spaces
        let length = 0;

        for (let ch of textWS) {
            switch (ch) {
                case " ":
                    length += 1;
                    break;
                case "\t":
                    length += TAB_LENGTH - (length % TAB_LENGTH);
                    break;
            }
        }
        return length;
    }

    public reset() {
        this.init();
        super.reset();
    }
}
```
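As a side note on the tab handling: getIndentationLength() does not simply count a tab as 8 spaces; it rounds the running length up to the next multiple of TAB_LENGTH, so "  \t" and "\t" both measure 8. The arithmetic, extracted into a standalone function for illustration (indentationLength is a hypothetical helper, not part of the class above):

```typescript
// Standalone version of the tab arithmetic in getIndentationLength above:
// a space adds 1, and a tab advances the length to the next multiple of
// tabLength (8 by default, as in the class above).
function indentationLength(textWS: string, tabLength: number = 8): number {
    let length = 0;
    for (const ch of textWS) {
        if (ch === " ") {
            length += 1;
        } else if (ch === "\t") {
            length += tabLength - (length % tabLength);
        }
    }
    return length;
}
```

For example, four spaces measure 4, a lone tab measures 8, two spaces plus a tab still measure 8, and a tab followed by a space measures 9.
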
SuraiyaBegumK commented 5 months ago

Hello @RobEin,

Thank you for sharing the detailed logic.

However, I don't see DEFAULT_CHANNEL/HIDDEN_CHANNEL in Token.d.ts. Could you please help me out here, @RobEin?


Token.d.ts

```typescript
import {CharStream} from "./CharStream";

export declare class Token {

    static EOF: number;

    tokenIndex: number;
    line: number;
    column: number;
    channel: number;
    text: string;
    type: number;
    start: number;
    stop: number;

    clone(): Token;
    cloneWithType(type: number): Token;
    getInputStream(): CharStream;
}
```

RobEin commented 5 months ago

The constants have already been added to Token.d.ts. In principle, they will be included in the next ANTLR release (4.13.2). Until then, feel free to use this instead of the current one.

You can rebuild it with:

```shell
cd .\node_modules\antlr4
npm run build
```

Although a rebuild may not be necessary if you only insert the two constants (DEFAULT_CHANNEL/HIDDEN_CHANNEL).
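For reference, these constants carry the values 0 (DEFAULT_CHANNEL) and 1 (HIDDEN_CHANNEL) in the ANTLR runtimes, and CommonTokenStream only surfaces tokens whose channel matches the one it was constructed with (the default channel unless specified). A self-contained sketch of that filtering behavior (Tok and onChannel are illustrative stand-ins, not the real antlr4 API):

```typescript
// Channel values as used by the ANTLR runtimes.
const DEFAULT_CHANNEL = 0;
const HIDDEN_CHANNEL = 1;

interface Tok { text: string; channel: number; }

// CommonTokenStream-style view: only tokens on the requested channel are
// visible to the parser; hidden-channel tokens (whitespace, comments) are not.
function onChannel(tokens: Tok[], channel: number = DEFAULT_CHANNEL): Tok[] {
    return tokens.filter(t => t.channel === channel);
}
```

This is why hideAndAddPendingToken() in the lexer base class above sets channel = Token.HIDDEN_CHANNEL: the token stays in the stream but the parser never sees it.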

SuraiyaBegumK commented 5 months ago


I have added the constants for DEFAULT_CHANNEL/HIDDEN_CHANNEL, and it worked. Thank you, @RobEin!