GregRos / parjs

JavaScript parser-combinator library
MIT License
273 stars 18 forks source link

Help: `or()` is not working #109

Closed sheey11 closed 6 months ago

sheey11 commented 6 months ago

I am new to ParserCombinators and I am trying to write a parser for a LaTeX-like langugae, examples:

My code so far:

Constants and type definitions ```typescript import { anyChar, anyCharOf, anyStringOf, digit, eof, float, noCharOf, regexp, string, whitespace } from 'parjs' import { between, later, many, manySepBy, map, maybe, mustCapture, or, qthen, stringify, then, thenq } from 'parjs/combinators' enum FormulaType { Number = "number", Symbol = "symbol", Operator = "operator", Matrix = "matrix", Fraction = "fraction", List = "list", } interface FormulaComponentCommon { type: FormulaType } interface NumberComponent extends FormulaComponentCommon { type: FormulaType.Number, num: number, padding?: boolean, } interface SymbolComponent extends FormulaComponentCommon { type: FormulaType.Symbol, text?: string, superscript?: string, subscript?: string, roman?: boolean, } interface OperatorComponent extends FormulaComponentCommon { type: FormulaType.Operator, text: string, } interface MatrixComponent extends FormulaComponentCommon { type: FormulaType.Matrix, h: number, w: number, elements: FormulaExpression[][] } interface ListComponent extends FormulaComponentCommon { type: FormulaType.List, elements: FormulaExpression[] } interface FractionComponent extends FormulaComponentCommon { type: FormulaType.Fraction, upper: FormulaExpression, lower: FormulaExpression, } type FormulaComponent = NumberComponent | SymbolComponent | OperatorComponent | MatrixComponent | ListComponent | FractionComponent type FormulaExpression = FormulaComponent[] const functions = [ "matrix", "frac", "roman", ] const escapes = [ "\\", "{", "}", "(", ")", "[", "]", "_", "^", ] const mathSymbols = [ "alpha", "beta", "gamma", "epsilon", "ne", "ge", "le", "Delta", "partial", "int", ] ```
const pExpression = later()
const pSymbol = later() // y_1, w_{11}, x_1^2

const pOneCharToken = noCharOf("~!@#$%^&*()_+-`[]\\{}|;':\",./<>? \t1234567890")
const pEscape = string('\\').pipe(qthen(anyStringOf(...escapes)))
const pNumber = regexp(/\d+(?:\.\d+)?/).pipe(map(([n]) => ({ type: FormulaType.Number, num: +n })))

const pSimpleSymbol = pOneCharToken.pipe(
  or(pEscape),
  mustCapture(), // must not start with a number
  or(digit()),
  many(),
  stringify(),
  map(symbol => ({ type: FormulaType.Symbol, text: symbol }))
) // symbol has no subscript or superscript

const pBracketedExpression = pExpression.pipe(between("{", "}"))
const pSubscriptExpression = string("_").pipe(then(pBracketedExpression.pipe(or(pOneCharToken, pEscape, digit()))))
const pSuperscriptExpression = string("^").pipe(then(pBracketedExpression.pipe(or(pOneCharToken, pEscape, digit()))))

const pSymbolSuperSubScript = pSimpleSymbol.pipe(
  then(pSuperscriptExpression, pSubscriptExpression),
  map(([symbol, [_1, superExpr], [_2, subExpr]]) => ({ ...symbol, subscript: subExpr, superscript: superExpr }))
)

const pSymbolSubSuperScript = pSimpleSymbol.pipe(
  then(pSubscriptExpression, pSuperscriptExpression),
  map(([symbol, [_1, subExpr], [_2, superExpr]]) => ({ ...symbol, subscript: subExpr, superscript: superExpr }))
)

const pSymbolSubscript = pSimpleSymbol.pipe(
  then(pSubscriptExpression),
  map(([symbol, [_, expr]]) => ({ ...symbol, subscript: expr }))
)

const pSymbolSuperscript = pSimpleSymbol.pipe(
  then(pSuperscriptExpression),
  map(([symbol, [_, expr]]) => ({ ...symbol, superscript: expr }))
)

pSymbol.init(pSymbolSubSuperScript.pipe(
  or(
    pSymbolSubscript,
    pSymbolSuperSubScript,
    pSymbolSuperscript,
  )
))

const pOperator = anyCharOf("~!@#$%&*+`|;':\",./<>?")
const pListContent = pExpression.pipe(manySepBy(','))
const pList = pListContent.pipe(between("[", "]"))
const pFunctionArgs = pListContent.pipe(between("(", ")"))
const pFunctionCall = anyStringOf(...functions).pipe(then(whitespace(), pFunctionArgs))

const pAnyExpr = pSymbol.pipe(or(pOperator, pFunctionCall, pList, pNumber), between(whitespace()))
pExpression.init(pAnyExpr.pipe(manySepBy(whitespace())))

console.dir(pSymbolSuperSubScript.parse("weight^2_1"), { depth: null }) // this returns OK
console.dir(pSymbol.parse("weight^2_1"), { depth: null }) // but not this one

as defined, pSymbol = pSymbolSuperSubScriptif.pipe(or(...)), if pSymbolSuperSubScript parsers a string successfully then pSymbol should also.

mikavilpas commented 6 months ago

Hi, I'm not sure I understand all of that but it might be caused by a common issue where you want to apply one of many parsers and just accept the result of the first one that works.

It might be caused by this code block:

pSymbol.init(pSymbolSubSuperScript.pipe(
  or(
    pSymbolSubscript,
    pSymbolSuperSubScript,
    pSymbolSuperscript,
  )
))

Here, it's typical that a parser in the middle would succeed, but a previous parser has already consumed input and uses e.g. the then combinator. This has two effects:

You might understand what is happening by temporarily using .debug() on the different parsers and re-running your parsing test (https://github.com/GregRos/parjs?tab=readme-ov-file#debugging).

What I think you want to do is recover in this way:

    it.only("can be used with backtracking", () => {
        // Create a silly example for a shopping list with two possible sections.
        // Notice that both start by parsing the same string. This is to show that
        // the `recover` combinator can be used to backtrack to the beginning of the section
        const fruitSection = string("Remember to buy ").pipe(
            then(string("apples").pipe(or(string("bananas"))))
        );
        const vegetablesSection = string("Remember to buy ").pipe(
            then(string("carrots").pipe(or(string("potatoes"))))
        );

        const shoppingList = fruitSection
            .pipe(
                recover(() => ({ kind: "Soft" })),
                or(vegetablesSection)
            )
            .debug();

        expect(shoppingList.parse("Remember to buy carrots")).toBeSuccessful([
            "Remember to buy ",
            "carrots"
        ]);
        expect(shoppingList.parse("Remember to buy potatoes")).toBeSuccessful([
            "Remember to buy ",
            "potatoes"
        ]);
    });
sheey11 commented 6 months ago

Sorry for late reply, I now understand why error occors and completed my entire parser, many thanks!

But I encountered another error while testing pExpression:

ParserDefinitionError: manySepBy: The combinator 'manySepBy' expected one of its arguments to change the parser state.

I think it's caused by

pExpression.init(
  pAnyExpr.pipe(
    manySepBy(whitespace())
  )
)

If I add maxIterations argrument to manySepBy, it works very well, but there shouldn't be a number limitation.

Even if I pipe thenq(eof()) then parse entire formula, this error still occur.

Full Code ```typescript import { Parjser, anyCharOf, anyStringOf, digit, eof, noCharOf, regexp, string, whitespace } from 'parjs' import { DelayedParjser, between, later, many, manySepBy, map, must, mustCapture, or, qthen, recover, stringify, then, thenq } from 'parjs/combinators' // matrix( [y_1], [y_2] ) = matrix( [w_{11}, w_{21}], [ w_{12}, w_{22} ] ) matrix( [x_1], [x_2] ) // frac(roman(d) x, roman(d) y) enum FormulaType { Number = "number", Symbol = "symbol", Operator = "operator", Matrix = "matrix", Fraction = "fraction", List = "list", } interface FormulaComponentCommon { type: FormulaType } interface NumberComponent extends FormulaComponentCommon { type: FormulaType.Number, num: number, padding?: boolean, } interface SymbolComponent extends FormulaComponentCommon { type: FormulaType.Symbol, text?: string, superscript?: FormulaExpression, subscript?: FormulaExpression, roman?: boolean, } interface OperatorComponent extends FormulaComponentCommon { type: FormulaType.Operator, text: string, } interface MatrixComponent extends FormulaComponentCommon { type: FormulaType.Matrix, h: number, w: number, elements: FormulaExpression[][] } interface ListComponent extends FormulaComponentCommon { type: FormulaType.List, elements: FormulaExpression[] } interface FractionComponent extends FormulaComponentCommon { type: FormulaType.Fraction, upper: FormulaExpression, lower: FormulaExpression, } type FunctionComponent = MatrixComponent | FractionComponent type FormulaComponent = NumberComponent | SymbolComponent | OperatorComponent | MatrixComponent | ListComponent | FractionComponent type FormulaExpression = FormulaComponent[] const functions = [ "matrix", "frac", ] as const type FunctionTypes = typeof functions[number] const escapes = [ "\\", "{", "}", "(", ")", "[", "]", "_", "^", ] const mathSymbols = [ "alpha", "beta", "gamma", "epsilon", "ne", "ge", "le", "Delta", "partial", "int", ] const pExpression: DelayedParjser = later() const pSymbol: DelayedParjser = later() // y_1, w_{11}, x_1^2 const pValidOneChar = noCharOf("~!@#$%^&*()_+-`[]\\{}|;':\",./<>? \t1234567890") const pEscape = string('\\').pipe(qthen(anyStringOf(...escapes))) const pOneCharToken: Parjser = noCharOf(" \t_^{").pipe( or(pEscape, digit()), map((c) => ({ type: FormulaType.Symbol, text: c, })) ) const pNumber: Parjser = regexp(/\d+(?:\.\d+)?/).pipe(map(([n]) => ({ type: FormulaType.Number, num: +n }))) const pSimpleSymbol: Parjser = pValidOneChar.pipe( or(pEscape), mustCapture(), or(digit()), many(), stringify(), map(symbol => ({ type: FormulaType.Symbol, text: symbol })) ) // symbol has no subscript or superscript const pBracketedExpression = pExpression.pipe(between("{", "}")) const pSubscriptExpression = string("_").pipe(then(pBracketedExpression.pipe(or(pOneCharToken)))) const pSuperscriptExpression = string("^").pipe(then(pBracketedExpression.pipe(or(pOneCharToken)))) pSymbol.init(pSimpleSymbol.pipe( then( pSubscriptExpression.pipe( or(pSuperscriptExpression), many(2), ) ), must(([_, scripts]: [SymbolComponent, ['_' | '^', FormulaExpression | SymbolComponent][]]) => { if (scripts.length === 2 && scripts[0][0] === scripts[1][0]) return { type: "Hard", reason: `multipe ${scripts[0][0] === '_' ? 'sub' : 'super'}scripts`, } return true }), map(([symbol, scripts]) => { for (const [scriptType, script] of scripts) { switch (scriptType) { case "_": symbol.subscript = script instanceof Array ? script : [script] break case "^": symbol.superscript = script instanceof Array ? script : [script] break } } return symbol }) )) const pOperator: Parjser = anyCharOf("~!@#$%&*+`|;':\",./<>?").pipe( map((c) => ({ type: FormulaType.Operator, text: c, })) ) const pListContent: Parjser = pExpression.pipe(manySepBy(',')) const pList: Parjser = pListContent.pipe( between("[", "]"), map((args) => ({ type: FormulaType.List, elements: args, })), ) const pFunctionArgs: Parjser = pListContent.pipe(between("(", ")")) const pFunctionCall: Parjser = anyStringOf(...functions).pipe( then(whitespace(), pFunctionArgs), must(([fnName, _, args]: [FunctionTypes, string, FormulaExpression[]]) => { if (fnName === "frac" && args.length !== 2) return { type: "Hard", reason: "frac() with not enough argruments" } if (fnName === "matrix" && !args.every(row => row.length === 1 && row[0].type === FormulaType.List)) return { type: "Hard", reason: "matrix() cannot have argruments other than list" } return true }), map(([fnName, _, args]) => { switch (fnName) { case "matrix": return { type: FormulaType.Matrix, elements: args.map(expr => (expr[0] as ListComponent).elements), h: args.length, w: args.reduce((w, curr) => Math.max(w, curr.length), 0) } as MatrixComponent case "frac": return { type: FormulaType.Fraction, upper: args[0], lower: args[1] } as FractionComponent } }), ) const pAnyExpr: Parjser = pOperator.pipe( or( pSymbol.pipe(recover(() => ({ kind: "Soft" }))), pFunctionCall, pList, pNumber, ), ) pExpression.init( pAnyExpr.pipe( manySepBy(whitespace()), ), ) const pFormula = pExpression.pipe(thenq(eof())) // tests // console.dir(pNumber.parse("1213"), { depth: null }) // console.dir(pSimpleSymbol.parse("good123\\^"), { depth: null }) // console.dir(pSymbol.parse("weight_1^2"), { depth: null }) // console.dir(pSymbol.parse("weight^2_1"), { depth: null }) // console.dir(pSymbol.parse("weight"), { depth: null }) // console.dir(pSymbol.parse("weight_1_1"), { depth: null }) // expects error // console.dir(pSymbol.parse("weight_1"), { depth: null }) // console.dir(pSymbol.parse("weight^1"), { depth: null }) // console.dir(pSymbol.parse("xyz_{1}^2"), { depth: null }) // console.dir(pList.parse("[y_1 w_1, y_2]"), { depth: null }) // console.dir(pFunctionCall.parse("matrix([y_1, y_2]))"), { depth: null }) // console.dir(pExpression.parse("y_1 y_2"), { depth: null }) console.dir(pFormula.parse("y_1 y_2"), { depth: null }) ```
mikavilpas commented 6 months ago

That usually occurs when a parser succeeds in parsing the empty string (""). Looping over that will never finish, so the combinator simply crashes instead of going on forever.

I think in your case this can be solved with this:

const pSimpleSymbol: Parjser<SymbolComponent> = pValidOneChar
    .pipe(
        or(pEscape),
        mustCapture(),
        or(digit()),
        many1(), // changed to many1() from many()
        stringify(),
        map(symbol => ({ type: FormulaType.Symbol, text: symbol }))
    )
    .debug(); // symbol has no subscript or superscript

I was able to track that down with the following steps:

Here are some tests that I generated when debugging this (I generated these with copilot and only resolved the first issue I saw):

Details

```ts // tests // console.dir(pNumber.parse("1213"), { depth: null }) // console.dir(pSimpleSymbol.parse("good123\\^"), { depth: null }) // console.dir(pSymbol.parse("weight_1^2"), { depth: null }) // console.dir(pSymbol.parse("weight^2_1"), { depth: null }) // console.dir(pSymbol.parse("weight"), { depth: null }) // console.dir(pSymbol.parse("weight_1_1"), { depth: null }) // expects error // console.dir(pSymbol.parse("weight_1"), { depth: null }) // console.dir(pSymbol.parse("weight^1"), { depth: null }) // console.dir(pSymbol.parse("xyz_{1}^2"), { depth: null }) // console.dir(pList.parse("[y_1 w_1, y_2]"), { depth: null }) // console.dir(pFunctionCall.parse("matrix([y_1, y_2]))"), { depth: null }) // console.dir(pExpression.parse("y_1 y_2"), { depth: null }) // console.dir(pFormula.parse("y_1 y_2"), { depth: null }); describe("Formula Parser", () => { it("should parse a number", () => { expect(pNumber.parse("1213")).toBeSuccessful({ type: FormulaType.Number, num: 1213 }); }); it("should parse a simple symbol", () => { expect(pSimpleSymbol.parse("good123\\^")).toBeSuccessful({ type: FormulaType.Symbol, text: "good123^" }); }); it("should parse a symbol with subscript and superscript", () => { expect(pSymbol.parse("weight_1^2")).toBeSuccessful({ type: FormulaType.Symbol, text: "weight", subscript: [{ type: FormulaType.Symbol, text: "1" }], superscript: [{ type: FormulaType.Symbol, text: "2" }] }); }); it("should parse a symbol with superscript and subscript", () => { expect(pSymbol.parse("weight^2_1")).toBeSuccessful({ type: FormulaType.Symbol, text: "weight", subscript: [{ type: FormulaType.Symbol, text: "1" }], superscript: [{ type: FormulaType.Symbol, text: "2" }] }); }); it("should parse a symbol without subscript or superscript", () => { expect(pSymbol.parse("weight")).toBeSuccessful({ type: FormulaType.Symbol, text: "weight" }); }); it("should not parse a symbol with multiple subscripts", () => { expect(pSymbol.parse("weight_1_1")).toBeFailure(); }); it("should parse a symbol with subscript", () => { expect(pSymbol.parse("weight_1")).toBeSuccessful({ type: FormulaType.Symbol, text: "weight", subscript: [{ type: FormulaType.Symbol, text: "1" }] }); }); it("should parse a symbol with superscript", () => { expect(pSymbol.parse("weight^1")).toBeSuccessful({ type: FormulaType.Symbol, text: "weight", superscript: [{ type: FormulaType.Symbol, text: "1" }] }); }); // eslint-disable-next-line jest/no-identical-title it("should parse a symbol with subscript and superscript", () => { expect(pSymbol.parse("xyz_{1}^2")).toBeSuccessful(); }); }); // NOTE: all test cases are not included ```

sheey11 commented 6 months ago

I think many1 is not exported, I can't import from parjs/combinators:

https://github.com/GregRos/parjs/blob/a65525a215f3ccd0255aab6247bf7d24d9ee98f9/src/lib/combinators.ts#L1-L28

Should I fire a pr to add many1?

110

mikavilpas commented 6 months ago

Version 1.2.3 is now up on npm, can you test with that? It should have exported many1.

sheey11 commented 6 months ago

With 1.2.3 my parser works fine now, I really appreciate your help, thanks!