GregRos / parjs

JavaScript parser-combinator library
MIT License
272 stars 18 forks source link

feature: Case insensitive string parsing #91

Open Pamplemousse opened 9 months ago

Pamplemousse commented 9 months ago

I encountered a use case where I would need a parser that could consume a string, but insensitive to the case of the letters it contains.

For example, I would like a parser that matches "foo", "Foo", "fOo", etc.

import { string } from "parjs";

const p = string("foo");

p.parse("foo");
// ParjsSuccess { value: 'foo', kind: 'OK' }

p.parse("Foo");
// ParjsFailure {
//   trace: {
//     userState: ParserUserState {},
//     position: 0,
//     reason: "expecting 'foo'",
//     input: 'Foo',
//     location: [Getter],
//     stackTrace: [Array],
//     kind: 'Soft'
//   }
// }

Related:

mikavilpas commented 8 months ago

I think this is a good idea 👍🏻 I will try to find some time to implement it.

If you would like to give it a shot, let me know. It could be a nice task.

barona-mika-vilpas commented 8 months ago

Looking at the reference implementation in the first link you provided:


-- |Case-insensitive variant of Parsec's 'char' function.

caseChar        :: Char -> CharParser st Char
caseChar c       = satisfy (\x -> toUpper x == toUpper c)

-- |Case-insensitive variant of Parsec's 'string' function.

caseString      :: String -> CharParser st ()
caseString cs    = mapM_ caseChar cs <?> cs

It looks like the parser checks each char.toUpperCase() matches, one by one (probably? not sure but I think haskell usually evaluates lazily), or fails with the error message for the given string.

We have uniUpper and upper parsers, it would probably make sense to have a case insensitive version of each.

Pamplemousse commented 8 months ago

I tried to use different existing parsers and combinators, but never got something that worked to my liking. I followed a similar approach to the one you described, but the closest thing I got had funny results when used with the or combinator (from memory as I don't have access to the laptop I tried on right now):

const p1 = caseString("foo");
const p2 = caseString("faa");
const p = p1.pipe(or(p2));

p.parse("FOO"); # worked
p2.parse("FAA"); # worked
p.parse("FAA"); # failed with "unexpected `a`"

I switched to implement it as a parser in this repo, as a close copy of the current string implementation. Currently going through the process with my employer to be able to contribute it here...

GregRos commented 8 months ago

Totally agree, this is required.

We do have an anyChar parser that matches any character. It doesn't need a unicode version. uniUpper is needed because of the uppercase restriction. Checking if a unicode character is upper case requires a lookup in a data structure (interval tree).

A case-insensitive version of string would need to be written custom as @Pamplemousse said because after applying a parser once, most combinators go into hard fail mode and if the next application fails unexpectedly then it will be a hard failure.

I would love for you to contribute it :smile:.

A more general approach

While I think the case-instensitive string parser is needed, I also have another idea. Something interesting that hasn't been done before.

Suppose we require all parsers to implement an operator called caseInsensitive which returns a case insensitive version of themselves. Combinators would implement it by delegating to the child parsers, while building block parsers would return a different parser.

For example, uniUpper.caseInsensitive would be anyChar. string would give the case-insensitive version @Pamplemousse talked about.

The benefit is that you don't have to worry about case sensitivity when building the parser. You could build it however you want and then apply caseInsensitive and get a case insensitive version.

For some combinators this might not work as expected. For example, there is no way to convert must to a case-insensitive version. However, that's a parser that applies a predicate on the output, so it is different.

The question

The question is whether case insensitivity is general enough to warrant people who write parsers to implement the operation.