GregRos / parjs

JavaScript parser-combinator library
MIT License

Proposal - parser and combinator modifiers #101

Open GregRos opened 5 months ago

GregRos commented 5 months ago

I’ve been thinking about the solution I proposed to @Pamplemousse’s #91 and I think it can be taken further. I'd love some feedback

There are lots of combinators that can be considered modified versions of other combinators. For example, the many combinator has variants such as manySepBy and manyTill.

Each of those often has multiple optional parameters for tweaking its functionality. This design is taken from traditional parser combinator libraries like Parsec, of course, but it does have downsides.

A good example of the problem was #35, where the user wanted to capture the separators in manySepBy. That can't be the default, since lots of people probably don't want it. So the choice was between adding an optional parameter and adding another combinator, both of which have downsides.

What if instead of doing that, we have a single many combinator that supports lots of different modifiers to mix and match different additional functionality?

How it would work

Let’s take the many combinator as an example. First we’d have the basic many combinator. This combinator could be modified with features such as separators, minimum/maximum matches, and so on. This would be done as instance methods defined on individual combinators.

For example, many could have methods such as till, min, max, and sep.

So a full instantiation of the modified combinator could look like this:

    const manyTillnewline1to3 = many().till(newline()).min(1).max(3).sep(",")

Some modifiers could be shared among all combinators and parsers, like caseSense, which could enable or disable case sensitivity in the whole parser (#91).
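As a rough sketch of how modifier methods like these might be implemented (an illustration only, not the real parjs API, and omitting till for brevity): each modifier returns a new combinator carrying updated settings, so chains stay immutable and reusable.

```typescript
// Hypothetical sketch: modifiers as immutable builder methods.
// Neither ManyCombinator nor `settings` exists in parjs today.
interface ManyConfig {
  min: number;
  max: number;
  sep?: string;
}

class ManyCombinator {
  constructor(
    private readonly config: ManyConfig = { min: 0, max: Infinity }
  ) {}

  // Each modifier copies the config, so earlier chains are never mutated.
  min(n: number): ManyCombinator {
    return new ManyCombinator({ ...this.config, min: n });
  }
  max(n: number): ManyCombinator {
    return new ManyCombinator({ ...this.config, max: n });
  }
  sep(s: string): ManyCombinator {
    return new ManyCombinator({ ...this.config, sep: s });
  }
  get settings(): ManyConfig {
    return this.config;
  }
}

function many(): ManyCombinator {
  return new ManyCombinator();
}

const m = many().min(1).max(3).sep(",");
console.log(m.settings); // { min: 1, max: 3, sep: "," }
```

Because every modifier returns a fresh combinator, a partially configured chain like `many().sep(",")` could be stored and extended in different directions without the branches interfering.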

You would still apply combinators using pipe to actual parsers. For example,

string("abc").pipe(
    manyTillnewline1to3, 
    then(string("abc"))
)

Advantages

The main advantage is the branching API. It groups extra features around core functionality instead of having a giant list of combinators.

Disadvantages

  1. Departs from the interface of libraries like Parsec.
  2. Unusual. I haven’t seen any combinator-based library do this.
  3. It will involve deprecating existing parsers to maintain uniformity, or something along those lines. Maybe a separate import.

Method of rollout

If people like the idea, I’d like to do a gradual rollout. I think replacing the whole interface in one major version change is too much; it should be spread over 1 – 3 version upgrades.

Feedback

What do you think?

mikavilpas commented 5 months ago

I think that is a good idea. It could simplify these tricky cases under a single "right way" to do things.

In general I think instance methods are also a good programmer experience. If you can stumble on a place to start, the available fields and methods are easy to find.

As for being unusual, maybe that is true. Usually I think parser combinator libraries emphasize functions over classes. On the other hand, functions are still exported in modules and it pretty much looks similar if you squint a bit.

On scope

Are there other combinators that could be built in steps like this? The manyXXX "family" of combinators seems to fall under this quite nicely.

I looked at the combinators that are available now, and didn't notice any that would fit as well right now. However, you mentioned string casing, then there is also #98 that could have a similar interface (although I haven't read that with a lot of thought yet). I'm thinking this could possibly extend to creating parsers as well as combinators.

Ideally I think the best would be to have a similar experience for all of these in some way.

Tree shaking support

This is a bit of an open question. I'm not sure what the state of tree shaking is these days with new bundlers having popped up since the last time I looked at this a couple of years ago.

Last I checked, instance methods could not be eliminated by tree shaking, but this might very well have changed.

In fact, it might be a good idea for me to delve into this subject a bit - maybe the entire project could use some visibility on the status of this 🤔

GregRos commented 5 months ago

Another note (@sp3ctum lol, we posted at the same time): a lot of combinators, such as manyTill and manySepBy, can have their return value tweaked.

What if each parser returned several named values, and the thing you get when you call parse is just the return value with the name value? However, you could access the rest of the returns using a returns combinator.

This combinator would let you remap value using the other things parsers captured that aren’t the default return.

Like this:

    const manyWithSeps = many().sep(",").returns(
        dict => [...dict.value, ...dict.separators]
    )

This enables a lot of versatility without complicating the interface. The extra captured values are there if you need them.
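A toy standalone version of the named-captures idea (the names `separators` and `returns` are hypothetical, and the "parser" here is just a string splitter standing in for many().sep(",")):

```typescript
// Sketch only: a parse result that captures both the default value
// and extra named captures, plus a `returns` remapper over all of them.
interface Captures<T> {
  value: T[];
  separators: string[];
}

// Toy stand-in for many().sep(","): records the matched items
// and the separators between them.
function parseManySep(input: string, sep: string): Captures<string> {
  const value = input.split(sep);
  const separators = new Array(Math.max(value.length - 1, 0)).fill(sep);
  return { value, separators };
}

// `returns` remaps the default result using everything captured.
function returns<T, R>(caps: Captures<T>, f: (c: Captures<T>) => R): R {
  return f(caps);
}

const caps = parseManySep("a,b,c", ",");
const merged = returns(caps, d => [...d.value, ...d.separators]);
console.log(merged); // ["a", "b", "c", ",", ","]
```

A caller that doesn't need the separators just reads `value` and never pays attention to the rest, which is the point: the extra captures sit there unused unless remapped.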

GregRos commented 5 months ago

@sp3ctum I'm glad you like the idea!

I do agree about instance methods, they really are nice to have and make exploring a library’s API much easier.

Looking at tree shaking sounds like a promising avenue for research. Honestly, I’m not even sure the library tree-shakes properly. I tried to make it work and I'm pretty sure I tested it ages ago, but like you said, bundlers have changed since then.

About creating parsers – you’re right! The number parsers could be created this way, for example, instead of having to pass an object. Or maybe it’s just another option.

You could have one char parser and one charCode parser and just tweak them to filter the character you want to read.
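A rough illustration of that idea (the `filter` method and CharParser class are hypothetical, not current parjs): a single char parser narrowed with predicates instead of a family of anyCharOf-style parsers.

```typescript
// Sketch only: one configurable char parser, narrowed via predicates.
class CharParser {
  constructor(
    private readonly pred: (c: string) => boolean = () => true
  ) {}

  // Each filter composes with the existing predicate.
  filter(pred: (c: string) => boolean): CharParser {
    return new CharParser(c => this.pred(c) && pred(c));
  }

  // Read a single character from the start of the input, or fail.
  parse(input: string): string | null {
    const c = input[0];
    return c !== undefined && this.pred(c) ? c : null;
  }
}

const digit = new CharParser().filter(c => c >= "0" && c <= "9");
console.log(digit.parse("7a")); // "7"
console.log(digit.parse("x"));  // null
```

Since filters compose, something like `digit.filter(c => c !== "0")` would narrow further without needing a new parser definition.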

Another idea is to be able to “extend” existing combinators or parsers with “more of the same”. For example:

    const x_or_y = or("x", "y")
    const or_z = x_or_y.or("z") // Parses "x" | "y" | "z"

    const x_then_y = then("x", "y")
    const xyz = x_then_y.then("z") // Returns ["x", "y", "z"]

I think it’s especially convenient with then, since you can build a sequence incrementally. The current approach, using flatten, has always felt wrong to me.

One problem with wrapping stuff with functions all the time is the overhead. In Haskell you have crazy optimizations that inline function values, unroll recursion through magic, and get rid of allocations. But in most languages, a call to a function value that isn't a full-on class method can never be inlined.

So something like several nestings of then will always have performance overhead, since each nesting corresponds to a function that must be called. But this way users don't have to nest thens and can instead build them up.
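The performance point above can be sketched in miniature (the `Seq` class is hypothetical, not parjs): an incremental then can keep one flat list of parts internally, so matching is a single loop rather than a chain of nested wrapper calls.

```typescript
// Sketch only: incremental `then` over a flat internal list.
class Seq {
  constructor(private readonly parts: string[]) {}

  // Appending extends the flat list; no extra nesting is introduced.
  then(part: string): Seq {
    return new Seq([...this.parts, part]);
  }

  // Match all parts in order against the start of the input.
  parse(input: string): string[] | null {
    const results: string[] = [];
    let rest = input;
    for (const p of this.parts) {
      if (!rest.startsWith(p)) return null;
      results.push(p);
      rest = rest.slice(p.length);
    }
    return results;
  }
}

const seq = new Seq(["x"]).then("y").then("z");
console.log(seq.parse("xyz")); // ["x", "y", "z"]
console.log(seq.parse("xy"));  // null
```

Compare that with nesting: `then("x", then("y", "z"))` would run one wrapper function per level at parse time, while the flat list keeps the cost of each appended step to a single array entry.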