cucumber / cucumber-expressions

Human friendly alternative to Regular Expressions
MIT License

Feature: named capture groups #206

Open ciaranmcnulty opened 1 year ago

ciaranmcnulty commented 1 year ago

I've been looking at how cucumber-expressions could be adopted in Behat.

One feature of our existing two pattern syntaxes (regex and turnip) is that they can name the arguments.

e.g.

@When /^I eat (?<count>[0-9]+) (?<fruit>.*)$/
@When I eat :count :fruit

This allows Behat to match arguments based on name as well as position, letting users transpose the argument order if necessary.

That can be useful if multiple patterns are attached to one step, something allowed in Behat:

/**
 * @When I eat :count :fruit
 * @When :count of the :fruit are eaten
 * @When I don't eat any :fruit
 */
public function myStepDef(string $fruit, int $count=0): void

The new feature would be for Cucumber Expressions to capture argument names:

  1. Define a syntax for adding names to expressions (e.g. {fruit:string})
  2. Retain that name in the generated regex as a named group
  3. Attach the name to the matched Argument value objects that the parser outputs (with an accessor method)

Then it would be up to the Cucumber implementation to either use the names for matching to the step definition, or to ignore them and use the order directly as currently happens.
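
A minimal sketch of the three steps above, written in Python purely for illustration. The {type:label} syntax, the PARAMETER_TYPES table, and the Argument class are all assumptions made for this example, not the library's actual API:

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical parameter types and their regex fragments (illustrative only,
# not the library's real parameter type registry).
PARAMETER_TYPES = {
    "int": (r"-?\d+", int),
    "word": (r"[^\s]+", str),
    "": (r".*", str),  # anonymous parameter, i.e. {} or {:label}
}

# Step 1: an assumed syntax covering {}, {type}, {:label} and {type:label}.
PARAMETER = re.compile(r"\{(?P<type>\w*)(?::(?P<label>\w+))?\}")


@dataclass
class Argument:
    value: object
    name: Optional[str]  # None when the parameter carries no label


def compile_expression(expression: str):
    """Translate a labelled expression into a regex plus per-parameter metadata."""
    pattern, params, last = "", [], 0
    for m in PARAMETER.finditer(expression):
        pattern += re.escape(expression[last:m.start()])
        fragment, transform = PARAMETER_TYPES[m.group("type")]
        label = m.group("label")
        # Step 2: retain the label in the generated regex as a named group.
        pattern += f"(?P<{label}>{fragment})" if label else f"({fragment})"
        params.append((label, transform))
        last = m.end()
    pattern += re.escape(expression[last:])
    return re.compile(f"^{pattern}$"), params


def match(expression: str, text: str):
    """Return a list of Argument objects, or None if the text does not match."""
    regex, params = compile_expression(expression)
    m = regex.match(text)
    if m is None:
        return None
    # Step 3: attach the label to each Argument produced by the match.
    return [
        Argument(transform(m.group(i)), label)
        for i, (label, transform) in enumerate(params, start=1)
    ]
```

Under these assumptions, match("I eat {int:count} {word:fruit}", "I eat 3 apples") yields two Argument objects named count and fruit, and a consumer that ignores the names can still read them in order.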

mpkorstanje commented 1 year ago

If we decide to extend Cucumber expressions, whichever syntax we pick, we should take some care to ensure that we don't collide with existing usage. For this example, people may already use : in their cucumber expressions.

I can also imagine that Behat users would like to continue to use their turnip expressions. How would we facilitate this?

The one thing we can probably do without breaking things is make the TreeRegex capture group name aware.

ciaranmcnulty commented 1 year ago

I can also imagine that Behat users would like to continue to use their turnip expressions. How would we facilitate this?

This part's fine, they'd just import a different attribute (~= annotation in Java)

I think we can restrict it to attributes rather than the older comment-style annotations I showed above:

use Behat\Given as OldGiven;
use Cucumber\Given;

...

#[Given('I eat {string}')]
#[OldGiven('I eat :fruit')]

ciaranmcnulty commented 1 year ago

If we decide to extend Cucumber expressions, whichever syntax we pick, we should take some care to ensure that we don't collide with existing usage

Agreed. The grammar doesn't appear to restrict which characters are allowed at all, so it may be tricky to find a syntax that doesn't collide.

ciaranmcnulty commented 1 year ago

Thinking about it, the grammar doesn't restrict what chars are allowed but I bet if we fuzz the existing parser implementations a little we can find some chars that aren't currently allowed :)

My pref would be : because it's like turnip, and because PHP is a dynamic language we often don't need to cast stuff.

I'd quite like something along the lines of:

{type}:{label}   // does this collide with real-life usage?
{type}           // to support current usage, omitting the colon is allowed if there is no label
:{label}         // label-only so it stays as a string 
{}:{label}       // if the above is too BC-breaking we could allow empty type == string

Even if we break backwards compatibility we could do that in an Expressions major (depending on the impact that has across different package manager ecosystems)

mpkorstanje commented 1 year ago

I would prefer to keep all the syntax inside the {}. That way we would only limit the names of the parameters and not impact other parts of expressions. So:

{} // Anonymous, as is
{type}  // As is
{:label}  // Anonymous (.*) with a label
{type:label} // type and label

I don't really see a way to make this graceful, but if we release this behind a feature toggle, we can also smooth out other migration problems. For example, I'd like to use this in Cucumber JVM but not enable it until the next major version of Cucumber-JVM. Otherwise all future patches get stuck behind this breaking change.

ciaranmcnulty commented 1 year ago

I'd like to use this in Cucumber JVM

Nice :)

FYI, the way we do it in Behat is to match on name first and then on position, but there are some awkward edge cases:

Given :X :Y :Z
function ($X, $Y, $Z) // as expected, matched on name
function ($a, $b, $c) // as expected, matched on position
function ($a, $X, $b) // would receive $a=:Y $X=:X $b=:Z which can be counterintuitive with typos

(we also populate some arguments from a DI container)
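
A rough sketch of that name-first, then-position binding, in Python for illustration. This is a simplified assumption of the behaviour described above, not Behat's actual implementation, and bind_arguments is a hypothetical helper:

```python
def bind_arguments(parameter_names, captured):
    """Bind captured values to function parameters: by name first, then by position.

    parameter_names: the function's parameter names, in declaration order.
    captured: (label, value) pairs in the order they appear in the step pattern.
    """
    bound = {}
    leftovers = []
    for label, value in captured:
        if label in parameter_names:
            bound[label] = value      # a parameter with the same name exists
        else:
            leftovers.append(value)   # no match by name, fall back to position
    # Parameters that were not matched by name receive the leftovers in order.
    remaining = [name for name in parameter_names if name not in bound]
    bound.update(zip(remaining, leftovers))
    return bound


captured = [("X", ":X"), ("Y", ":Y"), ("Z", ":Z")]
by_name = bind_arguments(["X", "Y", "Z"], captured)
by_position = bind_arguments(["a", "b", "c"], captured)
mixed = bind_arguments(["a", "X", "b"], captured)
```

In the mixed case only X is matched by name, so a and b pick up the remaining values :Y and :Z in order. That is why a typo in a parameter name silently turns into a positional match.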

We don't currently implement type-based matching though and I'm wondering if that'd be a nice feature to add, given Cucumber Expressions can capture value object types explicitly

ciaranmcnulty commented 1 year ago

@mpkorstanje a feature toggle is a GREAT idea to avoid the BC issue entirely

ciaranmcnulty commented 1 year ago

If it's feature-toggled perhaps I'll pilot this in the PHP implementation?

mpkorstanje commented 1 year ago

Given :X :Y :Z
function ($a, $X, $b) // would receive $a=:Y $X=:X $b=:Z which can be counterintuitive with typos

Ah fair enough. I'll stick to positional until someone asks for it.

Anyway, what about Given {:A} {:A} {:A}? Is that an error in creating the cucumber expression?

ciaranmcnulty commented 1 year ago

I think that's {:label} // Anonymous (.*) with a label

mpkorstanje commented 1 year ago

Sure, but how would it or the equivalent turnip or regex with named groups map? How would a regex with named and unnamed groups match?

mattwynne commented 1 year ago

I just want to give this a big 👍

Seems like it could open up some interesting possibilities.

mladedav commented 6 months ago

I would really like to have this feature too. Having named arguments prevents many mistakes when just ordering is used, especially early on when the expressions change often.

As for having multiple definitions for the same name, I think that this should be an error. In my experience, a regex also fails to compile when multiple capture groups share the same name.
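
As a quick illustration using Python's re module (other regex engines may behave differently, and the compiles helper is only for this example), a pattern equivalent to Given {:A} {:A} {:A} is rejected at compile time:

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        # Python reports a redefinition of the group name here.
        return False


duplicate_names = compiles(r"^(?P<A>.*) (?P<A>.*) (?P<A>.*)$")
distinct_names = compiles(r"^(?P<A>.*) (?P<B>.*) (?P<C>.*)$")
```

Here duplicate_names is False and distinct_names is True, so an implementation built on this engine would get the error for free, while others might need an explicit check when the expression is created.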

I also think that Cucumber could disallow mixing named and positional arguments. One expression would have to commit to one or the other. I personally don't see a scenario where that would cause issues.

Otherwise I am also in favor of the {type:label} grammar. Or {label:type} since that is closer to what I'm used to, but either is fine.

As an aside, what would have to happen now for this to move forward? Is there some formal RFC process or does this just need implementation?