Nested capture groups are not passed to custom param definition transformer

notaphplover commented 1 year ago

👓 What did you see?

When creating a custom param definition, the transformer funcion does not received nested capturing groups.

✅ What did you expect to see?

As javascript developer, I expect regex capturing groups to conform Regex Ecamscript standard and capture nested groups

📦 Which tool/library version are you using?

https://www.npmjs.com/package/@cucumber/cucumber-expressions/v/16.1.2

🔬 How could we reproduce it?

Create a new npm package with your prefered typescript setup (the bug is also recreated with plain javascript).

In the package.json, install @cucumber/cucumber@9.2.0 dependency

Define a custom parameter with nested captured groups:

import { defineParameterType } from '@cucumber/cucumber';

defineParameterType({
  name: 'foo',
  regexp: /((foo))/,
  transformer: function (
    firstGroup: string,
    secondGroup: string
  ): string[] {
   return [firstGroup, secondGroup]
  },
});

Then, write any step definition and feature involving this custom definition

import { Given, Then, When } from '@cucumber/cucumber';

Given('{foo}', function(foo: string[]): void { console.log(foo) });
When('bar', function(): void {});
Then('baz', function(): void {});

Given foo
When bar
Then baz

Then, create a minimal config testing the feature processing the custom parameter and the step definitions and have a look at the console. Instead of ['foo', 'foo'], ['foo', undefined] is printed instead.

📚 Any additional context?

It seems the capturing groups are internally represented as a tree. Cucumber relies on TreeRegexp to create a GroupBuilder for the purpose of capturing regex groups. These groups are represented as a tree and only the root ones are passed to the custom definition transformer, ignoring any children groups.

Having a look at the Argument class:

public getValue<T>(thisObj: unknown): T | null {
  const groupValues = this.group ? this.group.values : null
  return this.parameterType.transform(thisObj, groupValues)
}

group.values are passed to the transformer:

export default class Group {
  constructor(
    public readonly value: string,
    public readonly start: number | undefined,
    public readonly end: number | undefined,
    public readonly children: readonly Group[]
  ) {}

  get values(): string[] | null {
    return (this.children.length === 0 ? [this] : this.children).map((g) => g.value)
  }
}

As you can see granchildren groups are never included. I would expect nested groups to be included as Ecmascript Regex spec states

This text was originally generated from a template, then edited by hand. You can modify the template here.

mpkorstanje commented 1 year ago

This applies to every language implementation. But I don't see how this can be fixed without potentially breaking everything for everyone.

Other than Ecmascript non-compliance, do you have a concrete example of how this prevents you from doing something?

notaphplover commented 1 year ago

Hi @mpkorstanje, yeah, sometimes both root and non root capturing groups are convenient to have.

As an example, consider a list of names whose delimiter is the comma (,) char. The last elements would be separated by using the and keyword. A possible regex would be:

((\w+)(?:, (\w+))*(?: and (\w+))?)

Consider a root group is added due to the fact the group 0 is not passed to the transformer.

Consider this link to have a playground. Including nested capturing groups leads to a different implementation that not having them. Nested capturing groups, in this example, allow us to know the last two elements separated by the and delimiter if present. I am aware parser implementations can be proposed to correctly parse all the elements, but they would have to do additional checks to determine wheter or not there're two last elements separated by the keyword and. More sophisticated examples can be proposed, but I find this one simple enough to illustrate my issue.

Other than Ecmascript non-compliance, do you have a concrete example of how this prevents you from doing something?

Well, I would hazard to say all regex specs shares this basic nested capturing group feature, not only Ecmascript one.

But I don't see how this can be fixed without potentially breaking everything for everyone.

Breaking or fixing? I can't find a place in the docs in which it's said this is the expected behavior. If this is the expected behavior I would prefer to close this issue in favor of a new one with a proposal to change it, so everyone knows this is not a bug, but a feature instead

mpkorstanje commented 1 year ago

Thanks for the explanation. That is a good usecase indeed.

The fix would be a breaking change for anyone who currently has declared a regex with nested capture groups but only utilizes the top level groups programmatically. And even though not explicitly documented, it is a fairly observable part of the API for any user so we'd run into Hyrum's Law.

This would need some accommodation to allow the different cucumber implementations to introduce this as a non-breaking change. Something along the lines of Argument.getValueFromCaptureGroups. But I haven't checked the exact usage in the Cucumber implementations.

notaphplover commented 1 year ago

Hey @mpkorstanje, thank you so much for giving feedback.

The fix would be a breaking change for anyone who currently has declared a regex with nested capture groups but only utilizes the top level groups programmatically. And even though not explicitly documented, it is a fairly observable part of the API for any user so we'd run into Hyrum's Law.

I understand your concerns. I cannot think in a non breaking fix simple enough that would be better than releasing a major version so, would fixing this whenever you guys consider it's time for a major version make sense?

mpkorstanje commented 1 year ago

Not so lightly. While we could make a breaking change here and increase the major version, it would also be a breaking change for the different Cucumber implementations that use this library. They're each on their own independent life cycle -- coordinating this would be rather difficult. The alternative would be maintaining multiple versions of cucumber-expressions which isn't ideal either.

notaphplover commented 1 year ago

But that's going to happen with every bug / feature you have, right? Do you have any process to accomplish this? How do you deliver a new feature / fix a bug in this project?

mpkorstanje commented 1 year ago

For this specific problem, typically by introducing a second code path as a non-breaking change. And then deprecating the bugged code path, eventually removing it once all the Cucumber implementations have gone through their life cycle.

notaphplover commented 1 year ago

Then, what about having a RecursiveGroup extends Group class that is able to recursive look for capture groups and adding options on the user entrypoint (I don't know which one/s would be) for either using RecursiveGroup or Group? This way we could fix this that way and deprecate those options in favor to always use RecursiveGroup

cucumber / cucumber-expressions