dart-lang / sdk

The Dart SDK, including the VM, JS and Wasm compilers, analysis, core libraries, and more.
https://dart.dev
BSD 3-Clause "New" or "Revised" License

RegEx: add a way to get the positions of groups #42307

Open ethanblake4 opened 4 years ago

ethanblake4 commented 4 years ago

When executing a RegExp, groups are returned as a map of Strings. However, for some use cases it is really important to get the index of where these groups are in the input string, rather than just the text.

For example, I have the following code:

final pattern = RegExp('(?:(?<=[^\\\\])|^){{(\\w*)}}');
final match = pattern.firstMatch('A captured word I {{capture}}');
print(match.group(1));

This correctly prints 'capture'. However, there is no way to know (at least in my use case, where I am being fed regexps from an external file) whether this group refers to the 'capture' at span 2-9 or the one at span 20-27 of the source string. In this case the answer is 20-27, and that is what I would like to be able to retrieve.

In other languages: in Python you would call match.span(1), which returns a tuple of the start and end positions. In Dart this could be replaced with an object. In JavaScript this is not supported, meaning this feature would probably not be supported in dart2js.

lrhn commented 4 years ago

We are deliberately JavaScript RegExp compatible because Dart is compiled to JavaScript, and the JavaScript RegExp doesn't provide that information. Adding a feature which won't work when compiled to JavaScript requires a very compelling argument. I don't see that happening any time soon.

For your use case, you can find the start/end easily because it's match.start+2/match.end-2. I know it's not always that easy.

(Also, your RegExp can be abbreviated to RegExp(r'(?<!\\){{(\w+)}}'), although I'd probably escape the { characters just to be sure).
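The offset arithmetic suggested above can be sketched as follows. It only works when the literal text around the group has a known, fixed width (here `{{` and `}}`). `fixedWidthGroupSpan` is a hypothetical helper name, not an SDK API, and the sketch assumes Dart 3 for record syntax. The returned end index is exclusive.

```dart
// Hypothetical helper: derive a group's span from the whole match when the
// text before and after the group has a fixed, known width.
(int, int) fixedWidthGroupSpan(Match match, int prefix, int suffix) =>
    (match.start + prefix, match.end - suffix);

void main() {
  // The abbreviated pattern from this comment, with the braces escaped.
  final pattern = RegExp(r'(?<!\\)\{\{(\w+)\}\}');
  const input = 'A captured word I {{capture}}';
  final match = pattern.firstMatch(input)!;

  // Skip the two-character `{{` prefix and `}}` suffix.
  final (start, end) = fixedWidthGroupSpan(match, 2, 2);
  print('$start-$end: ${input.substring(start, end)}'); // 20-27: capture
}
```

For patterns loaded from an external file, the prefix and suffix widths are generally not known, which is why this does not solve the general case.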

ethanblake4 commented 4 years ago

That use case was just an example, in reality I will be getting a lot of regexes from a file with a predefined syntax so I can't just modify them to work differently.

Here's a JS library that provides this functionality by extending the RegEx class: http://www--s0-v1.becke.ch/tool/becke-ch--regex--s0-v1/becke-ch--regex--s0-0-v1--homepage--pl--client/

Something like that could be written for Dart too, but I'd imagine it is significantly slower than just getting the information directly from the Regex engine. But perhaps you could use something like that to make it work in dart2js.

I think this is a very noticeable omission from the Dart stdlib, as this functionality is in nearly every modern language apart from JS: C#, Python, Ruby, Java, Rust, Go, Kotlin... in fact I can't find another language besides JS that omits this. Several of those languages also compile to JS, taking varying approaches of either omitting the functionality for the JS target or implementing their own regex engines which support it (usually in WASM).

mw66 commented 11 months ago

This omission is stupid.

Even ChatGpt can do this better:

To get the end position of the second capturing group in Dart, you can use the end property of the Match object, but you need to be aware that capturing group indices are 1-based. Here's how you can do it:

void main() {
  RegExp regex = RegExp(r'(\d+) (\w+)');
  String text = '123 example 456 test';

  Match match = regex.firstMatch(text);

  if (match != null) {
    // Get the start and end positions of the second capturing group (index 2)
    int start = match.start(2);
    int end = match.end(2);

    print('Second capturing group start position: $start'); // Prints 'Second capturing group start position: 4'
    print('Second capturing group end position: $end');     // Prints 'Second capturing group end position: 11'
  }
}
In this example, we define a regular expression RegExp(r'(\d+) (\w+)'), which captures two groups. To get the start and end positions of the second capturing group (index 2), we use match.start(2) and match.end(2).

Keep in mind that capturing group indices are 1-based, so the first capturing group has an index of 1, the second capturing group has an index of 2, and so on.

mw66 commented 11 months ago

This is what I end up doing:

line = line.substring(match!.start + match!.group(1)!.length + match!.group(2)!.length);

lrhn commented 11 months ago

JavaScript RegExps now allow access to the capture group indices when using the d flag. That makes it possible for Dart to do the same.

We can add a similar indices array or method to RegExpMatch.

I'd prefer to refactor Match to be more homogeneous and Dart-like, rather than add another weird method.

matthew-carroll commented 8 months ago

Now that the core blocker has been removed (JS support), will this work definitely be done? Does it have a priority in the backlog?

mraleph commented 8 months ago

I'd prefer to refactor Match to be more homogeneous and Dart-like, rather than add another weird method.

@lrhn do you have a concrete idea of what you would prefer the API to look like?

@matthew-carroll

will this work definitely be done?

If somebody does it.

We welcome patches as well. This is a fairly straightforward thing to implement because internally the regexp engine already represents groups by their start and end indices. So all the necessary work is around changing Dart code to expose this information.

The biggest question is the API design - but I am sure @lrhn can provide a sketch.

lrhn commented 8 months ago

To just add this feature, I'd add a way to get a Match object for RegExp match captures.

There are several ways to do that.

The very-small increment approach would be just adding to RegExpMatch:

int groupStart(int groupNumber);
int groupEnd(int groupNumber);

That is incredibly simple, and requires no extra allocation, but it's also not a great API.

(int start, int end) groupLimits(int groupNumber);

is not much better, but does introduce an allocation, so it's like the worst of both worlds.

I'd prefer to not add yet-another partial way to access a capture. Rather, I'd introduce something to represent the entire capture, start, end and match string, like a mini-Match, except that I don't want the extra baggage of Match. (I don't really need a reference to the pattern here. But then, neither does Match. Can't remember ever using it, and not being in a context where I could just have referred to the pattern directly).

Consider something like

interface class MatchSlice {
  final String source;
  final int start;
  final int end;
  // Cached substring, computed on first access to [match].
  String? _slice;
  MatchSlice(this.source, this.start, this.end);
  String get match => _slice ??= source.substring(start, end);
}

// We can make `Match` implement `MatchSlice`, which will give it the `.match` which is the non-nullable
// version of `[0]` that we're sorely lacking. That is breaking, though, so maybe skip it initially.

abstract interface class Match implements MatchSlice {
  ///
  String get match => this[0]!;
}

// Then add to `RegExpMatch`

final class RegExpMatch implements Match {
  // ...

  /// The capture groups of this match.
  ///
  /// An unmodifiable list of slices for each capture group of this 
  /// regular expression which participated in the match.
  ///
  /// The list has length [groupCount] + 1, and has an entry for each 
  /// capture group of the regular expression, plus an entry for the 
  /// entire match, treated as capture group zero.
  /// The entry for a capture is `null` if the capture did not participate in the
  /// entire match.
  /// The entry at index zero is always this `RegExpMatch`. The remaining
  /// entries are not `RegExpMatch`es, just plain `MatchSlice` objects.
  List<MatchSlice> get captures;

  /// ... same for named capture groups ...
  Map<String, MatchSlice> get namedCaptures;
}

That does mean accessing these captures will involve an allocation per match accessed (can be cached if accessed more than once), and maybe an allocation for captures and namedCaptures themselves, which can hopefully be inlined in most cases.
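As a rough illustration of the slice idea, here is a self-contained sketch that can be run against today's SDK. The `MatchSlice` class below only mirrors the shape proposed above and is not an SDK type. `guessCapture` is a hypothetical helper that approximates a group's position by searching for its text inside the whole match. That is a heuristic, not what the regexp engine would report: it returns the first occurrence, which can be the wrong one when the captured text repeats within the match.

```dart
// Hypothetical: mirrors the proposed MatchSlice shape; not an SDK class.
class MatchSlice {
  final String source;
  final int start;
  final int end;
  String? _slice;
  MatchSlice(this.source, this.start, this.end);

  /// The sliced text, computed on first access and cached afterwards.
  String get match => _slice ??= source.substring(start, end);
}

/// Heuristic workaround using only today's API: locate group [n] by
/// searching for its text inside the whole match. Returns the FIRST
/// occurrence, which may be wrong if the text repeats within the match.
MatchSlice? guessCapture(RegExpMatch m, int n) {
  final text = m.group(n);
  if (text == null) return null; // The group did not participate.
  final offset = m.group(0)!.indexOf(text);
  final start = m.start + offset;
  return MatchSlice(m.input, start, start + text.length);
}

void main() {
  final m = RegExp(r'(\d+) (\w+)').firstMatch('123 example 456 test')!;
  final slice = guessCapture(m, 2)!;
  print('${slice.start}-${slice.end}: ${slice.match}'); // 4-11: example
}
```

A real `captures` getter would read the indices straight from the engine, so it would have neither the ambiguity nor the extra string search.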

Longer-term, I'd like to remove group/groupCount/groups entirely, and remove [] from Match, keeping it only on RegExpMatch. That means that only a RegExpMatch has captures, which will likely require making Pattern generic, as interface class Pattern<M extends Match> { ... }, so that we can declare String.replaceAllMapped<M extends Match>(Pattern<M> pattern, String Function(M match) replace) { ... }. Then "abc-def-ghi".replaceAllMapped(RegExp(r'(\w)(\w*)'), (m) => "${m[1]!.toUpperCase()}${m[2]}") would type-check and infer RegExpMatch for m. (I hope this can work.)
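For comparison, the same call already compiles against the current, non-generic API, because `[]` is declared on `Match` today. The sketch below shows only that current behavior. `capitalizeWords` is a hypothetical name used for illustration. Under the change described above, the callback parameter would need to be inferred as `RegExpMatch` for `m[1]` to remain valid.

```dart
// Runs on today's SDK: `m` is typed as `Match`, and `m[i]` returns `String?`.
String capitalizeWords(String input) => input.replaceAllMapped(
      RegExp(r'(\w)(\w*)'),
      // Group 1 always participates, so `!` is safe here; group 2 may be
      // the empty string but is never null for this pattern.
      (m) => '${m[1]!.toUpperCase()}${m[2]}',
    );

void main() {
  print(capitalizeWords('abc-def-ghi')); // Abc-Def-Ghi
}
```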