Regex UpdateBumpalong logic is flawed and likely costing us perf

dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.

MIT License

15.16k stars 4.71k forks source link

We have an optimization that inserts an "UpdateBumpalong" node into the graph after a starting char loop: https://github.com/dotnet/runtime/blob/12a8819eee9865eb38bca6c05fdece1053102854/src/libraries/System.Text.RegularExpressions/src/System/Text/RegularExpressions/RegexNode.cs#L373-L384 The code handling this node currently sets base.runtextpos to the current position, the idea being that the normal bumpalong of +1 would otherwise end up just processing the same position we've already tried. However, it looks like our current logic for handling this is flawed, in particular for non-atomic greedy loops. The current logic (in all engines) does the equivalent of:

base.runtextpos = pos;

but when backtracking kicks in for a greedy loop, we'll end up setting base.runtextpos to the last match position, and then again to that position -1, and then again to that position -2, all the way back to the beginning, which means it's not actually buying us anything in this case (in contrast, for atomic loops where there isn't backtracking, it will successfully track the last matching position). It seems like the logic should actually be more like:

if (base.runtextpos < pos)
{
    base.runtextpos = pos;
}

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Text.RegularExpressions`, `tenet-performance`
Milestone:	7.0.0

Author:

stephentoub

Assignees:

Labels:

`area-System.Text.RegularExpressions`, `tenet-performance`

Milestone:

7.0.0

dotnet / runtime

Regex UpdateBumpalong logic is flawed and likely costing us perf #62676