Recursive Regexp Replace on $

ProgerXP / Notepad2e

Word highlighting, simultaneous editing, split views, math evaluation, un/grep, comment reformatting, UAC elevation, complete regexps (PCRE), Lua lexers, DPI awareness and more (XP+)

Other

375 stars 52 forks source link

Recursive Regexp Replace on $ #307

Open ProgerXP opened 4 years ago

ProgerXP commented 4 years ago

Document:

aa
bb

Run Replace from $ to \n in regexp mode - the program will loop forever. Desired result:

aa

bb

cshnik commented 4 years ago

Additional cases:

Search string: $ Replace with: \n\1\1\n

aa

bb

Search string: (.*)$ Replace with: \n\1\1\n


aaaa

bbbb

cshnik commented 4 years ago

Fixed.

ProgerXP commented 4 years ago

`const int isRegexEOL = (szFind2[iFind2Length - 1] == '$') && ((iFind2Length < 2) || (szFind2[iFind2Length - 2] != '\\'));`

We should never try to parse regexps manually. This is not a universal fix. Can't this be fixed properly?

cshnik commented 4 years ago

I'm not aware of the way how this could be done without manual target range adjustment due to the following reason:

RegEx works in line-by-line mode (see BoostRegexSearch::FindTextImpl()).
$ is not EOL char, but the last position on line.

Thus, regex replace with a string containing \n will not automatically adjust the target range (will not skip the EOL from the current line) which caused specified hang.

Current fix allow regex to deal with multiline replace strings by manual next range calculation:

If replace string contains EOLs then we should skip specific number of lines after regex replace.
If find string has trailing $ than we should skip additional EOL (the one mentioned in the previous paragraph).

ProgerXP commented 4 years ago

It's true that $ is "0 characters long". However, we can go from the other side: instead of parsing the regexp (which we can never do correctly), see if there are line breaks in the replacement string and add this number of bytes to the match position.

This might fail if line breaks are retrieved from a capture group ((\n) -> $1) but this is a much more rare case than various combinations of regexp containing $ ($|foo, (foo|$), etc. etc.).

cshnik commented 4 years ago

instead of parsing the regexp (which we can never do correctly)
`const int isRegexEOL = (szFind2[iFind2Length - 1] == '$') && ((iFind2Length < 2) || (szFind2[iFind2Length - 2] != '\\'));`

This is not actual parsing. We only should detect if trailing character is unquoted $:

Last char is $
There is no other chars OR previous char is not \

This might fail if line breaks are retrieved from a capture group ((\n) -> $1) but this is a much more rare case than various combinations of regexp containing $ ($|foo, (foo|$), etc. etc.).

None of your regex samples will trigger isRegexEOL = true-logic.
The problem remains (hang on recursive regex replace) when using my samples from above.

Please confirm if current solution does not fit.

cshnik commented 4 years ago

Update: there is a missing case in current isRegexEOL detection logic: \\$ - regex to find line consisting of single \ will not be correctly processed since isRegexEOL=false.

Additional condition should be added: number of serial \ before $ should be even.

ProgerXP commented 4 years ago

Another bug that likely has same root cause:

a@b@c
a@b@c

Do Replace from ^[^@]+@ to empty string in regexp mode. Result:

c
c

Instead of:

b@c
b@c

In other words, regexp must never be applied more than once to the same region in the source file. Here it is applied to each line in a loop until there are no more matches in that line. This is incorrect, check tools like sed that do it properly:

(echo a@b@c; echo a@b@c) | sed -r 's/^[^@]+@//g'

cshnik commented 4 years ago

Additional test cases:

a@b@c
1
abc@b@c
2

Replace from: ^[^@]+@.*$ Replace to: \n

Result:


1

2

Replace from: ^([^@]+@.*)$ Replace to: \1X\n\1

Result:

a@b@cX
a@b@c
1
abc@b@cX
abc@b@c
2

Fixed.

ProgerXP commented 4 years ago

Document:

aa bb
aa bb

Replace from .* to z in regexp mode. Result:

zz
z

Instead of (.* is greedy, matches entire line):

z
z

cshnik commented 4 years ago

There was a change in RESearchRange class (came from Scintilla 3.11.2) which caused specified incorrect behavior when using BoostRegExSearch-implementation. Fixed.

sergeevabc commented 3 years ago

Any news on when “upcoming release”, which includes this fix, will be available as a binary?

ProgerXP commented 3 years ago

@sergeevabc Any news on when “upcoming release”, which includes this fix, will be available as a binary?

These fixes are already in beta builds that you can download from here. However, this is really tricky, I am still running into multiple problems with current implementation and I'm considering rolling it back entirely and try addressing the original issue ("Replace on $") in some other way.

What is your use case?