dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Huge performance regression in Regex.Replace(string, string) #44808

Closed iSazonov closed 3 years ago

iSazonov commented 3 years ago

Description

In the PowerShell repository we received a report that a user script runs significantly slower on PowerShell 7.1 (based on .NET 5.0) than on PowerShell 7.0 (based on .NET Core 3.1).

After some investigation we found that the script makes many calls to the Regex.Replace(string, string) method and chokes the GC with ReadOnlyMemory&lt;char&gt; allocations.

Please see https://github.com/PowerShell/PowerShell/issues/14087 for details. PerfView files are attached there.

Regression?

Yes, it is a regression in .NET 5.0 from .NET Core 3.1.

ghost commented 3 years ago

Tagging subscribers to this area: @eerhardt, @pgovind, @jeffhandley See info in area-owners.md if you want to be subscribed.

Issue Details
Author: iSazonov
Assignees: -
Labels: `area-System.Text.RegularExpressions`, `tenet-performance`, `untriaged`
Milestone: -

stephentoub commented 3 years ago

How many replacements are being performed? If we're seeing tons of ReadOnlyMemory<char>[] being created from Regex.Replace, that suggests the size of the array is too big for ArrayPool to cache, which suggests over a million replacements are being performed as part of a single Regex.Replace?
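For context, this reasoning rests on the fact that ArrayPool&lt;T&gt;.Shared only caches arrays up to roughly 2^20 elements; larger Rent requests are satisfied with fresh GC allocations that the pool never keeps. A rough sketch (the size limit is an implementation detail of the shared pool, not a documented contract):

```csharp
using System;
using System.Buffers;

class PoolLimit
{
    static void Main()
    {
        // Requests within the pool's limit can be served from cached arrays.
        char[] small = ArrayPool<char>.Shared.Rent(64 * 1024);
        ArrayPool<char>.Shared.Return(small);

        // A request beyond the cached sizes still succeeds, but the array
        // comes straight from the GC heap, so code that repeatedly rents
        // huge buffers churns the GC instead of reusing memory.
        char[] big = ArrayPool<char>.Shared.Rent(2_000_000);
        Console.WriteLine(big.Length >= 2_000_000);
    }
}
```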

stephentoub commented 3 years ago

Also, can you put together a standalone, simple C# repro focused on direct use of Regex.Replace? Such investigations are much easier when they come in such a form.

iSazonov commented 3 years ago

For the PerfView results I used a data file with 100,000 lines (the first lines from the original file shared in the PowerShell issue by its author), and the test script applies 54 replace operators per line. Line size is < 70 chars.

I could try to prepare C# code but only tomorrow (my time zone is UTC+5).
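The kind of standalone repro being requested might look like the sketch below; the patterns and input lines are placeholders standing in for the user's actual file and 54 operators, not the real data:

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class ReplaceRepro
{
    static void Main()
    {
        // Placeholder input: many short lines, as in the PowerShell report.
        string[] lines = new string[100_000];
        for (int i = 0; i < lines.Length; i++)
            lines[i] = "127.0.0.1 www.example" + (i % 70) + ".com #SomeTag";

        // A few literal-style patterns standing in for the 54 -replace operators.
        string[] patterns = { @"127\.0\.0\.1", @"www\.", "#SomeTag", " " };

        var sw = Stopwatch.StartNew();
        foreach (string line in lines)
        {
            string s = line;
            foreach (string pattern in patterns)
                s = Regex.Replace(s, pattern, ""); // each call feeds the next
        }
        sw.Stop();
        Console.WriteLine($"Elapsed: {sw.ElapsedMilliseconds} ms");
    }
}
```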

stephentoub commented 3 years ago

I could try to prepare C# code but only tomorrow (my time zone is UTC+5).

That'd be helpful. Thanks.

stephentoub commented 3 years ago

the test script has 54 replace operators called per line. Line size is < 70 chars

It's calling Regex.Replace per line, so the input to each Replace call is at most 70 characters? If so this is very surprising. Is there any parallelism being employed, or is all use of regex sequential?

iSazonov commented 3 years ago

the test script has 54 replace operators called per line. Line size is < 70 chars

It's calling Regex.Replace per line, so the input to each Replace call is at most 70 characters?

Here are statistics for the input file:

Count Length
----- ------
    1 0
    2 1
   14 5
  135 6
  226 7
  539 8
 1029 9
 1487 10
 2058 11
 3904 12
 4415 13
 4255 14
 5490 15
 4947 16
 5059 17
 4703 18
 5490 19
 6525 20
 4354 21
 7114 22
 3285 23
 4032 24
 2443 25
 1532 26
 8881 27
 2661 28
 1747 29
 1498 30
 1179 31
 2989 32
  684 33
  917 34
 1195 35
  502 36
  439 37
  313 38
  364 39
  296 40
  237 41
  147 42
  405 43
   85 44
  185 45
  263 46
   73 47
   83 48
   85 49
   62 50
   54 51
   99 52
  221 53
   40 54
   33 55
   90 56
   95 57
   22 58
   41 59
  106 60
  275 61
    8 62
   24 63
   10 64
   14 65
   58 66
   14 67
    8 68
   11 69
   12 70
   13 71
    6 72
   76 73
    3 74
   11 75
  252 76
    1 77
    6 78
    2 79
    1 80
    4 81
    5 82
    5 83
    4 84
    3 85
    3 86
    1 88
    1 89
    1 90
    2 91
    1 92
    1 93
    1 94
    1 95
    1 100
    2 105
    2 106
    1 107
    1 108
    1 109
    1 110
    1 111
    2 112
    3 113
    4 114
    1 115
    1 118
    3 121
    1 124
    3 125
    1 126
    1 132
    1 141
    1 253

stephentoub commented 3 years ago

Here statistics for the input file:

Thanks, but to clarify my question, it's invoking Regex.Replace once per input line per replacement pattern, and every string input to Regex.Replace is going to be relatively short, e.g. less than a few hundred characters? I suspect that's not actually the case, or if it is, something is very wrong.

iSazonov commented 3 years ago

The script reads the data file line by line; for each line it applies 54 replace operations, each one applied to the result of the previous one. Most lines are short (max length 253, for 1 line).

daxian-dbw commented 3 years ago

@stephentoub The repro script calls -replace (which calls to regex.Replace(string, string)) 54 times per input line with different replacement patterns.

        # Removes junk lines of text
        $data = Get-Content $aioTextFile -Encoding UTF8

        # Removes some substrings within the data
        $data = $data | ForEach-Object { `
                $_ -replace '127\.0\.0\.1', '' -replace '0\.0\.0\.0', '' `
                -replace ' ', '' -replace '\t', '' -replace '\|\|', '' `
                -replace '\^', '' -replace 'www\.', '' `
                -replace '#Shock-Porn', '' -replace '#AAAEcommerce', '' `
                -replace '#(Vietnam)', '' -replace '#Acoustic', '' `
                -replace '#Albert', '' -replace '#Admedo', '' `
                -replace '#AdTriba', '' -replace '#AgentStudio', '' `
                -replace '#Algolia', '' -replace '#Race', '' `
                -replace '#Email,SMS,Popups', '' -replace '#Appier', '' `
                -replace '#ZineOne', '' -replace '#DeadBabies(Anti-Abortion)', '' `
                -replace '#AdobeDigitalMarketingSuite', '' -replace '#Race', '' `
                -replace '#Attentive', '' -replace '#Auctiva', '' `
                -replace '#Beelert', '' -replace '#Granify', '' `
                -replace '#Gore', '' -replace '#ExtremistGroup(Uncertain)', '' `
                -replace '#SixBit', '' -replace '#DigitalMarketing', '' `
                -replace '#Race/Gender(SimonSheppard)', '' -replace '#BoldCommerce', '' `
                -replace '#BridgeWell', '' -replace '#BuckSense', '' `
                -replace '#Candid.io', '' -replace '#SailThru', '' `
                -replace '#PCM,Inc.', '' -replace '#Frooition', '' `
                -replace '#RichContext', '' -replace '#SearchSpring', '' `
                -replace '#SendPulse', '' -replace '#Kibo', '' `
                -replace '#Satanism', '' -replace '#ClearLink', '' `
                -replace '#CleverTap', '' -replace '#ClickFirstMarketing', '' `
                -replace '\&visitor=\*\&referrer=', '' -replace '://', '' `
                -replace '\:\:1ip6-localhostip6-loopbacklocalhost6localhost6\.localdomain6', '' `
                -replace '\:\:1localhost', '' -replace '\?action=log_promo_impression', '' `
                -replace '\[hidethisheader(viewthelistonitsown)', '' `
                -replace '\[viewthelistinadifferentformat\|aboutthislist\|viewtheblocklistofIPaddresses', ''
        }
GrabYourPitchforks commented 3 years ago

Looking at the patterns, most of them (except where parens are involved) seem to be looking for exact string literals? I wonder if a possible optimization could be for Regex to realize that the input pattern doesn't contain anything "interesting" from a regex matching perspective and to forward down to the normal string.Replace API in those cases.
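A caller-side version of that idea could be sketched as below. `LiteralReplace` is a hypothetical helper, not a runtime API: if the pattern contains no regex metacharacters (using a deliberately conservative list) and the replacement contains no `$` substitutions, an ordinal string.Replace is equivalent.

```csharp
using System;
using System.Text.RegularExpressions;

static class LiteralReplace
{
    // Conservative list of characters that can be special in a pattern
    // (space/tab/# only matter under IgnorePatternWhitespace, but being
    // conservative just means taking the slow path a bit more often).
    private static readonly char[] Metachars = "\\*+?|{}[]()^$.# \t".ToCharArray();

    public static string Replace(string input, string pattern, string replacement)
    {
        // Fast path: a metacharacter-free pattern with no '$' substitutions
        // in the replacement behaves identically to string.Replace.
        if (pattern.IndexOfAny(Metachars) < 0 && replacement.IndexOf('$') < 0)
            return input.Replace(pattern, replacement);

        return Regex.Replace(input, pattern, replacement);
    }

    static void Main()
    {
        Console.WriteLine(Replace("aXbXc", "X", "-"));               // literal fast path
        Console.WriteLine(Replace("www.example.com", @"www\.", "")); // regex path
    }
}
```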

danmoseley commented 3 years ago

That would also be a case where it would be nice to have an API to avoid allocating intermediate strings.

stephentoub commented 3 years ago

if it is, something is very wrong.

Yeah, something was very wrong: when there were no replacements to be made, we were neglecting to return a buffer to the pool. I'll put up a PR.
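The bug class can be illustrated with a sketch (hypothetical shapes, not the runtime's actual code): an early exit on the "nothing matched" path skips the Return call, so every no-op Replace leaks one rented array for the GC to collect; routing every exit through a finally block fixes it.

```csharp
using System;
using System.Buffers;

class PoolReturnSketch
{
    // Buggy shape: the early exit for "no replacements" never returns
    // the rented buffer to the pool.
    static string ReplaceBuggy(string input, bool anyMatch)
    {
        var segments = ArrayPool<ReadOnlyMemory<char>>.Shared.Rent(16);
        if (!anyMatch)
            return input; // BUG: 'segments' never goes back to the pool

        // ... build the result from 'segments' ...
        ArrayPool<ReadOnlyMemory<char>>.Shared.Return(segments);
        return input;
    }

    // Fixed shape: every exit path returns the buffer.
    static string ReplaceFixed(string input, bool anyMatch)
    {
        var segments = ArrayPool<ReadOnlyMemory<char>>.Shared.Rent(16);
        try
        {
            if (!anyMatch)
                return input; // the finally block still runs
            // ... build the result from 'segments' ...
            return input;
        }
        finally
        {
            ArrayPool<ReadOnlyMemory<char>>.Shared.Return(segments);
        }
    }

    static void Main()
    {
        Console.WriteLine(ReplaceFixed("unchanged", anyMatch: false));
    }
}
```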

stephentoub commented 3 years ago

Fix is at https://github.com/dotnet/runtime/pull/44833

stephentoub commented 3 years ago

Looking at the patterns, most of them (except where parens are involved) seem to be looking for exact string literals? I wonder if a possible optimization could be for Regex to realize that the input pattern doesn't contain anything "interesting" from a regex matching perspective and to forward down to the normal string.Replace API in those cases.

It's possible, if the regex parse tree contains just a single character or a "multi" (just a text string), if the options are configured appropriately (default would be appropriate), and with the string-based overload (rather than the callback-based overload). I did a quick hack to see what it would save, and for this specific repro, it would appx double throughput.

pgovind commented 3 years ago

if the regex parse tree contains just a single character or a "multi"

I was going through this just yesterday. This would go down the Boyer Moore path right? I'm guessing the speedup is because vectorized string search is much better than non-vectorized BM. I'm wondering if our BM implementation can be optimized/vectorized further. Anyway, this is just an aside, I'm looking at your PR now :)

stephentoub commented 3 years ago

I was going through this just yesterday. This would go down the Boyer Moore path right?

It does, yes.

I'm guessing the speedup is because vectorized string search is much better than non-vectorized BM.

That's likely a contributor, though for many of these the advance distance is pretty large, so I suspect it's not the bulk of the difference. There's non-trivial code to traverse through to get from the public Regex.Replace down to the inner Boyer-Moore search loop in FindFirstChar; if I had to guess, I'd wager a non-trivial chunk of the difference came from my just replacing Regex.Replace with string.Replace in my hack, and thus skipping all of that overhead just getting to the place where the search happens.

iSazonov commented 3 years ago

@stephentoub I see you found and fixed one problem. Thanks! In my original test, the result was 112 seconds in all 5.0 preview versions, but in the final version it was over 300 seconds. I'm afraid something else changed that we still haven't covered here.

stephentoub commented 3 years ago

In my original test, the result was 112 seconds in all 5.0 preview versions, but in the final version it was over 300 seconds. I'm afraid there was something else changed that we still haven't covered here.

What test?

iSazonov commented 3 years ago

I mean https://github.com/PowerShell/PowerShell/issues/14087#issuecomment-728748420 - the same test took ~112 sec on all preview versions, ~307 sec on GA.

stephentoub commented 3 years ago

Does PowerShell execute the regex exactly as provided, or does it prepend anything to it, like .* ?

iSazonov commented 3 years ago

I don't see any modification of the input regex string for the replace operator in the PowerShell code.

stephentoub commented 3 years ago

In that case I don't know why there would be any difference between previews and release. If you can come up with a repro for a different regression from 3.1, please open a new issue. Thanks.

iSazonov commented 3 years ago

@stephentoub When can we get #44833? Will it be in 6.0 Preview 1? After that we could re-run the test in PowerShell and give feedback if needed.

stephentoub commented 3 years ago

@iSazonov, it's already merged into master, so any 6.0 preview would have it; it should also be in nightly builds starting today if you wanted to try running with one of those. And https://github.com/dotnet/runtime/pull/44837 is for possibly getting it into 5.0.x servicing.

danmoseley commented 3 years ago

Returning to my original comment -- if this pattern of heavily chained replacements is a PowerShell idiom, I could imagine PowerShell recognizing it and using some API optimized for this purpose (to avoid intermediates). Perhaps that would look like part of https://github.com/dotnet/runtime/issues/23602 which we already discussed.
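One caller-side workaround in that spirit is to fold the chained patterns into a single alternation, so each line is scanned once instead of 54 times. A sketch, with placeholder patterns; note this is only equivalent when every replacement is the empty string and no pattern depends on the output of an earlier replacement:

```csharp
using System;
using System.Text.RegularExpressions;

class ChainedReplace
{
    static void Main()
    {
        // Placeholders standing in for the script's 54 literal-style patterns.
        string[] patterns = { @"127\.0\.0\.1", @"0\.0\.0\.0", @"www\.", @"\t", @"\|\|" };

        // One compiled alternation instead of N sequential Regex.Replace calls.
        var combined = new Regex(string.Join("|", patterns), RegexOptions.Compiled);

        string line = "127.0.0.1\twww.example.com";
        Console.WriteLine(combined.Replace(line, ""));
    }
}
```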

iSazonov commented 3 years ago

@danmosemsft Most PowerShell users are not experienced developers (not even in PowerShell) and can create the most incredible constructs. Of course, a chain of 54 substitution operators is not a typical case, but from my experience I would say that such chains are used - this is one of the simple and pleasant features of PowerShell. If we consider that this is how users process modern logs, which are tens of gigabytes in size, then performance is undoubtedly important. If there is a new API that allows chaining, PowerShell will not be able to benefit from it immediately - we would need to improve either the PowerShell engine or the operator itself. We could also think about searching a string with one pattern and then replacing in the result with another pattern.

I'd speculate that regex in general is too complex for end users, and most of the regex patterns PowerShell users create are very, very simple. If we look at the 54 patterns, they are all simple. I guess the .NET team could use this fact to create allocation-free regex optimizations.

stephentoub commented 3 years ago

I guess .Net team could use the fact to create allocationless Regex optimizations.

The allocation here is the string returned from Regex.Replace; if anything is replaced, it has to return a new string (it returns the original string if nothing is replaced).
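That no-match behavior is observable: the returned string is the same instance as the input, so nothing is allocated for the result. A small demonstration:

```csharp
using System;
using System.Text.RegularExpressions;

class NoMatchNoAlloc
{
    static void Main()
    {
        string input = "no digits here";

        // No match: Regex.Replace hands back the original string instance.
        string same = Regex.Replace(input, @"\d+", "#");
        Console.WriteLine(ReferenceEquals(input, same)); // True

        // A match forces a new string to be allocated for the result.
        string changed = Regex.Replace("room 101", @"\d+", "#");
        Console.WriteLine(changed); // room #
    }
}
```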

iSazonov commented 3 years ago

I guess .Net team could use the fact to create allocationless Regex optimizations.

The allocation here is the string returned from Regex.Replace; if anything is replaced, it has to return a new string (it returns the original string if nothing is replaced).

My note was about the chained scenario.


I can confirm the fix works. (I didn't compile PowerShell with the nightly .NET 6.0 build; I only copied the DLLs over - after that, the original script works as fast as with .NET 5.0 previews.)