Closed: iSazonov closed this issue 3 years ago.

Tagging subscribers to this area: @eerhardt, @pgovind, @jeffhandley
See info in area-owners.md if you want to be subscribed.

| | |
| --- | --- |
| Author: | iSazonov |
| Assignees: | - |
| Labels: | `area-System.Text.RegularExpressions`, `tenet-performance`, `untriaged` |
| Milestone: | - |
How many replacements are being performed? If we're seeing tons of `ReadOnlyMemory<char>[]` being created from `Regex.Replace`, that suggests the size of the array is too big for `ArrayPool` to cache, which suggests over a million replacements are being performed as part of a single `Regex.Replace`?
Also, can you put together a standalone, simple C# repro focused on direct use of Regex.Replace? Such investigations are much easier when they come in such a form.
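For context on the pooling limit mentioned above, here is a minimal sketch of the `ArrayPool<T>.Shared` behavior in question (the 2^20, i.e. 1,048,576, element cap is a current implementation detail of the shared pool, not a documented guarantee):

```csharp
using System;
using System.Buffers;

class PoolLimitDemo
{
    static void Main()
    {
        // The shared pool caches arrays only up to a maximum size (2^20
        // elements at the time of writing). Larger Rent requests are
        // satisfied with fresh allocations, and Return simply drops them,
        // so every oversized rental ends up as GC work.
        ArrayPool<ReadOnlyMemory<char>> pool = ArrayPool<ReadOnlyMemory<char>>.Shared;

        ReadOnlyMemory<char>[] small = pool.Rent(1_024);
        pool.Return(small);      // pooled: a future Rent(1024) can reuse it

        ReadOnlyMemory<char>[] huge = pool.Rent(2_000_000);
        pool.Return(huge);       // over the cap: silently dropped for the GC to collect
    }
}
```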
For the PerfView results I used a data file with 100,000 lines (the first lines of the original file shared in the PowerShell issue by its author), and the test script runs 54 replace operators per line. Line size is < 70 chars.

I could try to prepare C# code, but only tomorrow (my time zone is UTC+5).

> I could try to prepare C# code, but only tomorrow (my time zone is UTC+5).
That'd be helpful. Thanks.
> the test script runs 54 replace operators per line. Line size is < 70 chars

It's calling `Regex.Replace` per line, so the input to each Replace call is at most 70 characters? If so this is very surprising. Is there any parallelism being employed, or is all use of regex sequential?

> the test script runs 54 replace operators per line. Line size is < 70 chars

> It's calling `Regex.Replace` per line, so the input to each Replace call is at most 70 characters?
Here are statistics for the input file (number of lines at each line length):

```
Count Length
----- ------
1 0
2 1
14 5
135 6
226 7
539 8
1029 9
1487 10
2058 11
3904 12
4415 13
4255 14
5490 15
4947 16
5059 17
4703 18
5490 19
6525 20
4354 21
7114 22
3285 23
4032 24
2443 25
1532 26
8881 27
2661 28
1747 29
1498 30
1179 31
2989 32
684 33
917 34
1195 35
502 36
439 37
313 38
364 39
296 40
237 41
147 42
405 43
85 44
185 45
263 46
73 47
83 48
85 49
62 50
54 51
99 52
221 53
40 54
33 55
90 56
95 57
22 58
41 59
106 60
275 61
8 62
24 63
10 64
14 65
58 66
14 67
8 68
11 69
12 70
13 71
6 72
76 73
3 74
11 75
252 76
1 77
6 78
2 79
1 80
4 81
5 82
5 83
4 84
3 85
3 86
1 88
1 89
1 90
2 91
1 92
1 93
1 94
1 95
1 100
2 105
2 106
1 107
1 108
1 109
1 110
1 111
2 112
3 113
4 114
1 115
1 118
3 121
1 124
3 125
1 126
1 132
1 141
1 253
```
> Here are statistics for the input file:

Thanks, but to clarify my question: it's invoking `Regex.Replace` once per input line per replacement pattern, and every string input to `Regex.Replace` is going to be relatively short, e.g. less than a few hundred characters? I suspect that's not actually the case, or if it is, something is very wrong.

The script reads the data file line by line; for each line it applies 54 replace operations, each applied to the result of the previous one. Most lines are short (the max length is 253, for one line).
@stephentoub The repro script calls `-replace` (which calls `Regex.Replace(string, string)`) 54 times per input line with different replacement patterns.
```powershell
# Removes junk lines of text
$data = Get-Content $aioTextFile -Encoding UTF8
# Removes some substrings within the data
$data = $data | ForEach-Object { `
$_ -replace '127\.0\.0\.1', '' -replace '0\.0\.0\.0', '' `
-replace ' ', '' -replace '\t', '' -replace '\|\|', '' `
-replace '\^', '' -replace 'www\.', '' `
-replace '#Shock-Porn', '' -replace '#AAAEcommerce', '' `
-replace '#(Vietnam)', '' -replace '#Acoustic', '' `
-replace '#Albert', '' -replace '#Admedo', '' `
-replace '#AdTriba', '' -replace '#AgentStudio', '' `
-replace '#Algolia', '' -replace '#Race', '' `
-replace '#Email,SMS,Popups', '' -replace '#Appier', '' `
-replace '#ZineOne', '' -replace '#DeadBabies(Anti-Abortion)', '' `
-replace '#AdobeDigitalMarketingSuite', '' -replace '#Race', '' `
-replace '#Attentive', '' -replace '#Auctiva', '' `
-replace '#Beelert', '' -replace '#Granify', '' `
-replace '#Gore', '' -replace '#ExtremistGroup(Uncertain)', '' `
-replace '#SixBit', '' -replace '#DigitalMarketing', '' `
-replace '#Race/Gender(SimonSheppard)', '' -replace '#BoldCommerce', '' `
-replace '#BridgeWell', '' -replace '#BuckSense', '' `
-replace '#Candid.io', '' -replace '#SailThru', '' `
-replace '#PCM,Inc.', '' -replace '#Frooition', '' `
-replace '#RichContext', '' -replace '#SearchSpring', '' `
-replace '#SendPulse', '' -replace '#Kibo', '' `
-replace '#Satanism', '' -replace '#ClearLink', '' `
-replace '#CleverTap', '' -replace '#ClickFirstMarketing', '' `
-replace '\&visitor=\*\&referrer=', '' -replace '://', '' `
-replace '\:\:1ip6-localhostip6-loopbacklocalhost6localhost6\.localdomain6', '' `
-replace '\:\:1localhost', '' -replace '\?action=log_promo_impression', '' `
-replace '\[hidethisheader(viewthelistonitsown)', '' `
-replace '\[viewthelistinadifferentformat\|aboutthislist\|viewtheblocklistofIPaddresses', ''
}
```
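For reference, a rough standalone C# equivalent of the hot loop (a sketch only: the `hosts.txt` file name is hypothetical and the pattern list is abbreviated to a few of the 54 above):

```csharp
using System.IO;
using System.Text.RegularExpressions;

class Repro
{
    static void Main()
    {
        // Abbreviated subset of the 54 patterns from the PowerShell script.
        string[] patterns = { @"127\.0\.0\.1", @"0\.0\.0\.0", @"\t", @"\|\|", @"\^", @"www\." };

        // Stands in for the ~100,000-line input file.
        foreach (string line in File.ReadLines("hosts.txt"))
        {
            string s = line;
            foreach (string pattern in patterns)
            {
                // Each replacement consumes the result of the previous one,
                // mirroring the chained -replace operators.
                s = Regex.Replace(s, pattern, "");
            }
        }
    }
}
```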
Looking at the patterns, most of them (except where parens are involved) seem to be looking for exact string literals? I wonder if a possible optimization could be for Regex to realize that the input pattern doesn't contain anything "interesting" from a regex matching perspective and to forward down to the normal `string.Replace` API in those cases.
That would also be a case where it would be nice to have an API to avoid allocating intermediate strings.
> if it is, something is very wrong.
Yeah, something was very wrong: when there were no replacements to be made, we were neglecting to return a buffer to the pool. I'll put up a PR.
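A simplified sketch of that failure mode (illustrative only, not the actual runtime code): the replace path rents a segment buffer up front, and the early exit taken when nothing matches returned without handing the buffer back, so every no-op `Replace` call turned a rented array into garbage.

```csharp
using System;
using System.Buffers;

static class LeakSketch
{
    public static string Replace(string input, Predicate<string> hasMatch)
    {
        ArrayPool<ReadOnlyMemory<char>> pool = ArrayPool<ReadOnlyMemory<char>>.Shared;
        ReadOnlyMemory<char>[] segments = pool.Rent(256);

        if (!hasMatch(input))
        {
            return input; // BUG: this path skips pool.Return(segments)
        }

        // ... collect replacement segments and build the result string ...
        string result = input; // placeholder for the real build step

        pool.Return(segments, clearArray: true); // only the match path returns the buffer
        return result;
    }
}
```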
> Looking at the patterns, most of them (except where parens are involved) seem to be looking for exact string literals? I wonder if a possible optimization could be for Regex to realize that the input pattern doesn't contain anything "interesting" from a regex matching perspective and to forward down to the normal `string.Replace` API in those cases.
It's possible, if the regex parse tree contains just a single character or a "multi" (just a text string), if the options are configured appropriately (the default would be appropriate), and with the string-based overload (rather than the callback-based overload). I did a quick hack to see what it would save, and for this specific repro, it would approximately double throughput.
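As an illustration of the idea, a minimal sketch (a hypothetical helper, not what the runtime implements): treat the pattern as a literal when it contains no regex metacharacters and the replacement contains no `$` substitutions, then forward to `string.Replace`, which, like regex matching, compares ordinally.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexFastPath
{
    // Characters that can give a pattern non-literal meaning.
    private const string Metachars = @"\^$.|?*+()[{";

    public static string Replace(string input, string pattern, string replacement)
    {
        // Fast path: a metacharacter-free pattern matches exactly its own text,
        // and a '$'-free replacement performs no substitutions, so string.Replace
        // yields an identical result without invoking the regex engine.
        if (pattern.AsSpan().IndexOfAny(Metachars) < 0 && !replacement.Contains('$'))
        {
            return input.Replace(pattern, replacement);
        }

        return Regex.Replace(input, pattern, replacement);
    }
}
```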
> if the regex parse tree contains just a single character or a "multi"
I was going through this just yesterday. This would go down the Boyer-Moore path, right? I'm guessing the speedup is because vectorized string search is much better than non-vectorized Boyer-Moore. I'm wondering if our Boyer-Moore implementation can be optimized/vectorized further. Anyway, this is just an aside; I'm looking at your PR now :)
> I was going through this just yesterday. This would go down the Boyer-Moore path, right?
It does, yes.
> I'm guessing the speedup is because vectorized string search is much better than non-vectorized Boyer-Moore.
That's likely a contributor, though for many of these the advance distance is pretty large, so I suspect it's not the bulk of the difference. There's non-trivial code to traverse through to get from the public `Regex.Replace` down to the inner Boyer-Moore search loop in `FindFirstChar`; if I had to guess, I'd wager a non-trivial chunk of the difference came from my just replacing `Regex.Replace` with `string.Replace` in my hack, and thus skipping all of that overhead just getting to the place where the search happens.
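One way to measure the gap being described, sketched with BenchmarkDotNet (the sample line and literal pattern here are made up for illustration):

```csharp
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System.Text.RegularExpressions;

public class LiteralReplaceBenchmark
{
    private const string Line = "||www.example.com^#SomeTag";

    [Benchmark]
    public string ViaRegex() => Regex.Replace(Line, @"www\.", "");

    [Benchmark(Baseline = true)]
    public string ViaString() => Line.Replace("www.", "");
}

public class Program
{
    public static void Main() => BenchmarkRunner.Run<LiteralReplaceBenchmark>();
}
```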
@stephentoub I see you found and fixed one problem. Thanks! In my original test, the result was 112 seconds in all 5.0 preview versions, but in the final version it was over 300 seconds. I'm afraid something else changed that we still haven't covered here.
> In my original test, the result was 112 seconds in all 5.0 preview versions, but in the final version it was over 300 seconds. I'm afraid something else changed that we still haven't covered here.
What test?
I mean https://github.com/PowerShell/PowerShell/issues/14087#issuecomment-728748420: the same test takes ~112 sec on all preview versions and ~307 sec on GA.
Does PowerShell execute the regex exactly as provided, or does it prepend anything to it, like `.*`?
I don't see any modification of the input regex string for the replace operator in the PowerShell code.
In that case I don't know why there would be any difference between previews and release. If you can come up with a repro for a different regression from 3.1, please open a new issue. Thanks.
@stephentoub When can we get #44833? Will it be in 6.0 Preview 1? After that we could re-run the test in PowerShell and give feedback if needed.
@iSazonov, it's already merged into master, so any 6.0 preview would have it; it should also be in nightly builds starting today if you wanted to try running with one of those. And https://github.com/dotnet/runtime/pull/44837 is for possibly getting it into 5.0.x servicing.
Returning to my original comment: if this pattern of heavily chained replacements is a PowerShell idiom, I could imagine PowerShell could recognize it and use some API optimized for this purpose (to avoid intermediates). Perhaps that would look like part of https://github.com/dotnet/runtime/issues/23602, which we already discussed.
@danmosemsft Most PowerShell users are not experienced developers (not even in PowerShell) and can create the most incredible constructs. Of course, a chain of 54 substitution operators is not a typical case, but from my experience I would say that such chains are used; this is one of the simple and pleasant features of PowerShell. If we consider that this is how users process modern logs, which are tens of gigabytes in size, then performance is undoubtedly important. If there is a new API that allows chaining, PowerShell will not be able to benefit from it immediately; we would need to improve either the PowerShell engine or the operator itself. We could also think about searching a string with one pattern and then replacing within the result using another pattern.
I'd speculate that regex in general is too complex for end users, and most regex patterns PowerShell users create are very, very simple. If we look at the 54 patterns above, they are all simple. I guess the .NET team could use that fact to create allocation-less Regex optimizations.
> I guess the .NET team could use that fact to create allocation-less Regex optimizations.
The allocation here is the string returned from `Regex.Replace`; if anything is replaced, it has to return a new string (it returns the original string if nothing is replaced).
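That last point is easy to verify: when nothing matches, the very same string instance comes back.

```csharp
using System;
using System.Text.RegularExpressions;

class NoMatchDemo
{
    static void Main()
    {
        string input = "no tracker tags in this line";
        string result = Regex.Replace(input, "#Albert", "");

        // Prints True: with no match, Regex.Replace returns the input
        // instance itself, so no new string is allocated for the result.
        Console.WriteLine(ReferenceEquals(input, result));
    }
}
```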
> > I guess the .NET team could use that fact to create allocation-less Regex optimizations.
>
> The allocation here is the string returned from `Regex.Replace`; if anything is replaced, it has to return a new string (it returns the original string if nothing is replaced).

My note was about the chained scenario.
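For the chained scenario, one workaround available today (a sketch, not a new API) is to escape the literals and combine them into a single alternation so each line is scanned once. Note the semantics differ slightly from 54 chained replaces: a chained replace can also match text that only appears after earlier removals.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

class CombinedRemoval
{
    // Abbreviated subset of the literal substrings removed by the script.
    private static readonly string[] Literals =
        { "127.0.0.1", "0.0.0.0", "||", "^", "www.", "#Albert", "#Admedo" };

    // One compiled regex with every literal as an escaped alternative,
    // e.g. "127\.0\.0\.1|0\.0\.0\.0|\|\||..."
    private static readonly Regex Remover =
        new Regex(string.Join("|", Literals.Select(Regex.Escape)), RegexOptions.Compiled);

    static string Clean(string line) => Remover.Replace(line, "");

    static void Main() => Console.WriteLine(Clean("||www.example.com^#Albert"));
}
```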
I can confirm the fix works. (I didn't compile PowerShell with a nightly .NET 6.0 build; I only copy-pasted the dlls. After that, the original script works as fast as with .NET 5.0.)
### Description

In the PowerShell repository we got a report that a user script works significantly slower on PowerShell 7.1 (based on .NET 5.0) than on PowerShell 7.0 (based on .NET Core 3.1).

After some investigation we see that the script makes many calls to the `Regex.Replace(string, string)` method and chokes the GC with `ReadOnlyMemory<char>[]` allocations.

Please see https://github.com/PowerShell/PowerShell/issues/14087 for details. PerfView files are attached there.

### Regression?

Yes, it is a regression in .NET 5.0 from .NET Core 3.1.