Closed adamsitnik closed 1 year ago
Tagging subscribers to this area: @tarekgh, @krwq See info in area-owners.md if you want to be subscribed.
I have changed the tag to jit for now, as this is most likely not the UTF8Encoding code itself. If proven otherwise, please re-tag it with the encoding label.
@jeffhandley , @BruceForstall , @JulieLeeMSFT
Just adding a note here: Despite what the name suggests, the 4 regressions listed here are likely from methods in UTF8Utility and/or ASCIIUtility. The GetString benchmark doesn't seem to show any improvements, but it's not straightforward to reverse the changes that this benchmark hits because the UTF8Utility and ASCIIUtility methods are highly coupled and they do show decent speedup in the other benchmarks.
@echesakovMSFT please look into this.
CC @AndyAyersMS
For the following simple repro
robox@DDARM64S-003:~/echesako/Runtime_41699$ cat Program.cs
using System.IO;
using System.Runtime.CompilerServices;
using System.Text;

namespace Runtime_41699
{
    public class Program
    {
        public static void Main()
        {
            string unicode;
            byte[] bytes;
            UTF8Encoding utf8Encoding;

            unicode = File.ReadAllText("/home/robox/echesako/Runtime_41699/EnglishMostlyAscii.txt");
            utf8Encoding = new UTF8Encoding();
            bytes = utf8Encoding.GetBytes(unicode);

            while (true)
            {
                Consume(utf8Encoding.GetByteCount(unicode));
            }
        }

        // public int GetByteCount() => _utf8Encoding.GetByteCount(_unicode);
        // public byte[] GetBytes() => _utf8Encoding.GetBytes(_unicode);
        // public string GetString() => _utf8Encoding.GetString(_bytes);

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static void Consume<T>(in T _) { }
    }
}
I am seeing about 4.4 times as many cache-misses in net5.0 (21,878,633 vs. 5,009,414) and about 2.2 times as many stalled-cycles-backend (9,625,981,589 vs. 4,396,095,861). The following counter stat collections cover the loop only.
robox@DDARM64S-003:~/echesako/Runtime_41699$ cat netcoreapp3.1-Runtime_41699.txt
# started on Fri Sep 4 11:37:51 2020
Performance counter stats for process id '35473':
351,895 branch-misses
5,009,414 cache-misses
77,972,050,461 cpu-cycles
157,542,185,470 instructions # 2.02 insn per cycle
# 0.03 stalled cycles per insn
66,244,057 stalled-cycles-frontend # 0.08% frontend cycles idle
4,396,095,861 stalled-cycles-backend # 5.64% backend cycles idle
30.004692431 seconds time elapsed
robox@DDARM64S-003:~/echesako/Runtime_41699$ cat net5.0-Runtime_41699.txt
# started on Fri Sep 4 11:39:04 2020
Performance counter stats for process id '35507':
270,958 branch-misses
21,878,633 cache-misses
77,971,943,239 cpu-cycles
98,498,800,981 instructions # 1.26 insn per cycle
# 0.10 stalled cycles per insn
58,217,904 stalled-cycles-frontend # 0.07% frontend cycles idle
9,625,981,589 stalled-cycles-backend # 12.35% backend cycles idle
30.005090846 seconds time elapsed
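For anyone who wants to re-derive the ratios from the two perf stat dumps above, here is a minimal sketch. The counter values are copied from the output; the class and method names (PerfRatios, Ratio) are made up for illustration.

```csharp
using System;
using System.Globalization;

public static class PerfRatios
{
    // Plain ratio of two counters, e.g. instructions / cycles.
    public static double Ratio(long numerator, long denominator) =>
        (double)numerator / denominator;

    private static string F2(double value) =>
        value.ToString("F2", CultureInfo.InvariantCulture);

    public static void Main()
    {
        // netcoreapp3.1 counters (from netcoreapp3.1-Runtime_41699.txt)
        long cycles31 = 77_972_050_461, insn31 = 157_542_185_470;
        long miss31 = 5_009_414, stall31 = 4_396_095_861;

        // net5.0 counters (from net5.0-Runtime_41699.txt)
        long cycles50 = 77_971_943_239, insn50 = 98_498_800_981;
        long miss50 = 21_878_633, stall50 = 9_625_981_589;

        // Instructions per cycle, as perf reports in the "insn per cycle" column.
        Console.WriteLine("IPC netcoreapp3.1: " + F2(Ratio(insn31, cycles31)));
        Console.WriteLine("IPC net5.0:        " + F2(Ratio(insn50, cycles50)));

        // net5.0 relative to netcoreapp3.1.
        Console.WriteLine("cache-misses ratio:           " + F2(Ratio(miss50, miss31)));
        Console.WriteLine("stalled-cycles-backend ratio: " + F2(Ratio(stall50, stall31)));
    }
}
```

Both runs were sampled for the same 30 seconds, so cpu-cycles is nearly identical and the IPC drop shows up directly as fewer retired instructions.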
Combining PopCount with GetNonAsciiBytes in Utf16Utility.GetPointerToFirstInvalidChar
diff --git a/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs b/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
index f2df0ccdf53..73a1ea29bec 100644
--- a/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
+++ b/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
@@ -146,16 +146,18 @@ static Utf16Utility()
Vector128<ushort> charIsThreeByteUtf8Encoded;
uint mask;
+ uint popcnt;
if (AdvSimd.IsSupported)
{
charIsThreeByteUtf8Encoded = AdvSimd.Subtract(vectorZero, AdvSimd.ShiftRightLogical(utf16Data, 11));
- mask = GetNonAsciiBytes(AdvSimd.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte(), bitMask128);
+ popcnt = GetNonAsciiBytesAndPopCount(AdvSimd.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte(), bitMask128);
}
else
{
charIsThreeByteUtf8Encoded = Sse2.Subtract(vectorZero, Sse2.ShiftRightLogical(utf16Data, 11));
mask = (uint)Sse2.MoveMask(Sse2.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte());
+ popcnt = (uint)BitOperations.PopCount(mask);
}
// Each even bit of mask will be 1 only if the char was >= 0x0080,
@@ -182,8 +184,6 @@ static Utf16Utility()
// unpaired surrogates in our data. (Unpaired surrogates would invalidate
// our computed result and we'd have to throw it away.)
- uint popcnt = (uint)BitOperations.PopCount(mask);
-
// Surrogates need to be special-cased for two reasons: (a) we need
// to account for the fact that we over-counted in the addition above;
// and (b) they require separate validation.
@@ -485,6 +485,22 @@ static Utf16Utility()
return pInputBuffer;
}
+ [MethodImpl(MethodImplOptions.AggressiveInlining)]
+ private static uint GetNonAsciiBytesAndPopCount(Vector128<byte> value, Vector128<byte> bitMask128)
+ {
+ Debug.Assert(AdvSimd.Arm64.IsSupported);
+
+ Vector128<byte> mostSignificantBitIsSet = AdvSimd.ShiftRightArithmetic(value.AsSByte(), 7).AsByte();
+ Vector128<byte> extractedBits = AdvSimd.And(mostSignificantBitIsSet, bitMask128);
+
+ // self-pairwise add until all flags have moved to the first two bytes of the vector
+ extractedBits = AdvSimd.Arm64.AddPairwise(extractedBits, extractedBits);
+ extractedBits = AdvSimd.Arm64.AddPairwise(extractedBits, extractedBits);
+ extractedBits = AdvSimd.Arm64.AddPairwise(extractedBits, extractedBits);
+ Vector128<byte> popcnt = AdvSimd.PopCount(extractedBits);
+ return AdvSimd.Arm64.AddPairwise(popcnt, popcnt).ToScalar();
+ }
+
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint GetNonAsciiBytes(Vector128<byte> value, Vector128<byte> bitMask128)
{
seems to help a little bit with stalled-cycles-backend
robox@DDARM64S-003:~/echesako/Runtime_41699$ perf stat -e "branch-misses,cache-misses,cpu-cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend" -p 35949 sleep 30
Performance counter stats for process id '35949':
287,604 branch-misses
24,786,695 cache-misses
77,971,487,139 cpu-cycles
95,056,853,183 instructions # 1.22 insn per cycle
# 0.07 stalled cycles per insn
58,614,087 stalled-cycles-frontend # 0.08% frontend cycles idle
6,650,628,910 stalled-cycles-backend # 8.53% backend cycles idle
30.005114026 seconds time elapsed
This avoids moving mask back and forth between the SIMD and general-purpose register files.
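As a sanity check that the combined helper yields the same count as PopCount over the extracted mask, here is a scalar model of the lane arithmetic. It is only a sketch: bitMask128 is not shown in the diff, so the model ASSUMES each byte lane i carries the single bit 1 << (i % 8); the class and method names are hypothetical, and byte arrays stand in for Vector128<byte>.

```csharp
using System;
using System.Numerics;

public static class MaskPopCountModel
{
    // Scalar model of a 16-lane byte pairwise add: adjacent lanes of 'left'
    // fill the low half of the result, adjacent lanes of 'right' the high half.
    public static byte[] AddPairwise(byte[] left, byte[] right)
    {
        var result = new byte[16];
        for (int i = 0; i < 8; i++)
        {
            result[i] = (byte)(left[2 * i] + left[2 * i + 1]);
            result[8 + i] = (byte)(right[2 * i] + right[2 * i + 1]);
        }
        return result;
    }

    // Reference path: build a 16-bit mask from the top bit of each lane
    // (what MoveMask does), then count the set bits.
    public static uint MaskThenPopCount(byte[] lanes)
    {
        uint mask = 0;
        for (int i = 0; i < 16; i++)
        {
            if ((lanes[i] & 0x80) != 0) mask |= 1u << i;
        }
        return (uint)BitOperations.PopCount(mask);
    }

    // Model of the combined helper from the diff above.
    public static uint CombinedPopCount(byte[] lanes)
    {
        var bits = new byte[16];
        for (int i = 0; i < 16; i++)
        {
            // ASSUMPTION: bitMask128 holds 1 << (i % 8) in byte lane i.
            bits[i] = (lanes[i] & 0x80) != 0 ? (byte)(1 << (i % 8)) : (byte)0;
        }

        // Three self-pairwise adds fold the 16 flags into the first two lanes.
        for (int round = 0; round < 3; round++) bits = AddPairwise(bits, bits);

        // Per-lane population count, then one more pairwise add sums lanes 0 and 1.
        var popcnt = new byte[16];
        for (int i = 0; i < 16; i++) popcnt[i] = (byte)BitOperations.PopCount((uint)bits[i]);
        return AddPairwise(popcnt, popcnt)[0];
    }

    public static void Main()
    {
        var lanes = new byte[16];
        lanes[0] = 0xFF; lanes[1] = 0xFF; lanes[6] = 0xFF; lanes[13] = 0xFF;
        Console.WriteLine(MaskThenPopCount(lanes) == CombinedPopCount(lanes)); // prints "True"
    }
}
```

The model says nothing about latency; it only shows that the count can be produced without first materializing mask in a general-purpose register.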
The measurements below were taken on
processor : 0
model name : ARMv8 Processor rev 1 (v8l)
BogoMIPS : 38.40
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x1
CPU part : 0xd07
CPU revision : 1
processor : 1
model name : ARMv8 Processor rev 1 (v8l)
BogoMIPS : 38.40
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x1
CPU part : 0xd07
CPU revision : 1
processor : 2
model name : ARMv8 Processor rev 1 (v8l)
BogoMIPS : 38.40
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x1
CPU part : 0xd07
CPU revision : 1
processor : 3
model name : ARMv8 Processor rev 1 (v8l)
BogoMIPS : 38.40
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x1
CPU part : 0xd07
CPU revision : 1
.NET Core 3.1.6
BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
[Host] : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), Arm64 RyuJIT
Job-VJGWPE : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), Arm64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 3.1 Arguments=/p:DebugType=portable
Toolchain=netcoreapp3.1 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
GetByteCount | EnglishAllAscii | 118.9 μs | 0.24 μs | 0.21 μs | 118.9 μs | 118.6 μs | 119.4 μs | - | - | - | - |
GetBytes | EnglishAllAscii | 268.4 μs | 2.14 μs | 1.78 μs | 267.6 μs | 266.4 μs | 272.4 μs | 49.7881 | 49.7881 | 49.7881 | 163840 B |
GetString | EnglishAllAscii | 322.4 μs | 1.78 μs | 1.49 μs | 321.7 μs | 321.5 μs | 326.5 μs | 99.4898 | 99.4898 | 99.4898 | 327648 B |
GetByteCount | EnglishMostlyAscii | 328.3 μs | 0.43 μs | 0.39 μs | 328.1 μs | 328.0 μs | 329.3 μs | - | - | - | - |
GetBytes | EnglishMostlyAscii | 680.3 μs | 0.80 μs | 0.62 μs | 680.3 μs | 679.3 μs | 681.5 μs | 51.6304 | 51.6304 | 51.6304 | 169880 B |
GetString | EnglishMostlyAscii | 591.1 μs | 2.17 μs | 1.92 μs | 590.2 μs | 588.7 μs | 594.6 μs | 99.5370 | 99.5370 | 99.5370 | 327656 B |
GetByteCount | Chinese | 149.9 μs | 0.29 μs | 0.26 μs | 149.9 μs | 149.7 μs | 150.5 μs | - | - | - | - |
GetBytes | Chinese | 646.8 μs | 1.21 μs | 0.95 μs | 647.0 μs | 645.1 μs | 647.9 μs | 55.0000 | 55.0000 | 55.0000 | 177752 B |
GetString | Chinese | 943.7 μs | 2.97 μs | 2.63 μs | 943.3 μs | 940.9 μs | 950.1 μs | 44.1176 | 44.1176 | 44.1176 | 150112 B |
GetByteCount | Cyrillic | 130.3 μs | 0.21 μs | 0.19 μs | 130.4 μs | 130.0 μs | 130.7 μs | - | - | - | - |
GetBytes | Cyrillic | 487.0 μs | 1.30 μs | 1.08 μs | 486.8 μs | 485.2 μs | 489.2 μs | 29.2969 | 29.2969 | 29.2969 | 100880 B |
GetString | Cyrillic | 648.8 μs | 1.74 μs | 1.45 μs | 649.5 μs | 646.6 μs | 650.9 μs | 39.0625 | 39.0625 | 39.0625 | 130856 B |
GetByteCount | Greek | 163.7 μs | 0.08 μs | 0.07 μs | 163.7 μs | 163.6 μs | 163.8 μs | - | - | - | - |
GetBytes | Greek | 723.1 μs | 3.83 μs | 3.58 μs | 721.8 μs | 718.7 μs | 728.8 μs | 39.7727 | 39.7727 | 39.7727 | 129248 B |
GetString | Greek | 968.3 μs | 8.42 μs | 7.47 μs | 965.3 μs | 960.8 μs | 980.7 μs | 47.7941 | 47.7941 | 47.7941 | 164264 B |
.NET Core 5.0.0
BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT
Job-OUMKUS : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 5.0 Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
GetByteCount | EnglishAllAscii | 109.6 μs | 0.51 μs | 0.47 μs | 109.3 μs | 109.3 μs | 110.6 μs | - | - | - | - |
GetBytes | EnglishAllAscii | 261.1 μs | 2.71 μs | 2.53 μs | 259.9 μs | 258.7 μs | 266.3 μs | 48.9583 | 48.9583 | 48.9583 | 163854 B |
GetString | EnglishAllAscii | 205.6 μs | 0.58 μs | 0.46 μs | 205.6 μs | 204.8 μs | 206.3 μs | 99.5066 | 99.5066 | 99.5066 | 327677 B |
GetByteCount | EnglishMostlyAscii | 565.9 μs | 0.93 μs | 0.78 μs | 565.9 μs | 564.2 μs | 567.2 μs | - | - | - | 1 B |
GetBytes | EnglishMostlyAscii | 912.2 μs | 1.65 μs | 1.29 μs | 912.0 μs | 910.3 μs | 914.8 μs | 52.0833 | 52.0833 | 52.0833 | 169896 B |
GetString | EnglishMostlyAscii | 574.2 μs | 1.69 μs | 1.32 μs | 573.6 μs | 572.5 μs | 576.9 μs | 99.5370 | 99.5370 | 99.5370 | 327685 B |
GetByteCount | Chinese | 258.3 μs | 0.83 μs | 0.77 μs | 257.9 μs | 257.5 μs | 259.8 μs | - | - | - | - |
GetBytes | Chinese | 749.3 μs | 3.55 μs | 3.14 μs | 747.9 μs | 746.7 μs | 756.1 μs | 53.5714 | 53.5714 | 53.5714 | 177768 B |
GetString | Chinese | 896.2 μs | 9.65 μs | 9.03 μs | 891.6 μs | 889.5 μs | 914.0 μs | 45.1389 | 45.1389 | 45.1389 | 150126 B |
GetByteCount | Cyrillic | 223.8 μs | 0.46 μs | 0.39 μs | 223.6 μs | 223.4 μs | 224.8 μs | - | - | - | - |
GetBytes | Cyrillic | 592.1 μs | 4.66 μs | 4.13 μs | 590.1 μs | 588.8 μs | 602.4 μs | 30.0926 | 30.0926 | 30.0926 | 100889 B |
GetString | Cyrillic | 630.9 μs | 1.32 μs | 1.17 μs | 630.6 μs | 629.6 μs | 633.7 μs | 37.5000 | 37.5000 | 37.5000 | 130868 B |
GetByteCount | Greek | 281.7 μs | 0.35 μs | 0.31 μs | 281.8 μs | 281.2 μs | 282.3 μs | - | - | - | - |
GetBytes | Greek | 844.0 μs | 2.50 μs | 2.09 μs | 843.5 μs | 841.8 μs | 848.9 μs | 39.4737 | 39.4737 | 39.4737 | 129260 B |
GetString | Greek | 951.6 μs | 10.59 μs | 9.91 μs | 949.3 μs | 941.3 μs | 971.0 μs | 47.7941 | 47.7941 | 47.7941 | 164279 B |
.NET Core 5.0.0 (with the suggested change to Utf16Utility.GetPointerToFirstInvalidChar)
BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
[Host] : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
Job-XEOYJB : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 5.0 Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
GetByteCount | EnglishAllAscii | 109.2 μs | 0.46 μs | 0.41 μs | 109.0 μs | 108.8 μs | 110.0 μs | - | - | - | - |
GetBytes | EnglishAllAscii | 260.3 μs | 2.08 μs | 1.74 μs | 259.7 μs | 259.0 μs | 265.2 μs | 48.9583 | 48.9583 | 48.9583 | 163854 B |
GetString | EnglishAllAscii | 207.3 μs | 1.23 μs | 1.03 μs | 206.9 μs | 206.3 μs | 209.7 μs | 99.5066 | 99.5066 | 99.5066 | 327677 B |
GetByteCount | EnglishMostlyAscii | 414.5 μs | 0.49 μs | 0.41 μs | 414.3 μs | 414.3 μs | 415.6 μs | - | - | - | - |
GetBytes | EnglishMostlyAscii | 753.2 μs | 3.51 μs | 2.93 μs | 752.4 μs | 750.2 μs | 759.9 μs | 50.5952 | 50.5952 | 50.5952 | 169895 B |
GetString | EnglishMostlyAscii | 574.0 μs | 8.49 μs | 7.94 μs | 570.8 μs | 566.9 μs | 590.0 μs | 98.2143 | 98.2143 | 98.2143 | 327685 B |
GetByteCount | Chinese | 189.1 μs | 0.16 μs | 0.14 μs | 189.1 μs | 189.0 μs | 189.5 μs | - | - | - | - |
GetBytes | Chinese | 675.3 μs | 1.07 μs | 0.89 μs | 675.1 μs | 673.9 μs | 677.0 μs | 54.3478 | 54.3478 | 54.3478 | 177768 B |
GetString | Chinese | 895.7 μs | 6.66 μs | 6.23 μs | 892.0 μs | 889.4 μs | 904.7 μs | 45.1389 | 45.1389 | 45.1389 | 150126 B |
GetByteCount | Cyrillic | 164.1 μs | 0.11 μs | 0.09 μs | 164.1 μs | 164.0 μs | 164.3 μs | - | - | - | - |
GetBytes | Cyrillic | 527.7 μs | 2.71 μs | 2.40 μs | 527.0 μs | 525.4 μs | 533.3 μs | 29.1667 | 29.1667 | 29.1667 | 100889 B |
GetString | Cyrillic | 624.7 μs | 2.56 μs | 2.00 μs | 624.6 μs | 622.2 μs | 630.4 μs | 38.4615 | 38.4615 | 38.4615 | 130868 B |
GetByteCount | Greek | 206.9 μs | 0.44 μs | 0.39 μs | 206.7 μs | 206.6 μs | 208.0 μs | - | - | - | - |
GetBytes | Greek | 764.2 μs | 3.49 μs | 3.27 μs | 762.9 μs | 760.4 μs | 769.6 μs | 38.6905 | 38.6905 | 38.6905 | 129260 B |
GetString | Greek | 962.2 μs | 4.29 μs | 3.59 μs | 961.0 μs | 958.4 μs | 970.9 μs | 46.8750 | 46.8750 | 46.8750 | 164279 B |
It's clear from the GetByteCount data that the stalled cycles due to PopCount are only one of potentially many causes of the performance regression here. We need to do a thorough analysis to discover them all.
I am moving this to .NET 6.0. I don't believe this is a JIT issue, so I am relabeling this back to area-System.Text.Encoding.
cc @JulieLeeMSFT @jeffhandley
@jeffhandley I am assigning this to you now.
Thanks @echesakovMSFT for your analysis.
CC @GrabYourPitchforks
One more observation: if I remove the code under if ((AdvSimd.Arm64.IsSupported && BitConverter.IsLittleEndian) || Sse2.IsSupported) in Utf16Utility.GetPointerToFirstInvalidChar, keep the code under else if (Vector.IsHardwareAccelerated), and change that condition to if (Vector.IsHardwareAccelerated) (which I presume is true on Arm64), I get the following results.
(AdvSimd.Arm64.IsSupported && BitConverter.IsLittleEndian)
BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
[Host] : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
Job-IXQCIC : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 5.0 Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
GetByteCount | EnglishAllAscii | 109.7 μs | 0.65 μs | 0.60 μs | 109.5 μs | 109.1 μs | 111.0 μs | - | - | - | - |
GetBytes | EnglishAllAscii | 263.1 μs | 4.69 μs | 4.39 μs | 260.7 μs | 259.0 μs | 273.0 μs | 48.9583 | 48.9583 | 48.9583 | 163854 B |
GetString | EnglishAllAscii | 206.7 μs | 0.43 μs | 0.36 μs | 206.6 μs | 206.2 μs | 207.5 μs | 99.5066 | 99.5066 | 99.5066 | 327677 B |
GetByteCount | EnglishMostlyAscii | 566.1 μs | 2.25 μs | 2.11 μs | 564.5 μs | 563.8 μs | 568.9 μs | - | - | - | 1 B |
GetBytes | EnglishMostlyAscii | 909.5 μs | 2.34 μs | 1.96 μs | 909.3 μs | 907.1 μs | 914.0 μs | 52.0833 | 52.0833 | 52.0833 | 169896 B |
GetString | EnglishMostlyAscii | 573.2 μs | 1.73 μs | 1.35 μs | 572.8 μs | 572.1 μs | 577.0 μs | 99.5370 | 99.5370 | 99.5370 | 327685 B |
GetByteCount | Chinese | 257.7 μs | 0.37 μs | 0.33 μs | 257.7 μs | 257.0 μs | 258.1 μs | - | - | - | - |
GetBytes | Chinese | 749.4 μs | 3.56 μs | 2.97 μs | 748.3 μs | 746.2 μs | 756.4 μs | 53.5714 | 53.5714 | 53.5714 | 177768 B |
GetString | Chinese | 899.9 μs | 7.84 μs | 7.33 μs | 895.3 μs | 893.2 μs | 913.5 μs | 45.1389 | 45.1389 | 45.1389 | 150126 B |
GetByteCount | Cyrillic | 223.9 μs | 0.20 μs | 0.18 μs | 223.9 μs | 223.5 μs | 224.0 μs | - | - | - | - |
GetBytes | Cyrillic | 593.4 μs | 3.35 μs | 2.97 μs | 592.4 μs | 590.1 μs | 598.7 μs | 30.0926 | 30.0926 | 30.0926 | 100889 B |
GetString | Cyrillic | 630.6 μs | 1.14 μs | 0.95 μs | 630.8 μs | 628.6 μs | 631.7 μs | 37.5000 | 37.5000 | 37.5000 | 130868 B |
GetByteCount | Greek | 280.7 μs | 0.95 μs | 0.84 μs | 280.3 μs | 279.6 μs | 282.1 μs | - | - | - | - |
GetBytes | Greek | 844.0 μs | 2.82 μs | 2.35 μs | 843.8 μs | 840.8 μs | 849.3 μs | 39.4737 | 39.4737 | 39.4737 | 129260 B |
GetString | Greek | 963.8 μs | 1.75 μs | 1.46 μs | 964.2 μs | 960.2 μs | 965.7 μs | 47.7941 | 47.7941 | 47.7941 | 164279 B |
(Vector.IsHardwareAccelerated)
BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
[Host] : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
Job-ECUDLG : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 5.0 Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
GetByteCount | EnglishAllAscii | 109.2 μs | 0.19 μs | 0.16 μs | 109.2 μs | 109.0 μs | 109.6 μs | - | - | - | - |
GetBytes | EnglishAllAscii | 263.9 μs | 0.79 μs | 0.66 μs | 263.6 μs | 263.3 μs | 265.2 μs | 49.5690 | 49.5690 | 49.5690 | 163855 B |
GetString | EnglishAllAscii | 206.2 μs | 1.16 μs | 0.96 μs | 205.9 μs | 205.0 μs | 208.5 μs | 99.3151 | 99.3151 | 99.3151 | 327677 B |
GetByteCount | EnglishMostlyAscii | 274.0 μs | 0.23 μs | 0.18 μs | 274.0 μs | 273.5 μs | 274.2 μs | - | - | - | - |
GetBytes | EnglishMostlyAscii | 612.1 μs | 2.85 μs | 2.38 μs | 611.3 μs | 610.3 μs | 618.0 μs | 50.4808 | 50.4808 | 50.4808 | 169895 B |
GetString | EnglishMostlyAscii | 573.9 μs | 4.85 μs | 4.53 μs | 572.0 μs | 569.4 μs | 583.5 μs | 98.2143 | 98.2143 | 98.2143 | 327685 B |
GetByteCount | Chinese | 125.3 μs | 0.26 μs | 0.23 μs | 125.3 μs | 124.9 μs | 125.7 μs | - | - | - | - |
GetBytes | Chinese | 615.4 μs | 2.08 μs | 1.84 μs | 615.8 μs | 610.7 μs | 617.8 μs | 55.2885 | 55.2885 | 55.2885 | 177769 B |
GetString | Chinese | 900.0 μs | 6.04 μs | 5.65 μs | 896.8 μs | 894.4 μs | 912.4 μs | 45.1389 | 45.1389 | 45.1389 | 150126 B |
GetByteCount | Cyrillic | 108.8 μs | 0.07 μs | 0.07 μs | 108.7 μs | 108.7 μs | 108.9 μs | - | - | - | - |
GetBytes | Cyrillic | 466.2 μs | 0.97 μs | 0.76 μs | 466.3 μs | 464.7 μs | 467.1 μs | 29.4118 | 29.4118 | 29.4118 | 100889 B |
GetString | Cyrillic | 623.3 μs | 1.09 μs | 0.91 μs | 623.3 μs | 621.5 μs | 624.8 μs | 37.5000 | 37.5000 | 37.5000 | 130868 B |
GetByteCount | Greek | 136.8 μs | 0.15 μs | 0.14 μs | 136.7 μs | 136.6 μs | 137.1 μs | - | - | - | - |
GetBytes | Greek | 679.6 μs | 2.22 μs | 1.86 μs | 680.1 μs | 675.1 μs | 682.0 μs | 38.0435 | 38.0435 | 38.0435 | 129260 B |
GetString | Greek | 971.7 μs | 6.79 μs | 6.35 μs | 968.5 μs | 965.5 μs | 984.9 μs | 47.7941 | 47.7941 | 47.7941 | 164279 B |
One more observation - if I remove the code under if ((AdvSimd.Arm64.IsSupported && BitConverter.IsLittleEndian) || Sse2.IsSupported) in Utf16Utility.GetPointerToFirstInvalidChar but keep the code under else if (Vector.IsHardwareAccelerated) and replace it with if (Vector.IsHardwareAccelerated) which I presume would be true on Arm64 I will get the following results
If I understand it correctly, you are saying that when you revert the changes in GetPointerToFirstInvalidChar() so that the code falls back to the Vector.IsHardwareAccelerated path (which is what happened before we optimized the method with ARM64 intrinsics), we see the improvements?
From an offline conversation with @pgovind and @carlossanlop, I recall that the benchmarks touched the methods improved in https://github.com/dotnet/runtime/pull/39506, https://github.com/dotnet/runtime/pull/39508, https://github.com/dotnet/runtime/pull/39050, https://github.com/dotnet/runtime/pull/39041 and https://github.com/dotnet/runtime/pull/38653 (correct me if I am wrong). While you are there, can you make a similar change at the other places these PRs touched (especially those methods that mimic the SSE logic and could be improved by a different algorithm for ARM64) to see if the vectorized implementation was fast enough?
If I understand it correctly, you are saying that when you revert the changes in GetPointerToFirstInvalidChar() so that the code falls back to the Vector.IsHardwareAccelerated path (which is what happened before we optimized the method with ARM64 intrinsics), we see the improvements?
@kunalspathak That's right
From an offline conversation with @pgovind and @carlossanlop, I recall that the benchmarks touched the methods improved in #39506, #39508, #39050, #39041 and #38653 (correct me if I am wrong). While you are there, can you make a similar change at the other places these PRs touched (especially those methods that mimic the SSE logic and could be improved by a different algorithm for ARM64) to see if the vectorized implementation was fast enough?
@kunalspathak I didn't do exactly what you suggested - I don't see a clear way to undo all the work and switch back to the vectorized implementations. Instead, I altered the JIT in the following way
diff --git a/src/coreclr/src/jit/hwintrinsic.cpp b/src/coreclr/src/jit/hwintrinsic.cpp
index 5723ac8f322..95f8988babc 100644
--- a/src/coreclr/src/jit/hwintrinsic.cpp
+++ b/src/coreclr/src/jit/hwintrinsic.cpp
@@ -277,7 +277,7 @@ NamedIntrinsic HWIntrinsicInfo::lookupId(Compiler* comp,
if (strcmp(methodName, "get_IsSupported") == 0)
{
- return isIsaSupported ? NI_IsSupported_True : NI_IsSupported_False;
+ return NI_IsSupported_False;
}
else if (!isIsaSupported)
{
and measured before and after the change.
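To make the effect of that JIT change concrete, here is a small sketch of the path selection as described in this thread. The if / else if conditions are the ones quoted from Utf16Utility.GetPointerToFirstInvalidChar; the final "scalar" branch, the SelectPath helper and the PathProbe class are assumptions added for illustration. It also assumes Vector.IsHardwareAccelerated is not affected by the patch, which this thread does not confirm.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

public static class PathProbe
{
    // Mirrors the shape of the checks quoted in this thread:
    //   if ((AdvSimd.Arm64.IsSupported && BitConverter.IsLittleEndian) || Sse2.IsSupported)
    //   else if (Vector.IsHardwareAccelerated)
    // The last branch is a hypothetical non-vectorized fallback.
    public static string SelectPath(bool advSimdArm64, bool littleEndian, bool sse2, bool vectorAccelerated)
    {
        if ((advSimdArm64 && littleEndian) || sse2) return "intrinsics";
        if (vectorAccelerated) return "Vector<T>";
        return "scalar";
    }

    public static void Main()
    {
        // What an unmodified runtime would pick on the current machine.
        Console.WriteLine(SelectPath(
            AdvSimd.Arm64.IsSupported,
            BitConverter.IsLittleEndian,
            Sse2.IsSupported,
            Vector.IsHardwareAccelerated));

        // With the patch above, every hardware intrinsic get_IsSupported reports false.
        Console.WriteLine(SelectPath(
            false,
            BitConverter.IsLittleEndian,
            false,
            Vector.IsHardwareAccelerated));
    }
}
```

Unlike the source edit in the earlier experiment, this patch disables every intrinsic-based path in the runtime at once, not just the one in GetPointerToFirstInvalidChar.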
Before - get_IsSupported returns the "real" value
BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
[Host] : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
Job-ISREIK : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 5.0 Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
GetByteCount | EnglishAllAscii | 108.9 μs | 0.14 μs | 0.12 μs | 108.8 μs | 108.8 μs | 109.1 μs | - | - | - | - |
GetBytes | EnglishAllAscii | 263.9 μs | 2.42 μs | 2.15 μs | 262.9 μs | 261.9 μs | 268.4 μs | 49.7881 | 49.7881 | 49.7881 | 163855 B |
GetString | EnglishAllAscii | 205.8 μs | 0.36 μs | 0.30 μs | 205.8 μs | 205.3 μs | 206.4 μs | 99.5066 | 99.5066 | 99.5066 | 327677 B |
GetByteCount | EnglishMostlyAscii | 564.6 μs | 0.92 μs | 0.77 μs | 564.3 μs | 563.8 μs | 566.6 μs | - | - | - | 1 B |
GetBytes | EnglishMostlyAscii | 917.1 μs | 2.59 μs | 2.17 μs | 916.2 μs | 915.4 μs | 922.5 μs | 52.0833 | 52.0833 | 52.0833 | 169896 B |
GetString | EnglishMostlyAscii | 569.7 μs | 3.87 μs | 3.43 μs | 568.0 μs | 567.1 μs | 577.1 μs | 98.2143 | 98.2143 | 98.2143 | 327685 B |
GetByteCount | Chinese | 257.4 μs | 0.37 μs | 0.31 μs | 257.3 μs | 257.1 μs | 258.2 μs | - | - | - | - |
GetBytes | Chinese | 743.4 μs | 5.77 μs | 5.40 μs | 740.3 μs | 738.1 μs | 751.1 μs | 53.5714 | 53.5714 | 53.5714 | 177768 B |
GetString | Chinese | 895.0 μs | 1.42 μs | 1.11 μs | 895.0 μs | 893.5 μs | 897.9 μs | 45.1389 | 45.1389 | 45.1389 | 150126 B |
GetByteCount | Cyrillic | 223.9 μs | 0.24 μs | 0.20 μs | 223.9 μs | 223.5 μs | 224.3 μs | - | - | - | - |
GetBytes | Cyrillic | 591.2 μs | 2.98 μs | 2.64 μs | 590.0 μs | 588.3 μs | 597.4 μs | 30.0926 | 30.0926 | 30.0926 | 100889 B |
GetString | Cyrillic | 631.7 μs | 4.12 μs | 3.86 μs | 629.9 μs | 627.6 μs | 638.2 μs | 37.5000 | 37.5000 | 37.5000 | 130868 B |
GetByteCount | Greek | 281.5 μs | 0.23 μs | 0.19 μs | 281.5 μs | 281.3 μs | 282.0 μs | - | - | - | - |
GetBytes | Greek | 839.5 μs | 5.01 μs | 3.91 μs | 840.5 μs | 827.3 μs | 841.6 μs | 39.4737 | 39.4737 | 39.4737 | 129260 B |
GetString | Greek | 964.1 μs | 3.75 μs | 3.13 μs | 962.9 μs | 960.6 μs | 971.3 μs | 47.7941 | 47.7941 | 47.7941 | 164279 B |
After - get_IsSupported returns false
BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
[Host] : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
Job-FSEUVY : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 5.0 Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
GetByteCount | EnglishAllAscii | 109.3 μs | 0.13 μs | 0.11 μs | 109.2 μs | 109.2 μs | 109.5 μs | - | - | - | - |
GetBytes | EnglishAllAscii | 260.4 μs | 0.70 μs | 0.54 μs | 260.2 μs | 259.8 μs | 261.4 μs | 48.9583 | 48.9583 | 48.9583 | 163854 B |
GetString | EnglishAllAscii | 293.1 μs | 0.64 μs | 0.57 μs | 293.1 μs | 292.1 μs | 294.4 μs | 99.5370 | 99.5370 | 99.5370 | 327677 B |
GetByteCount | EnglishMostlyAscii | 274.2 μs | 0.49 μs | 0.43 μs | 274.2 μs | 273.6 μs | 275.2 μs | - | - | - | - |
GetBytes | EnglishMostlyAscii | 632.0 μs | 1.85 μs | 1.55 μs | 631.8 μs | 629.7 μs | 635.4 μs | 52.5000 | 52.5000 | 52.5000 | 169896 B |
GetString | EnglishMostlyAscii | 599.5 μs | 2.67 μs | 2.23 μs | 599.6 μs | 597.1 μs | 604.7 μs | 99.5370 | 99.5370 | 99.5370 | 327685 B |
GetByteCount | Chinese | 125.3 μs | 0.16 μs | 0.15 μs | 125.3 μs | 125.1 μs | 125.6 μs | - | - | - | - |
GetBytes | Chinese | 610.7 μs | 3.15 μs | 2.63 μs | 609.5 μs | 608.0 μs | 617.2 μs | 55.2885 | 55.2885 | 55.2885 | 177769 B |
GetString | Chinese | 907.8 μs | 4.83 μs | 4.28 μs | 906.2 μs | 904.2 μs | 917.9 μs | 45.1389 | 45.1389 | 45.1389 | 150126 B |
GetByteCount | Cyrillic | 108.9 μs | 0.13 μs | 0.11 μs | 108.9 μs | 108.7 μs | 109.0 μs | - | - | - | - |
GetBytes | Cyrillic | 478.7 μs | 1.72 μs | 1.43 μs | 478.4 μs | 476.2 μs | 481.5 μs | 30.3030 | 30.3030 | 30.3030 | 100889 B |
GetString | Cyrillic | 659.2 μs | 4.27 μs | 4.00 μs | 657.3 μs | 655.1 μs | 666.1 μs | 39.0625 | 39.0625 | 39.0625 | 130868 B |
GetByteCount | Greek | 137.0 μs | 0.20 μs | 0.19 μs | 137.0 μs | 136.6 μs | 137.3 μs | - | - | - | - |
GetBytes | Greek | 692.5 μs | 4.49 μs | 4.20 μs | 690.8 μs | 686.4 μs | 699.9 μs | 38.0435 | 38.0435 | 38.0435 | 129260 B |
GetString | Greek | 971.4 μs | 0.89 μs | 0.69 μs | 971.2 μs | 970.7 μs | 972.5 μs | 47.7941 | 47.7941 | 47.7941 | 164279 B |
As @GrabYourPitchforks pointed out in Teams chat - we don't need PopCount when computing the number of non-ASCII characters.
@carlossanlop When working on this issue you can consider one of the changes below to eliminate PopCount. Both have almost the same performance characteristics based on this benchmark.
Baseline
BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=5.0.100-rc.1.20454.5
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.20.45114, CoreFX 5.0.20.45114), Arm64 RyuJIT
Job-KKGTQA : .NET Core 5.0.0 (CoreCLR 5.0.20.45114, CoreFX 5.0.20.45114), Arm64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 5.0 Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
GetByteCount | EnglishAllAscii | 109.4 μs | 0.17 μs | 0.15 μs | 109.4 μs | 109.3 μs | 109.7 μs | - | - | - | - |
GetBytes | EnglishAllAscii | 261.9 μs | 3.65 μs | 3.42 μs | 259.7 μs | 259.1 μs | 268.4 μs | 49.1803 | 49.1803 | 49.1803 | 163854 B |
GetString | EnglishAllAscii | 207.3 μs | 2.09 μs | 1.85 μs | 206.5 μs | 205.5 μs | 211.2 μs | 99.6622 | 99.6622 | 99.6622 | 327677 B |
GetByteCount | EnglishMostlyAscii | 565.0 μs | 0.63 μs | 0.59 μs | 565.2 μs | 563.6 μs | 565.7 μs | - | - | - | 1 B |
GetBytes | EnglishMostlyAscii | 916.0 μs | 6.50 μs | 6.08 μs | 913.0 μs | 910.4 μs | 927.8 μs | 52.0833 | 52.0833 | 52.0833 | 169896 B |
GetString | EnglishMostlyAscii | 567.2 μs | 1.91 μs | 1.59 μs | 567.0 μs | 565.6 μs | 571.4 μs | 98.2143 | 98.2143 | 98.2143 | 327685 B |
GetByteCount | Chinese | 257.5 μs | 0.42 μs | 0.37 μs | 257.4 μs | 256.9 μs | 258.2 μs | - | - | - | - |
GetBytes | Chinese | 750.7 μs | 5.85 μs | 5.48 μs | 747.1 μs | 745.4 μs | 762.8 μs | 53.5714 | 53.5714 | 53.5714 | 177768 B |
GetString | Chinese | 901.0 μs | 11.19 μs | 10.47 μs | 894.8 μs | 891.6 μs | 920.5 μs | 45.1389 | 45.1389 | 45.1389 | 150127 B |
GetByteCount | Cyrillic | 224.1 μs | 0.16 μs | 0.14 μs | 224.2 μs | 223.8 μs | 224.3 μs | - | - | - | - |
GetBytes | Cyrillic | 583.2 μs | 1.17 μs | 0.91 μs | 583.1 μs | 581.2 μs | 584.6 μs | 30.0926 | 30.0926 | 30.0926 | 100889 B |
GetString | Cyrillic | 634.6 μs | 1.79 μs | 1.50 μs | 634.0 μs | 633.0 μs | 638.2 μs | 37.5000 | 37.5000 | 37.5000 | 130868 B |
GetByteCount | Greek | 281.6 μs | 0.40 μs | 0.35 μs | 281.6 μs | 281.2 μs | 282.5 μs | - | - | - | - |
GetBytes | Greek | 836.5 μs | 2.49 μs | 1.94 μs | 836.3 μs | 834.1 μs | 841.1 μs | 39.4737 | 39.4737 | 39.4737 | 129260 B |
GetString | Greek | 951.8 μs | 6.22 μs | 5.51 μs | 951.1 μs | 944.0 μs | 965.1 μs | 47.7941 | 47.7941 | 47.7941 | 164279 B |
Vector ops
diff --git a/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs b/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
index f2df0ccdf53..d395252384a 100644
--- a/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
+++ b/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
@@ -146,16 +146,18 @@ static Utf16Utility()
Vector128<ushort> charIsThreeByteUtf8Encoded;
uint mask;
+ uint popcnt;
if (AdvSimd.IsSupported)
{
charIsThreeByteUtf8Encoded = AdvSimd.Subtract(vectorZero, AdvSimd.ShiftRightLogical(utf16Data, 11));
- mask = GetNonAsciiBytes(AdvSimd.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte(), bitMask128);
+ popcnt = CountNonAsciiBytes(AdvSimd.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte());
}
else
{
charIsThreeByteUtf8Encoded = Sse2.Subtract(vectorZero, Sse2.ShiftRightLogical(utf16Data, 11));
mask = (uint)Sse2.MoveMask(Sse2.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte());
+ popcnt = (uint)BitOperations.PopCount(mask);
}
// Each even bit of mask will be 1 only if the char was >= 0x0080,
@@ -182,8 +184,6 @@ static Utf16Utility()
// unpaired surrogates in our data. (Unpaired surrogates would invalidate
// our computed result and we'd have to throw it away.)
- uint popcnt = (uint)BitOperations.PopCount(mask);
-
// Surrogates need to be special-cased for two reasons: (a) we need
// to account for the fact that we over-counted in the addition above;
// and (b) they require separate validation.
@@ -485,6 +485,17 @@ static Utf16Utility()
return pInputBuffer;
}
+ [MethodImpl(MethodImplOptions.AggressiveInlining)]
+ private static uint CountNonAsciiBytes(Vector128<byte> vec)
+ {
+ vec = AdvSimd.ShiftRightLogical(vec, 7);
+ vec = AdvSimd.Arm64.AddPairwise(vec, vec);
+ vec = AdvSimd.Arm64.AddPairwise(vec, vec);
+ vec = AdvSimd.Arm64.AddPairwise(vec, vec);
+ vec = AdvSimd.Arm64.AddPairwise(vec, vec);
+ return vec.ToScalar();
+ }
+
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint GetNonAsciiBytes(Vector128<byte> value, Vector128<byte> bitMask128)
{
BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=5.0.100-rc.1.20454.5
[Host] : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.45114), Arm64 RyuJIT
Job-HJYIJI : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.45114), Arm64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 5.0 Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
GetByteCount | EnglishAllAscii | 109.4 μs | 0.55 μs | 0.48 μs | 109.1 μs | 108.9 μs | 110.3 μs | - | - | - | - |
GetBytes | EnglishAllAscii | 262.6 μs | 3.82 μs | 3.57 μs | 260.4 μs | 259.5 μs | 269.4 μs | 49.1803 | 49.1803 | 49.1803 | 163854 B |
GetString | EnglishAllAscii | 205.8 μs | 0.44 μs | 0.37 μs | 205.6 μs | 205.4 μs | 206.5 μs | 99.5066 | 99.5066 | 99.5066 | 327677 B |
GetByteCount | EnglishMostlyAscii | 332.3 μs | 0.26 μs | 0.23 μs | 332.3 μs | 332.0 μs | 332.8 μs | - | - | - | - |
GetBytes | EnglishMostlyAscii | 674.1 μs | 1.54 μs | 1.37 μs | 673.9 μs | 672.5 μs | 676.7 μs | 52.0833 | 52.0833 | 52.0833 | 169896 B |
GetString | EnglishMostlyAscii | 573.1 μs | 7.73 μs | 6.85 μs | 569.8 μs | 567.9 μs | 590.3 μs | 98.2143 | 98.2143 | 98.2143 | 327685 B |
GetByteCount | Chinese | 151.6 μs | 0.11 μs | 0.09 μs | 151.6 μs | 151.5 μs | 151.8 μs | - | - | - | - |
GetBytes | Chinese | 642.5 μs | 2.60 μs | 2.17 μs | 641.8 μs | 640.0 μs | 646.9 μs | 55.0000 | 55.0000 | 55.0000 | 177769 B |
GetString | Chinese | 897.8 μs | 9.84 μs | 8.72 μs | 892.7 μs | 890.3 μs | 916.3 μs | 45.1389 | 45.1389 | 45.1389 | 150126 B |
GetByteCount | Cyrillic | 131.9 μs | 0.24 μs | 0.20 μs | 132.0 μs | 131.6 μs | 132.2 μs | - | - | - | - |
GetBytes | Cyrillic | 485.1 μs | 5.13 μs | 4.55 μs | 482.1 μs | 481.5 μs | 494.4 μs | 30.3030 | 30.3030 | 30.3030 | 100890 B |
GetString | Cyrillic | 626.3 μs | 4.51 μs | 4.00 μs | 624.5 μs | 622.5 μs | 635.3 μs | 37.5000 | 37.5000 | 37.5000 | 130868 B |
GetByteCount | Greek | 165.7 μs | 0.15 μs | 0.12 μs | 165.7 μs | 165.6 μs | 166.0 μs | - | - | - | - |
GetBytes | Greek | 707.8 μs | 6.85 μs | 6.08 μs | 707.8 μs | 699.7 μs | 721.7 μs | 38.0435 | 38.0435 | 38.0435 | 129260 B |
GetString | Greek | 967.0 μs | 3.19 μs | 2.49 μs | 966.4 μs | 965.1 μs | 974.3 μs | 47.7941 | 47.7941 | 47.7941 | 164279 B |
Vector ops + Scalar ops
diff --git a/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs b/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
index f2df0ccdf53..c958d7ecee4 100644
--- a/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
+++ b/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
@@ -146,16 +146,18 @@ static Utf16Utility()
Vector128<ushort> charIsThreeByteUtf8Encoded;
uint mask;
+ uint popcnt;
if (AdvSimd.IsSupported)
{
charIsThreeByteUtf8Encoded = AdvSimd.Subtract(vectorZero, AdvSimd.ShiftRightLogical(utf16Data, 11));
- mask = GetNonAsciiBytes(AdvSimd.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte(), bitMask128);
+ popcnt = CountNonAsciiBytes(AdvSimd.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte());
}
else
{
charIsThreeByteUtf8Encoded = Sse2.Subtract(vectorZero, Sse2.ShiftRightLogical(utf16Data, 11));
mask = (uint)Sse2.MoveMask(Sse2.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte());
+ popcnt = (uint)BitOperations.PopCount(mask);
}
// Each even bit of mask will be 1 only if the char was >= 0x0080,
@@ -182,8 +184,6 @@ static Utf16Utility()
// unpaired surrogates in our data. (Unpaired surrogates would invalidate
// our computed result and we'd have to throw it away.)
- uint popcnt = (uint)BitOperations.PopCount(mask);
-
// Surrogates need to be special-cased for two reasons: (a) we need
// to account for the fact that we over-counted in the addition above;
// and (b) they require separate validation.
@@ -485,6 +485,17 @@ static Utf16Utility()
return pInputBuffer;
}
+ [MethodImpl(MethodImplOptions.AggressiveInlining)]
+ private static uint CountNonAsciiBytes(Vector128<byte> vec)
+ {
+ vec = AdvSimd.ShiftRightLogical(vec, 7);
+ ulong temp = AdvSimd.Arm64.AddPairwiseScalar(vec.AsUInt64()).ToScalar();
+ temp += (temp >> 32);
+ temp += (temp >> 16);
+ temp += (temp >> 8);
+ return (byte)temp;
+ }
+
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static uint GetNonAsciiBytes(Vector128<byte> value, Vector128<byte> bitMask128)
{
BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=5.0.100-rc.1.20454.5
[Host] : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.45114), Arm64 RyuJIT
Job-WVTENM : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.45114), Arm64 RyuJIT
PowerPlanMode=00000000-0000-0000-0000-000000000000 Runtime=.NET Core 5.0 Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
GetByteCount | EnglishAllAscii | 109.3 μs | 0.69 μs | 0.61 μs | 109.1 μs | 108.7 μs | 110.4 μs | - | - | - | - |
GetBytes | EnglishAllAscii | 260.8 μs | 1.28 μs | 1.07 μs | 260.6 μs | 259.7 μs | 263.5 μs | 48.9583 | 48.9583 | 48.9583 | 163854 B |
GetString | EnglishAllAscii | 207.2 μs | 2.16 μs | 1.80 μs | 206.4 μs | 205.6 μs | 212.1 μs | 99.6622 | 99.6622 | 99.6622 | 327677 B |
GetByteCount | EnglishMostlyAscii | 329.4 μs | 0.15 μs | 0.12 μs | 329.4 μs | 329.2 μs | 329.6 μs | - | - | - | - |
GetBytes | EnglishMostlyAscii | 673.6 μs | 5.31 μs | 4.96 μs | 672.2 μs | 667.7 μs | 681.8 μs | 52.0833 | 52.0833 | 52.0833 | 169896 B |
GetString | EnglishMostlyAscii | 566.7 μs | 1.20 μs | 1.00 μs | 566.2 μs | 565.7 μs | 569.0 μs | 98.2143 | 98.2143 | 98.2143 | 327685 B |
GetByteCount | Chinese | 150.4 μs | 0.11 μs | 0.10 μs | 150.4 μs | 150.3 μs | 150.7 μs | - | - | - | - |
GetBytes | Chinese | 636.4 μs | 2.00 μs | 1.67 μs | 635.9 μs | 634.3 μs | 640.7 μs | 55.0000 | 55.0000 | 55.0000 | 177769 B |
GetString | Chinese | 899.4 μs | 4.00 μs | 3.75 μs | 898.8 μs | 895.3 μs | 908.0 μs | 45.1389 | 45.1389 | 45.1389 | 150126 B |
GetByteCount | Cyrillic | 130.6 μs | 0.13 μs | 0.11 μs | 130.6 μs | 130.4 μs | 130.8 μs | - | - | - | - |
GetBytes | Cyrillic | 483.9 μs | 2.54 μs | 2.25 μs | 483.9 μs | 480.8 μs | 488.9 μs | 30.3030 | 30.3030 | 30.3030 | 100889 B |
GetString | Cyrillic | 637.4 μs | 4.27 μs | 3.99 μs | 635.4 μs | 633.5 μs | 646.2 μs | 37.5000 | 37.5000 | 37.5000 | 130868 B |
GetByteCount | Greek | 164.6 μs | 0.23 μs | 0.21 μs | 164.6 μs | 164.2 μs | 164.9 μs | - | - | - | - |
GetBytes | Greek | 704.7 μs | 6.24 μs | 5.83 μs | 702.4 μs | 698.1 μs | 715.8 μs | 38.0435 | 38.0435 | 38.0435 | 129260 B |
GetString | Greek | 948.6 μs | 7.42 μs | 6.94 μs | 947.5 μs | 940.8 μs | 964.7 μs | 46.8750 | 46.8750 | 46.8750 | 164279 B |
cc @TamarChristinaArm who can suggest what would be the best choice here.
@echesakovMSFT I'll need to take a closer look, but as an initial assessment the `temp += (temp >> 32);` should be slightly better if you are generating an `ADD` with a shifted register (as in, a single instruction rather than a separate `add` + `shift`).
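To make the scalar fold concrete, here is a minimal Python model of the "Vector ops + Scalar ops" variant of `CountNonAsciiBytes`. It is a sketch only: the function names are hypothetical, 64-bit integers stand in for the two halves of the `Vector128<byte>`, and it says nothing about which instructions the JIT actually emits.

```python
MASK64 = (1 << 64) - 1
LOW_BIT_OF_EACH_BYTE = 0x0101010101010101


def fold_high_bits(lo: int, hi: int) -> int:
    """Count bytes with the high bit set; lo/hi model the two 64-bit lanes."""
    # Per-byte logical shift right by 7: every byte becomes 0 or 1.
    lo = (lo >> 7) & LOW_BIT_OF_EACH_BYTE
    hi = (hi >> 7) & LOW_BIT_OF_EACH_BYTE
    # AddPairwiseScalar on the two 64-bit lanes: every byte is now 0, 1 or 2.
    temp = (lo + hi) & MASK64
    # Each step adds the upper half of the remaining bytes onto the lower half.
    # The low byte never exceeds 16, so no carry can corrupt it.
    temp = (temp + (temp >> 32)) & MASK64
    temp = (temp + (temp >> 16)) & MASK64
    temp = (temp + (temp >> 8)) & MASK64
    return temp & 0xFF


def count_high_bits(block: bytes) -> int:
    """Hypothetical helper: split a 16-byte block into two little-endian lanes."""
    assert len(block) == 16
    lo = int.from_bytes(block[:8], "little")
    hi = int.from_bytes(block[8:], "little")
    return fold_high_bits(lo, hi)
```

Each of the three `temp += temp >> n` lines is the shape that can map to a single add with a shifted register operand, which is the point made above.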
That said, looking at the algorithm, do you really need the reduction inside the loop? The value seems to really only be a counter. So instead, can't you keep the value as a `Vector128<uint>` during the loop, perform the final `addp` and the move to the general-register side after the loop, and add it to `tempUtf8CodeUnitCountAdjustment`?
I think we should look at the function as a whole instead of piecewise.
For instance, since the only things done on `popcnt` are `add` and `sub`, there's no need to transfer between register files in the loop.
+ private static Vector64<uint> CountNonAsciiBytes(Vector128<byte> vec)
and using `AddScalar` instead during the loop avoids the transfer, as we can do scalar arithmetic on the SIMD side.
You're right. You could avoid doing it piecemeal inside the loop, but you'd need to use caution to avoid integer overflow. If you assume that the accumulator vector is a `Vector128<byte>`, then you could run at most 255 loop iterations without risking overflow. Then, before every 256th iteration, you'd horizontal add the vector accumulator elements together and add the result to a running scalar accumulator.
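The flush-before-overflow scheme described above can be sketched as follows. This is a Python model under stated assumptions, not the runtime's code: `count_high_bits_blocked` and `flush_every` are hypothetical names, and a list of 16 integers wrapped to 8 bits stands in for the `Vector128<byte>` accumulator.

```python
def count_high_bits_blocked(data: bytes, flush_every: int = 255) -> int:
    """Count bytes >= 0x80 using 16 byte-wide lane counters."""
    assert len(data) % 16 == 0
    # Each lane grows by at most 1 per iteration, so 255 iterations is the
    # most a byte lane can absorb before it could wrap around.
    assert 1 <= flush_every <= 255

    lanes = [0] * 16   # models the Vector128<byte> accumulator
    total = 0          # running scalar accumulator
    pending = 0        # iterations since the last flush

    for offset in range(0, len(data), 16):
        block = data[offset:offset + 16]
        for i in range(16):
            # Wrapping 8-bit add, as a byte lane would behave.
            lanes[i] = (lanes[i] + (block[i] >> 7)) & 0xFF
        pending += 1
        if pending == flush_every:
            total += sum(lanes)   # horizontal add, then reset the lanes
            lanes = [0] * 16
            pending = 0

    # Loop epilogue: fold in whatever is left in the lanes.
    return total + sum(lanes)
```

The horizontal add then runs once per 255 blocks instead of once per block, at the cost of one extra counter and branch in the loop.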
> Then, before every 256th iteration, you'd horizontal add the vector accumulator elements together and add the result to a running scalar accumulator.
You can avoid that by using a widening pairwise addition `UADDLP` (instead of the normal pairwise add) until you get a `Vector128<uint>`, and then use a widening addition when accumulating into your counter, which can be a `Vector128<ulong>`. You'd need to accumulate into two `Vector128<ulong>` using `UADDW{2}` and add those two up outside the loop, but that's just a cheap loop epilogue.
I think you can also do it and avoid the extra register pressure by using widening pairwise additions (`UADDLP`) instead of the normal `ADDP` to get a single `Vector128<ulong>`. It requires one less register but requires an additional `VADD` into the counter.
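A rough Python model of the single-accumulator variant just described: three widening pairwise additions take the sixteen 0/1 bytes to two 64-bit lanes, which are then added into a two-lane counter that cannot realistically overflow. The helper names (`widening_pairwise_add`, `count_high_bits_widening`) are hypothetical, and the model only mirrors the arithmetic, not instruction selection or register allocation.

```python
MASK64 = (1 << 64) - 1


def widening_pairwise_add(lanes, out_bits):
    """Add adjacent lanes into lanes twice as wide (the UADDLP shape)."""
    mask = (1 << out_bits) - 1
    return [(lanes[i] + lanes[i + 1]) & mask for i in range(0, len(lanes), 2)]


def count_high_bits_widening(data: bytes) -> int:
    assert len(data) % 16 == 0
    counter = [0, 0]   # models a Vector128<ulong> accumulator

    for offset in range(0, len(data), 16):
        bits = [b >> 7 for b in data[offset:offset + 16]]   # 16 x 8-bit, each 0 or 1
        halves = widening_pairwise_add(bits, 16)            # 8 x 16-bit
        words = widening_pairwise_add(halves, 32)           # 4 x 32-bit
        dwords = widening_pairwise_add(words, 64)           # 2 x 64-bit
        # Plain vector add into the 64-bit counter lanes.
        counter = [(counter[0] + dwords[0]) & MASK64,
                   (counter[1] + dwords[1]) & MASK64]

    # Loop epilogue: one final horizontal add of the two lanes.
    return counter[0] + counter[1]
```

Compared with the byte-lane accumulator, no periodic flush is needed, because the widening happens on every iteration.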
Also, on architectures that support it you could usually use a `UDOT` to get a fast widening accumulation from `16b` to `4s` by using a vector of ones as the multiplicand. However, this would only be beneficial if you needed to accumulate the results as an int. In this case, since you'd want a long, you can't do the accumulation itself in the dot product, so you'd have to use a vector of zeros as the initial value and therefore have a `movi` before each call, which makes it not really a faster sequence.
I also think it's better to use an `AND` or `BIC` here
vec = AdvSimd.ShiftRightLogical(vec, 7);
as `USHR` is restricted to one NEON pipe, whereas an `AND` can go in any. You just have to hoist the constant out of the loop.
Based on the recent data, we want to try to at least work around this regression in 5.0.0. I'm not ready to consider it release-blocking yet, but let's see what the workaround/fix would be.
@jeffhandley Sorry, I'm slightly confused: this ticket is about `Utf8Encoding`, but so far @echesakovMSFT, @GrabYourPitchforks and I have been discussing, I believe, `Utf16Encoding`. On which one is the big regression?
Both of them have some reasonably simple things you can do, without changing the entire algorithm, to speed up the common cases, `AllEnglishAscii` and `AlmostEnglishAscii`. Utf8 should be the simplest, as that just finds the first non-ASCII character.
In the initial test you use to check whether you have any non-ASCII characters, you can optimize the
ulong mask = GetNonAsciiBytes(AdvSimd.LoadVector128(pInputBuffer), bitMask128);
if (mask != 0)
to
smaxp pInputBuffer.16b, pInputBuffer.16b, pInputBuffer.16b
fmov synd, pInputBuffer
tst synd, 0x8080808080808080
To find the index you use https://github.com/dotnet/runtime/pull/39507#discussion_r468097953, which has you restart the calculation, but that's fine since you are exiting the loop anyway.
For Utf8 that's a simple modification within the current algorithm that should allow you to more than recover the performance. Utf16 is a bit more complicated but can use the same trick to avoid doing the more expensive operation until needed.
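The cheap "is there any non-ASCII byte in this block?" check can be modelled in Python as below. This is a hedged sketch: `has_non_ascii` is a hypothetical name, and the model takes the pairwise maximum over unsigned byte values. Under a signed comparison a byte with the high bit set compares below its ASCII neighbour, so the exact pairwise instruction to use is worth verifying against the architecture manual.

```python
HIGH_BIT_OF_EACH_BYTE = 0x8080808080808080


def has_non_ascii(block: bytes) -> bool:
    """Return True if any of the 16 bytes is >= 0x80."""
    assert len(block) == 16
    # Pairwise (unsigned) max of adjacent bytes: 16 bytes fold down to 8,
    # and a high bit in either byte of a pair survives in the result.
    folded = bytes(max(block[i], block[i + 1]) for i in range(0, 16, 2))
    # Move the low 64 bits to a general-purpose register...
    synd = int.from_bytes(folded, "little")
    # ...and test the high bit of every byte at once.
    return (synd & HIGH_BIT_OF_EACH_BYTE) != 0
```

The result only answers yes or no. Locating the first offending byte is the separate, more expensive step that runs once the fast path has already been left.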
If we really need to do something in 5.0 and we're running out of runway then the absolute safest thing to do would be to change the one line:
To:
if (Sse2.IsSupported)
This will cause the `UTF8Encoding.GetByteCount(string)` method to fall back to the existing `Vector<T>`-based code paths as they existed in 3.1 instead of using the new intrinsics that were introduced in 5.0.
I assume that if we want to do this then we'd schedule a "proper" fix to come in during 5.0.1.
Also, here's a magic decoder ring for the perf tests.
The test method `GetByteCount` calls `Utf16Utility.GetPointerToFirstInvalidChar`.
The test method `GetBytes` calls `Utf16Utility.GetPointerToFirstInvalidChar`, then calls `Utf8Utility.TranscodeToUtf8`.
The test method `GetString` calls `Utf8Utility.GetPointerToFirstInvalidByte`, then calls `Utf8Utility.TranscodeToUtf16`.
> On which one is the big regression?
Yeah, the `GetByteCount` regression is the one that was noteworthy to me, @adamsitnik, and @tannergooding. That one regressed across the board and we should consider addressing that in RC2.
Falling back to the non-vectorized path, as @GrabYourPitchforks suggested, has a low enough risk that it could be considered for Ask Mode (given the existing test coverage of that). And then as @GrabYourPitchforks stated, we could pursue a more complete fix after 5.0.0.
There are regressions in `GetBytes()` as well, and it also calls `Utf8Utility.TranscodeToUtf8`, but I assume we don't want to revert that because the benchmark also calls `Utf16Utility.GetPointerToFirstInvalidChar`, which we are reverting anyway?
I opened https://github.com/dotnet/performance/issues/1512 to track changing the benchmarks so that each benchmark is testing exactly one worker function. But the best evidence we have right now suggests that `GetPointerToFirstInvalidChar` is the bulk of the regression, so that's where the efforts / reversions are currently being focused.
Re-opening this issue as #42052 was meant to be a temporary workaround and the underlying issue is still open
@jeffhandley should this still be in the 5.0.0 milestone or be moved to 6.0/7.0?
Assigning to @TIHan.
I have some data comparing .NET 5, 7, 8:
BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22000.1574/21H2)
Snapdragon 7c 2.40 GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK=8.0.100-preview.2.23157.25
[Host] : .NET 7.0.4 (7.0.423.11508), Arm64 RyuJIT AdvSIMD
Job-YXVEBW : .NET 5.0.17 (5.0.1722.21314), Arm64 RyuJIT AdvSIMD
Job-HYZVQW : .NET 7.0.4 (7.0.423.11508), Arm64 RyuJIT AdvSIMD
Job-MMQVIX : .NET 8.0.0 (8.0.23.12803), Arm64 RyuJIT AdvSIMD
PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20
MinIterationCount=15 WarmupCount=1
Method | Runtime | Input | Mean | Error | StdDev | Median | Min | Max | Ratio | RatioSD | Gen0 | Gen1 | Gen2 | Allocated | Alloc Ratio |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
GetByteCount | .NET 5.0 | EnglishAllAscii | 36.34 us | 0.592 us | 0.554 us | 36.09 us | 35.78 us | 37.33 us | 1.00 | 0.00 | - | - | - | - | NA |
GetByteCount | .NET 7.0 | EnglishAllAscii | 12.90 us | 0.248 us | 0.243 us | 12.84 us | 12.66 us | 13.53 us | 0.36 | 0.01 | - | - | - | - | NA |
GetByteCount | .NET 8.0 | EnglishAllAscii | 12.13 us | 0.380 us | 0.373 us | 12.02 us | 11.73 us | 13.21 us | 0.33 | 0.01 | - | - | - | - | NA |
GetBytes | .NET 5.0 | EnglishAllAscii | 98.71 us | 1.625 us | 1.440 us | 98.52 us | 97.00 us | 101.97 us | 1.00 | 0.00 | 52.4691 | 52.4691 | 52.4691 | 167576 B | 1.00 |
GetBytes | .NET 7.0 | EnglishAllAscii | 82.24 us | 1.497 us | 1.401 us | 82.08 us | 79.84 us | 84.30 us | 0.84 | 0.02 | 52.4611 | 52.4611 | 52.4611 | 167594 B | 1.00 |
GetBytes | .NET 8.0 | EnglishAllAscii | 78.99 us | 1.570 us | 1.468 us | 78.70 us | 76.72 us | 81.75 us | 0.80 | 0.02 | 52.3897 | 52.3897 | 52.3897 | 167594 B | 1.00 |
GetString | .NET 5.0 | EnglishAllAscii | 89.11 us | 1.685 us | 1.803 us | 89.50 us | 86.48 us | 91.65 us | 1.00 | 0.00 | 99.9313 | 99.9313 | 99.9313 | 335120 B | 1.00 |
GetString | .NET 7.0 | EnglishAllAscii | 91.48 us | 1.823 us | 1.951 us | 91.26 us | 88.83 us | 96.25 us | 1.03 | 0.03 | 99.6429 | 99.6429 | 99.6429 | 335154 B | 1.00 |
GetString | .NET 8.0 | EnglishAllAscii | 88.26 us | 1.677 us | 1.722 us | 88.63 us | 85.71 us | 90.54 us | 0.99 | 0.02 | 99.7268 | 99.7268 | 99.7268 | 335154 B | 1.00 |
GetByteCount | .NET 5.0 | EnglishMostlyAscii | 89.67 us | 1.470 us | 1.375 us | 89.46 us | 87.98 us | 92.34 us | 1.00 | 0.00 | - | - | - | - | NA |
GetByteCount | .NET 7.0 | EnglishMostlyAscii | 71.91 us | 0.320 us | 0.299 us | 71.75 us | 71.60 us | 72.40 us | 0.80 | 0.01 | - | - | - | - | NA |
GetByteCount | .NET 8.0 | EnglishMostlyAscii | 68.23 us | 0.530 us | 0.469 us | 68.00 us | 67.77 us | 69.29 us | 0.76 | 0.01 | - | - | - | - | NA |
GetBytes | .NET 5.0 | EnglishMostlyAscii | 221.88 us | 1.747 us | 1.634 us | 221.87 us | 219.81 us | 225.12 us | 1.00 | 0.00 | 52.0833 | 52.0833 | 52.0833 | 173616 B | 1.00 |
GetBytes | .NET 7.0 | EnglishMostlyAscii | 207.47 us | 1.624 us | 1.519 us | 207.27 us | 205.11 us | 210.26 us | 0.94 | 0.01 | 52.5000 | 52.5000 | 52.5000 | 173634 B | 1.00 |
GetBytes | .NET 8.0 | EnglishMostlyAscii | 206.32 us | 4.211 us | 4.850 us | 204.08 us | 201.10 us | 216.76 us | 0.94 | 0.02 | 51.9481 | 51.9481 | 51.9481 | 173634 B | 1.00 |
GetString | .NET 5.0 | EnglishMostlyAscii | 265.13 us | 18.047 us | 20.783 us | 255.43 us | 242.75 us | 314.89 us | 1.00 | 0.00 | 87.8906 | 87.8906 | 87.8906 | 335126 B | 1.00 |
GetString | .NET 7.0 | EnglishMostlyAscii | 289.28 us | 16.755 us | 19.295 us | 280.65 us | 267.11 us | 333.52 us | 1.10 | 0.09 | 80.0000 | 80.0000 | 80.0000 | 335152 B | 1.00 |
GetString | .NET 8.0 | EnglishMostlyAscii | 408.62 us | 117.412 us | 135.211 us | 405.00 us | 264.96 us | 668.61 us | 1.55 | 0.52 | 80.5288 | 80.5288 | 80.5288 | 335152 B | 1.00 |
GetByteCount | .NET 5.0 | Chinese | 40.90 us | 0.258 us | 0.215 us | 40.83 us | 40.60 us | 41.33 us | 1.00 | 0.00 | - | - | - | - | NA |
GetByteCount | .NET 7.0 | Chinese | 33.29 us | 0.187 us | 0.166 us | 33.29 us | 33.09 us | 33.64 us | 0.81 | 0.01 | - | - | - | - | NA |
GetByteCount | .NET 8.0 | Chinese | 31.50 us | 0.222 us | 0.197 us | 31.40 us | 31.33 us | 31.91 us | 0.77 | 0.01 | - | - | - | - | NA |
GetBytes | .NET 5.0 | Chinese | 228.18 us | 1.467 us | 1.225 us | 228.20 us | 226.42 us | 229.93 us | 1.00 | 0.00 | 55.2536 | 55.2536 | 55.2536 | 180680 B | 1.00 |
GetBytes | .NET 7.0 | Chinese | 242.76 us | 1.883 us | 1.761 us | 242.37 us | 240.12 us | 246.59 us | 1.06 | 0.01 | 54.8077 | 54.8077 | 54.8077 | 180699 B | 1.00 |
GetBytes | .NET 8.0 | Chinese | 235.66 us | 2.260 us | 1.887 us | 235.10 us | 234.21 us | 240.67 us | 1.03 | 0.01 | 55.0373 | 55.0373 | 55.0373 | 180699 B | 1.00 |
GetString | .NET 5.0 | Chinese | 366.05 us | 2.145 us | 1.901 us | 365.51 us | 363.91 us | 370.45 us | 1.00 | 0.00 | 46.5116 | 46.5116 | 46.5116 | 155960 B | 1.00 |
GetString | .NET 7.0 | Chinese | 384.52 us | 1.771 us | 1.570 us | 383.93 us | 382.36 us | 387.29 us | 1.05 | 0.01 | 47.2561 | 47.2561 | 47.2561 | 155977 B | 1.00 |
GetString | .NET 8.0 | Chinese | 394.77 us | 3.329 us | 3.114 us | 394.23 us | 391.20 us | 401.45 us | 1.08 | 0.01 | 46.8750 | 46.8750 | 46.8750 | 155976 B | 1.00 |
GetByteCount | .NET 5.0 | Cyrillic | 29.29 us | 0.322 us | 0.285 us | 29.15 us | 29.08 us | 29.97 us | 1.00 | 0.00 | - | - | - | - | NA |
GetByteCount | .NET 7.0 | Cyrillic | 27.10 us | 0.191 us | 0.169 us | 27.06 us | 26.88 us | 27.44 us | 0.93 | 0.01 | - | - | - | - | NA |
GetByteCount | .NET 8.0 | Cyrillic | 23.12 us | 0.127 us | 0.113 us | 23.09 us | 22.98 us | 23.30 us | 0.79 | 0.01 | - | - | - | - | NA |
GetBytes | .NET 5.0 | Cyrillic | 176.68 us | 1.383 us | 1.226 us | 176.72 us | 174.90 us | 178.73 us | 1.00 | 0.00 | 31.9767 | 31.9767 | 31.9767 | 102272 B | 1.00 |
GetBytes | .NET 7.0 | Cyrillic | 173.70 us | 1.335 us | 1.115 us | 173.50 us | 172.31 us | 176.47 us | 0.98 | 0.01 | 31.9444 | 31.9444 | 31.9444 | 102283 B | 1.00 |
GetBytes | .NET 8.0 | Cyrillic | 169.95 us | 1.239 us | 1.098 us | 169.83 us | 167.53 us | 172.34 us | 0.96 | 0.01 | 31.9293 | 31.9293 | 31.9293 | 102283 B | 1.00 |
GetString | .NET 5.0 | Cyrillic | 264.45 us | 1.351 us | 1.128 us | 264.30 us | 262.80 us | 266.54 us | 1.00 | 0.00 | 41.3136 | 41.3136 | 41.3136 | 133640 B | 1.00 |
GetString | .NET 7.0 | Cyrillic | 271.45 us | 1.796 us | 1.680 us | 271.21 us | 268.76 us | 274.51 us | 1.03 | 0.01 | 40.9483 | 40.9483 | 40.9483 | 133654 B | 1.00 |
GetString | .NET 8.0 | Cyrillic | 268.52 us | 1.359 us | 1.135 us | 268.45 us | 266.14 us | 270.14 us | 1.02 | 0.01 | 40.9483 | 40.9483 | 40.9483 | 133654 B | 1.00 |
GetByteCount | .NET 5.0 | Greek | 44.41 us | 0.245 us | 0.204 us | 44.34 us | 44.22 us | 44.97 us | 1.00 | 0.00 | - | - | - | - | NA |
GetByteCount | .NET 7.0 | Greek | 36.18 us | 0.175 us | 0.137 us | 36.14 us | 36.01 us | 36.49 us | 0.81 | 0.01 | - | - | - | - | NA |
GetByteCount | .NET 8.0 | Greek | 34.11 us | 0.081 us | 0.072 us | 34.11 us | 34.01 us | 34.25 us | 0.77 | 0.00 | - | - | - | - | NA |
GetBytes | .NET 5.0 | Greek | 272.78 us | 1.707 us | 1.513 us | 272.91 us | 270.64 us | 275.45 us | 1.00 | 0.00 | 40.9483 | 40.9483 | 40.9483 | 131792 B | 1.00 |
GetBytes | .NET 7.0 | Greek | 268.80 us | 2.116 us | 1.979 us | 268.40 us | 265.05 us | 271.83 us | 0.99 | 0.01 | 41.3136 | 41.3136 | 41.3136 | 131807 B | 1.00 |
GetBytes | .NET 8.0 | Greek | 264.43 us | 1.487 us | 1.319 us | 264.35 us | 261.96 us | 266.61 us | 0.97 | 0.01 | 40.9483 | 40.9483 | 40.9483 | 131806 B | 1.00 |
GetString | .NET 5.0 | Greek | 422.14 us | 4.472 us | 4.183 us | 421.95 us | 416.77 us | 429.72 us | 1.00 | 0.00 | 52.3649 | 52.3649 | 52.3649 | 169352 B | 1.00 |
GetString | .NET 7.0 | Greek | 427.54 us | 3.342 us | 2.963 us | 426.73 us | 424.04 us | 433.39 us | 1.01 | 0.01 | 52.3649 | 52.3649 | 52.3649 | 169370 B | 1.00 |
GetString | .NET 8.0 | Greek | 423.67 us | 2.330 us | 2.179 us | 423.65 us | 420.46 us | 428.13 us | 1.00 | 0.01 | 50.9868 | 50.9868 | 50.9868 | 169370 B | 1.00 |
It looks like `GetByteCount` has improved but some of the `GetString` cases have regressed.
The original issue reported a regression compared with .NET 3.1 (@adamsitnik do you remember if that is accurate)? If so, we might need to compare with .NET 3.1. At that time we only had Linux arm64, though, so you will have to test it on a Linux arm64 box.
> (@adamsitnik do you remember if that is accurate)?
I don't remember the details, but looking at my old description of the issue I am sure that you are right, it was a 3.1 vs 5.0 regression found on Ubuntu machines (the ones owned by the JIT Team, as back then I had no access to any other arm machines).
This was on a linux ARM64 box.
.NET 3.1 results:
Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen0 | Gen1 | Gen2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
GetByteCount | EnglishAllAscii | 34.04 us | 0.067 us | 0.062 us | 34.04 us | 33.94 us | 34.16 us | - | - | - | - |
GetBytes | EnglishAllAscii | 82.25 us | 0.189 us | 0.168 us | 82.25 us | 82.07 us | 82.65 us | 49.7382 | 49.7382 | 49.7382 | 163840 B |
GetString | EnglishAllAscii | 71.68 us | 0.313 us | 0.261 us | 71.72 us | 71.35 us | 72.08 us | 99.7159 | 99.7159 | 99.7159 | 327648 B |
GetByteCount | EnglishMostlyAscii | 93.30 us | 0.840 us | 0.785 us | 93.60 us | 91.36 us | 93.65 us | - | - | - | - |
GetBytes | EnglishMostlyAscii | 237.55 us | 0.298 us | 0.279 us | 237.53 us | 237.14 us | 237.99 us | 51.7857 | 51.7857 | 51.7857 | 169880 B |
GetString | EnglishMostlyAscii | 217.55 us | 0.410 us | 0.384 us | 217.46 us | 216.93 us | 218.19 us | 99.8264 | 99.8264 | 99.8264 | 327656 B |
GetByteCount | Chinese | 41.65 us | 0.364 us | 0.340 us | 41.42 us | 41.41 us | 42.13 us | - | - | - | - |
GetBytes | Chinese | 199.19 us | 0.366 us | 0.306 us | 199.07 us | 198.84 us | 199.82 us | 55.3797 | 55.3797 | 55.3797 | 177752 B |
GetString | Chinese | 325.68 us | 0.185 us | 0.173 us | 325.67 us | 325.31 us | 326.04 us | 46.8750 | 46.8750 | 46.8750 | 150112 B |
GetByteCount | Cyrillic | 36.38 us | 0.005 us | 0.004 us | 36.38 us | 36.37 us | 36.39 us | - | - | - | - |
GetBytes | Cyrillic | 163.33 us | 0.059 us | 0.046 us | 163.33 us | 163.25 us | 163.43 us | 30.6122 | 30.6122 | 30.6122 | 100880 B |
GetString | Cyrillic | 225.84 us | 0.401 us | 0.313 us | 225.91 us | 225.26 us | 226.32 us | 39.8551 | 39.8551 | 39.8551 | 130856 B |
GetByteCount | Greek | 46.46 us | 0.005 us | 0.005 us | 46.46 us | 46.46 us | 46.47 us | - | - | - | - |
GetBytes | Greek | 240.81 us | 0.787 us | 0.736 us | 240.88 us | 239.89 us | 242.25 us | 39.4231 | 39.4231 | 39.4231 | 129248 B |
GetString | Greek | 346.43 us | 0.396 us | 0.331 us | 346.44 us | 345.90 us | 347.13 us | 48.6111 | 48.6111 | 48.6111 | 164264 B |
.NET 7 results:
Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen0 | Gen1 | Gen2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
GetByteCount | EnglishAllAscii | 10.60 us | 0.002 us | 0.002 us | 10.60 us | 10.59 us | 10.60 us | - | - | - | - |
GetBytes | EnglishAllAscii | 44.57 us | 0.058 us | 0.054 us | 44.58 us | 44.47 us | 44.65 us | 49.9302 | 49.9302 | 49.9302 | 163874 B |
GetString | EnglishAllAscii | 59.10 us | 0.193 us | 0.181 us | 59.02 us | 58.86 us | 59.48 us | 99.8134 | 99.8134 | 99.8134 | 327715 B |
GetByteCount | EnglishMostlyAscii | 73.94 us | 0.096 us | 0.090 us | 73.90 us | 73.81 us | 74.12 us | - | - | - | - |
GetBytes | EnglishMostlyAscii | 179.52 us | 0.194 us | 0.172 us | 179.52 us | 179.20 us | 179.80 us | 52.5568 | 52.5568 | 52.5568 | 169916 B |
GetString | EnglishMostlyAscii | 184.24 us | 0.204 us | 0.159 us | 184.25 us | 183.97 us | 184.44 us | 99.2647 | 99.2647 | 99.2647 | 327723 B |
GetByteCount | Chinese | 34.35 us | 0.035 us | 0.033 us | 34.33 us | 34.32 us | 34.40 us | - | - | - | - |
GetBytes | Chinese | 192.66 us | 0.160 us | 0.134 us | 192.66 us | 192.45 us | 192.88 us | 54.8780 | 54.8780 | 54.8780 | 177790 B |
GetString | Chinese | 309.72 us | 0.156 us | 0.139 us | 309.71 us | 309.50 us | 309.94 us | 46.5686 | 46.5686 | 46.5686 | 150144 B |
GetByteCount | Cyrillic | 21.09 us | 0.033 us | 0.027 us | 21.08 us | 21.08 us | 21.17 us | - | - | - | - |
GetBytes | Cyrillic | 132.44 us | 0.189 us | 0.168 us | 132.41 us | 132.24 us | 132.79 us | 30.9874 | 30.9874 | 30.9874 | 100901 B |
GetString | Cyrillic | 215.88 us | 0.380 us | 0.337 us | 215.85 us | 215.41 us | 216.58 us | 39.9306 | 39.9306 | 39.9306 | 130884 B |
GetByteCount | Greek | 38.45 us | 0.292 us | 0.273 us | 38.60 us | 37.94 us | 38.70 us | - | - | - | - |
GetBytes | Greek | 212.63 us | 0.644 us | 0.602 us | 212.42 us | 211.87 us | 214.07 us | 39.3836 | 39.3836 | 39.3836 | 129275 B |
GetString | Greek | 331.28 us | 0.825 us | 0.731 us | 331.00 us | 330.44 us | 333.17 us | 49.2021 | 49.2021 | 49.2021 | 164298 B |
.NET 7 is an all-up improvement over .NET 3.1 results. cc @kunalspathak
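As a quick sanity check of that statement, the mean times from the .NET 3.1 and .NET 7 tables above can be turned into ratios. The snippet below is just illustrative arithmetic on the posted numbers; the variable names are arbitrary, and a ratio below 1.00 means .NET 7 is faster for that row.

```python
# Mean times in microseconds, copied from the two tables above:
# (Method, Input) -> (.NET 3.1 mean, .NET 7 mean)
means = {
    ("GetByteCount", "EnglishAllAscii"): (34.04, 10.60),
    ("GetBytes", "EnglishAllAscii"): (82.25, 44.57),
    ("GetString", "EnglishAllAscii"): (71.68, 59.10),
    ("GetByteCount", "EnglishMostlyAscii"): (93.30, 73.94),
    ("GetBytes", "EnglishMostlyAscii"): (237.55, 179.52),
    ("GetString", "EnglishMostlyAscii"): (217.55, 184.24),
    ("GetByteCount", "Chinese"): (41.65, 34.35),
    ("GetBytes", "Chinese"): (199.19, 192.66),
    ("GetString", "Chinese"): (325.68, 309.72),
    ("GetByteCount", "Cyrillic"): (36.38, 21.09),
    ("GetBytes", "Cyrillic"): (163.33, 132.44),
    ("GetString", "Cyrillic"): (225.84, 215.88),
    ("GetByteCount", "Greek"): (46.46, 38.45),
    ("GetBytes", "Greek"): (240.81, 212.63),
    ("GetString", "Greek"): (346.43, 331.28),
}

# Ratio of .NET 7 to .NET 3.1 for every benchmark row.
ratios = {key: net7 / net31 for key, (net31, net7) in means.items()}

# Print the rows from largest improvement to smallest.
for (method, text), ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{method:<12} {text:<18} {ratio:.2f}")
```

Every ratio comes out below 1.00, with `GetByteCount` on `EnglishAllAscii` showing the largest gain and the `GetString` rows the smallest.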
Closing as these are not regressions anymore.
Thanks @tihan for checking this.
After running benchmarks for 3.1 vs 5.0 using "Ubuntu arm64 Qualcomm Machines" owned by the JIT Team, I've found a few regressions related to `Utf8Encoding`. They are all reproducible, and I've verified that it's not a matter of loop alignment (by running them with `--envVars COMPlus_JitAlignLoops:1`).
It looks like an ARM64-specific regression; I was not able to reproduce it for ARM (the 32-bit variant).
Repro
BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 16.04
Unknown processor
[Host] : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
Job-VTSQOV : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
Job-RAMSQZ : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT
Docs
Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository
cc @kunalspathak @carlossanlop @pgovind @tannergooding