dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/

[ARM64] Performance regression: Utf8Encoding #41699

Closed. adamsitnik closed this issue 1 year ago

adamsitnik commented 4 years ago

After running benchmarks for 3.1 vs 5.0 using "Ubuntu arm64 Qualcomm Machines" owned by the JIT Team, I've found a few regressions related to Utf8Encoding. They are all reproducible and I've verified that it's not a matter of loop alignment (by running them with --envVars COMPlus_JitAlignLoops:1).

It looks like an ARM64-specific regression; I was not able to reproduce it on ARM (the 32-bit variant).

Repro

git clone https://github.com/dotnet/performance.git
python3 ./performance/scripts/benchmarks_ci.py -f netcoreapp3.1 netcoreapp5.0 --architecture arm64 --filter Perf_Utf8Encoding

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 16.04
Unknown processor
  [Host]     : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
  Job-VTSQOV : .NET Core 3.1.8 (CoreCLR 4.700.20.41105, CoreFX 4.700.20.41903), Arm64 RyuJIT
  Job-RAMSQZ : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT

| Method | Runtime | Input | Mean | Ratio | Allocated |
|---|---|---|---|---|---|
| GetByteCount | .NET Core 3.1 | EnglishAllAscii | 38.00 us | 1.00 | - |
| GetByteCount | .NET Core 5.0 | EnglishAllAscii | 40.66 us | 1.07 | - |
| GetBytes | .NET Core 3.1 | EnglishAllAscii | 101.09 us | 1.00 | 163840 B |
| GetBytes | .NET Core 5.0 | EnglishAllAscii | 104.96 us | 1.04 | 163855 B |
| GetString | .NET Core 3.1 | EnglishAllAscii | 103.47 us | 1.00 | 327648 B |
| GetString | .NET Core 5.0 | EnglishAllAscii | 95.76 us | 0.93 | 327677 B |
| GetByteCount | .NET Core 3.1 | EnglishMostlyAscii | 117.50 us | 1.00 | - |
| GetByteCount | .NET Core 5.0 | EnglishMostlyAscii | 221.40 us | 1.88 | - |
| GetBytes | .NET Core 3.1 | EnglishMostlyAscii | 273.49 us | 1.00 | 169880 B |
| GetBytes | .NET Core 5.0 | EnglishMostlyAscii | 377.67 us | 1.38 | 169895 B |
| GetString | .NET Core 3.1 | EnglishMostlyAscii | 262.55 us | 1.00 | 327656 B |
| GetString | .NET Core 5.0 | EnglishMostlyAscii | 250.18 us | 0.95 | 327685 B |
| GetByteCount | .NET Core 3.1 | Chinese | 53.34 us | 1.00 | - |
| GetByteCount | .NET Core 5.0 | Chinese | 90.21 us | 1.69 | - |
| GetBytes | .NET Core 3.1 | Chinese | 245.94 us | 1.00 | 177752 B |
| GetBytes | .NET Core 5.0 | Chinese | 279.62 us | 1.14 | 177768 B |
| GetString | .NET Core 3.1 | Chinese | 373.80 us | 1.00 | 150112 B |
| GetString | .NET Core 5.0 | Chinese | 358.11 us | 0.96 | 150126 B |
| GetByteCount | .NET Core 3.1 | Cyrillic | 45.35 us | 1.00 | - |
| GetByteCount | .NET Core 5.0 | Cyrillic | 76.01 us | 1.68 | - |
| GetBytes | .NET Core 3.1 | Cyrillic | 193.34 us | 1.00 | 100880 B |
| GetBytes | .NET Core 5.0 | Cyrillic | 222.10 us | 1.15 | 100889 B |
| GetString | .NET Core 3.1 | Cyrillic | 262.69 us | 1.00 | 130856 B |
| GetString | .NET Core 5.0 | Cyrillic | 259.83 us | 0.99 | 130868 B |
| GetByteCount | .NET Core 3.1 | Greek | 58.36 us | 1.00 | - |
| GetByteCount | .NET Core 5.0 | Greek | 97.41 us | 1.67 | - |
| GetBytes | .NET Core 3.1 | Greek | 275.88 us | 1.00 | 129248 B |
| GetBytes | .NET Core 5.0 | Greek | 314.00 us | 1.14 | 129260 B |
| GetString | .NET Core 3.1 | Greek | 394.55 us | 1.00 | 164264 B |
| GetString | .NET Core 5.0 | Greek | 394.35 us | 1.00 | 164278 B |

Docs

Profiling workflow for dotnet/runtime repository
Benchmarking workflow for dotnet/runtime repository

cc @kunalspathak @carlossanlop @pgovind @tannergooding

ghost commented 4 years ago

Tagging subscribers to this area: @tarekgh, @krwq See info in area-owners.md if you want to be subscribed.

tarekgh commented 4 years ago

I have changed the tag to jit for now, as this is most likely not the UTF8Encoding code itself. If proven otherwise, please re-tag it with the encoding label.

kunalspathak commented 4 years ago

@jeffhandley , @BruceForstall , @JulieLeeMSFT

pgovind commented 4 years ago

Just adding a note here: Despite what the name suggests, the 4 regressions listed here are likely from methods in UTF8Utility and/or ASCIIUtility. The GetString benchmark doesn't seem to show any improvements, but it's not straightforward to reverse the changes that this benchmark hits because the UTF8Utility and ASCIIUtility methods are highly coupled and they do show decent speedup in the other benchmarks.

JulieLeeMSFT commented 4 years ago

@echesakovMSFT please look into this.

CC @AndyAyersMS

echesakov commented 4 years ago

For the following simple repro

robox@DDARM64S-003:~/echesako/Runtime_41699$ cat Program.cs
using System.IO;
using System.Runtime.CompilerServices;
using System.Text;

namespace Runtime_41699
{
    public class Program
    {
        public static void Main()
        {
            string unicode;
            byte[] bytes;
            UTF8Encoding utf8Encoding;

            unicode = File.ReadAllText("/home/robox/echesako/Runtime_41699/EnglishMostlyAscii.txt");
            utf8Encoding = new UTF8Encoding();
            bytes = utf8Encoding.GetBytes(unicode);

            while (true)
            {
                 Consume(utf8Encoding.GetByteCount(unicode));
            }
        }

//      public int GetByteCount() => _utf8Encoding.GetByteCount(_unicode);
//      public byte[] GetBytes() => _utf8Encoding.GetBytes(_unicode);
//      public string GetString() => _utf8Encoding.GetString(_bytes);

        [MethodImpl(MethodImplOptions.NoInlining)]
        private static void Consume<T>(in T _) { }
    }
}

I am seeing roughly 4 times as many cache-misses in net5.0 and about twice as many stalled-cycles-backend. The following counter stat collections cover the loop only.

robox@DDARM64S-003:~/echesako/Runtime_41699$ cat netcoreapp3.1-Runtime_41699.txt
# started on Fri Sep  4 11:37:51 2020

 Performance counter stats for process id '35473':

           351,895      branch-misses
         5,009,414      cache-misses
    77,972,050,461      cpu-cycles
   157,542,185,470      instructions              #    2.02  insn per cycle
                                                  #    0.03  stalled cycles per insn
        66,244,057      stalled-cycles-frontend   #    0.08% frontend cycles idle
     4,396,095,861      stalled-cycles-backend    #    5.64% backend cycles idle

      30.004692431 seconds time elapsed

robox@DDARM64S-003:~/echesako/Runtime_41699$ cat net5.0-Runtime_41699.txt
# started on Fri Sep  4 11:39:04 2020

 Performance counter stats for process id '35507':

           270,958      branch-misses
        21,878,633      cache-misses
    77,971,943,239      cpu-cycles
    98,498,800,981      instructions              #    1.26  insn per cycle
                                                  #    0.10  stalled cycles per insn
        58,217,904      stalled-cycles-frontend   #    0.07% frontend cycles idle
     9,625,981,589      stalled-cycles-backend    #   12.35% backend cycles idle

      30.005090846 seconds time elapsed
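
The ratios quoted above can be rechecked directly from the two perf stat outputs. The following is a minimal sketch that only redoes the arithmetic; the class and member names are hypothetical and the counter values are copied from the outputs above.

```csharp
using System;

public static class PerfRatios
{
    // Counter values copied from the netcoreapp3.1 and net5.0 perf stat outputs.
    private const long Instructions31 = 157_542_185_470;
    private const long Cycles31 = 77_972_050_461;
    private const long CacheMisses31 = 5_009_414;
    private const long BackendStalls31 = 4_396_095_861;

    private const long Instructions50 = 98_498_800_981;
    private const long Cycles50 = 77_971_943_239;
    private const long CacheMisses50 = 21_878_633;
    private const long BackendStalls50 = 9_625_981_589;

    // Plain floating-point quotient of two counters.
    public static double Ratio(long numerator, long denominator) =>
        (double)numerator / denominator;

    public static void Main()
    {
        // Instructions per cycle for each runtime.
        Console.WriteLine($"IPC 3.1: {Ratio(Instructions31, Cycles31):F2}");
        Console.WriteLine($"IPC 5.0: {Ratio(Instructions50, Cycles50):F2}");

        // How much the 5.0 counters grew relative to 3.1.
        Console.WriteLine($"cache-misses 5.0/3.1: {Ratio(CacheMisses50, CacheMisses31):F2}");
        Console.WriteLine($"stalled-cycles-backend 5.0/3.1: {Ratio(BackendStalls50, BackendStalls31):F2}");
    }
}
```

Both runs cover the same 30 seconds and almost the same number of cycles, so the lower instruction count on 5.0 means fewer loop iterations completed in that window, not less work per iteration.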

echesakov commented 4 years ago

Combining PopCount with GetNonAsciiBytes in Utf16Utility.GetPointerToFirstInvalidChar

diff --git a/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs b/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
index f2df0ccdf53..73a1ea29bec 100644
--- a/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
+++ b/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
@@ -146,16 +146,18 @@ static Utf16Utility()

                         Vector128<ushort> charIsThreeByteUtf8Encoded;
                         uint mask;
+                        uint popcnt;

                         if (AdvSimd.IsSupported)
                         {
                             charIsThreeByteUtf8Encoded = AdvSimd.Subtract(vectorZero, AdvSimd.ShiftRightLogical(utf16Data, 11));
-                            mask = GetNonAsciiBytes(AdvSimd.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte(), bitMask128);
+                            popcnt = GetNonAsciiBytesAndPopCount(AdvSimd.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte(), bitMask128);
                         }
                         else
                         {
                             charIsThreeByteUtf8Encoded = Sse2.Subtract(vectorZero, Sse2.ShiftRightLogical(utf16Data, 11));
                             mask = (uint)Sse2.MoveMask(Sse2.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte());
+                            popcnt = (uint)BitOperations.PopCount(mask);
                         }

                         // Each even bit of mask will be 1 only if the char was >= 0x0080,
@@ -182,8 +184,6 @@ static Utf16Utility()
                         // unpaired surrogates in our data. (Unpaired surrogates would invalidate
                         // our computed result and we'd have to throw it away.)

-                        uint popcnt = (uint)BitOperations.PopCount(mask);
-
                         // Surrogates need to be special-cased for two reasons: (a) we need
                         // to account for the fact that we over-counted in the addition above;
                         // and (b) they require separate validation.
@@ -485,6 +485,22 @@ static Utf16Utility()
             return pInputBuffer;
         }

+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        private static uint GetNonAsciiBytesAndPopCount(Vector128<byte> value, Vector128<byte> bitMask128)
+        {
+            Debug.Assert(AdvSimd.Arm64.IsSupported);
+
+            Vector128<byte> mostSignificantBitIsSet = AdvSimd.ShiftRightArithmetic(value.AsSByte(), 7).AsByte();
+            Vector128<byte> extractedBits = AdvSimd.And(mostSignificantBitIsSet, bitMask128);
+
+            // self-pairwise add until all flags have moved to the first two bytes of the vector
+            extractedBits = AdvSimd.Arm64.AddPairwise(extractedBits, extractedBits);
+            extractedBits = AdvSimd.Arm64.AddPairwise(extractedBits, extractedBits);
+            extractedBits = AdvSimd.Arm64.AddPairwise(extractedBits, extractedBits);
+            Vector128<byte> popcnt = AdvSimd.PopCount(extractedBits);
+            return AdvSimd.Arm64.AddPairwise(popcnt, popcnt).ToScalar();
+        }
+
         [MethodImpl(MethodImplOptions.AggressiveInlining)]
         private static uint GetNonAsciiBytes(Vector128<byte> value, Vector128<byte> bitMask128)
         {

seems to help a little bit with stalled-cycles-backend

robox@DDARM64S-003:~/echesako/Runtime_41699$ perf stat -e "branch-misses,cache-misses,cpu-cycles,instructions,stalled-cycles-frontend,stalled-cycles-backend" -p 35949 sleep 30

 Performance counter stats for process id '35949':

           287,604      branch-misses
        24,786,695      cache-misses
    77,971,487,139      cpu-cycles
    95,056,853,183      instructions              #    1.22  insn per cycle
                                                  #    0.07  stalled cycles per insn
        58,614,087      stalled-cycles-frontend   #    0.08% frontend cycles idle
     6,650,628,910      stalled-cycles-backend    #    8.53% backend cycles idle

      30.005114026 seconds time elapsed

This avoids moving the mask back and forth between the SIMD and general-purpose register files.
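
To make the equivalence concrete, here is a scalar sketch (not the intrinsics code from the diff) of the two ways to get the same number for a 16-byte block: build a movemask-style bit mask and pop-count it, or count the bytes with the top bit set directly. All names here are hypothetical; the real code works on Vector128<byte> with AdvSimd instructions.

```csharp
using System;
using System.Numerics;

public static class NonAsciiMaskDemo
{
    // Scalar stand-in for a movemask: bit i is set when byte i has its most significant bit set.
    public static uint MoveMask(ReadOnlySpan<byte> block)
    {
        uint mask = 0;
        for (int i = 0; i < 16; i++)
        {
            mask |= (uint)(block[i] >> 7) << i;
        }
        return mask;
    }

    // Counts the bytes whose most significant bit is set, without building a mask first.
    public static uint CountTopBits(ReadOnlySpan<byte> block)
    {
        uint count = 0;
        for (int i = 0; i < 16; i++)
        {
            count += (uint)(block[i] >> 7);
        }
        return count;
    }

    public static void Main()
    {
        byte[] block =
        {
            0x00, 0xFF, 0x7F, 0x80, 0x41, 0xC3, 0xA9, 0x20,
            0xFF, 0xFF, 0x00, 0x00, 0x10, 0x90, 0x7E, 0x81,
        };

        uint viaMask = (uint)BitOperations.PopCount(MoveMask(block));
        uint direct = CountTopBits(block);

        // Both routes yield the same count; only the intermediate mask differs.
        Console.WriteLine($"popcount(mask) = {viaMask}, direct count = {direct}");
    }
}
```

In the vectorized version the first route has to materialize the mask in a general-purpose register before the scalar PopCount, which is the round trip the diff above removes.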

echesakov commented 4 years ago

The measurements below were taken on

processor       : 0
model name      : ARMv8 Processor rev 1 (v8l)
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd07
CPU revision    : 1

processor       : 1
model name      : ARMv8 Processor rev 1 (v8l)
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd07
CPU revision    : 1

processor       : 2
model name      : ARMv8 Processor rev 1 (v8l)
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd07
CPU revision    : 1

processor       : 3
model name      : ARMv8 Processor rev 1 (v8l)
BogoMIPS        : 38.40
Features        : fp asimd evtstrm aes pmull sha1 sha2 crc32
CPU implementer : 0x41
CPU architecture: 8
CPU variant     : 0x1
CPU part        : 0xd07
CPU revision    : 1

.NET Core 3.1.6


BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
  [Host]     : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), Arm64 RyuJIT
  Job-VJGWPE : .NET Core 3.1.6 (CoreCLR 4.700.20.26901, CoreFX 4.700.20.31603), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Runtime=.NET Core 3.1  Arguments=/p:DebugType=portable
Toolchain=netcoreapp3.1  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1
| Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GetByteCount | EnglishAllAscii | 118.9 μs | 0.24 μs | 0.21 μs | 118.9 μs | 118.6 μs | 119.4 μs | - | - | - | - |
| GetBytes | EnglishAllAscii | 268.4 μs | 2.14 μs | 1.78 μs | 267.6 μs | 266.4 μs | 272.4 μs | 49.7881 | 49.7881 | 49.7881 | 163840 B |
| GetString | EnglishAllAscii | 322.4 μs | 1.78 μs | 1.49 μs | 321.7 μs | 321.5 μs | 326.5 μs | 99.4898 | 99.4898 | 99.4898 | 327648 B |
| GetByteCount | EnglishMostlyAscii | 328.3 μs | 0.43 μs | 0.39 μs | 328.1 μs | 328.0 μs | 329.3 μs | - | - | - | - |
| GetBytes | EnglishMostlyAscii | 680.3 μs | 0.80 μs | 0.62 μs | 680.3 μs | 679.3 μs | 681.5 μs | 51.6304 | 51.6304 | 51.6304 | 169880 B |
| GetString | EnglishMostlyAscii | 591.1 μs | 2.17 μs | 1.92 μs | 590.2 μs | 588.7 μs | 594.6 μs | 99.5370 | 99.5370 | 99.5370 | 327656 B |
| GetByteCount | Chinese | 149.9 μs | 0.29 μs | 0.26 μs | 149.9 μs | 149.7 μs | 150.5 μs | - | - | - | - |
| GetBytes | Chinese | 646.8 μs | 1.21 μs | 0.95 μs | 647.0 μs | 645.1 μs | 647.9 μs | 55.0000 | 55.0000 | 55.0000 | 177752 B |
| GetString | Chinese | 943.7 μs | 2.97 μs | 2.63 μs | 943.3 μs | 940.9 μs | 950.1 μs | 44.1176 | 44.1176 | 44.1176 | 150112 B |
| GetByteCount | Cyrillic | 130.3 μs | 0.21 μs | 0.19 μs | 130.4 μs | 130.0 μs | 130.7 μs | - | - | - | - |
| GetBytes | Cyrillic | 487.0 μs | 1.30 μs | 1.08 μs | 486.8 μs | 485.2 μs | 489.2 μs | 29.2969 | 29.2969 | 29.2969 | 100880 B |
| GetString | Cyrillic | 648.8 μs | 1.74 μs | 1.45 μs | 649.5 μs | 646.6 μs | 650.9 μs | 39.0625 | 39.0625 | 39.0625 | 130856 B |
| GetByteCount | Greek | 163.7 μs | 0.08 μs | 0.07 μs | 163.7 μs | 163.6 μs | 163.8 μs | - | - | - | - |
| GetBytes | Greek | 723.1 μs | 3.83 μs | 3.58 μs | 721.8 μs | 718.7 μs | 728.8 μs | 39.7727 | 39.7727 | 39.7727 | 129248 B |
| GetString | Greek | 968.3 μs | 8.42 μs | 7.47 μs | 965.3 μs | 960.8 μs | 980.7 μs | 47.7941 | 47.7941 | 47.7941 | 164264 B |

.NET Core 5.0.0


BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT
  Job-OUMKUS : .NET Core 5.0.0 (CoreCLR 5.0.20.41714, CoreFX 5.0.20.41714), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Runtime=.NET Core 5.0  Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1
| Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GetByteCount | EnglishAllAscii | 109.6 μs | 0.51 μs | 0.47 μs | 109.3 μs | 109.3 μs | 110.6 μs | - | - | - | - |
| GetBytes | EnglishAllAscii | 261.1 μs | 2.71 μs | 2.53 μs | 259.9 μs | 258.7 μs | 266.3 μs | 48.9583 | 48.9583 | 48.9583 | 163854 B |
| GetString | EnglishAllAscii | 205.6 μs | 0.58 μs | 0.46 μs | 205.6 μs | 204.8 μs | 206.3 μs | 99.5066 | 99.5066 | 99.5066 | 327677 B |
| GetByteCount | EnglishMostlyAscii | 565.9 μs | 0.93 μs | 0.78 μs | 565.9 μs | 564.2 μs | 567.2 μs | - | - | - | 1 B |
| GetBytes | EnglishMostlyAscii | 912.2 μs | 1.65 μs | 1.29 μs | 912.0 μs | 910.3 μs | 914.8 μs | 52.0833 | 52.0833 | 52.0833 | 169896 B |
| GetString | EnglishMostlyAscii | 574.2 μs | 1.69 μs | 1.32 μs | 573.6 μs | 572.5 μs | 576.9 μs | 99.5370 | 99.5370 | 99.5370 | 327685 B |
| GetByteCount | Chinese | 258.3 μs | 0.83 μs | 0.77 μs | 257.9 μs | 257.5 μs | 259.8 μs | - | - | - | - |
| GetBytes | Chinese | 749.3 μs | 3.55 μs | 3.14 μs | 747.9 μs | 746.7 μs | 756.1 μs | 53.5714 | 53.5714 | 53.5714 | 177768 B |
| GetString | Chinese | 896.2 μs | 9.65 μs | 9.03 μs | 891.6 μs | 889.5 μs | 914.0 μs | 45.1389 | 45.1389 | 45.1389 | 150126 B |
| GetByteCount | Cyrillic | 223.8 μs | 0.46 μs | 0.39 μs | 223.6 μs | 223.4 μs | 224.8 μs | - | - | - | - |
| GetBytes | Cyrillic | 592.1 μs | 4.66 μs | 4.13 μs | 590.1 μs | 588.8 μs | 602.4 μs | 30.0926 | 30.0926 | 30.0926 | 100889 B |
| GetString | Cyrillic | 630.9 μs | 1.32 μs | 1.17 μs | 630.6 μs | 629.6 μs | 633.7 μs | 37.5000 | 37.5000 | 37.5000 | 130868 B |
| GetByteCount | Greek | 281.7 μs | 0.35 μs | 0.31 μs | 281.8 μs | 281.2 μs | 282.3 μs | - | - | - | - |
| GetBytes | Greek | 844.0 μs | 2.50 μs | 2.09 μs | 843.5 μs | 841.8 μs | 848.9 μs | 39.4737 | 39.4737 | 39.4737 | 129260 B |
| GetString | Greek | 951.6 μs | 10.59 μs | 9.91 μs | 949.3 μs | 941.3 μs | 971.0 μs | 47.7941 | 47.7941 | 47.7941 | 164279 B |

.NET Core 5.0.0 (with the suggested change to Utf16Utility.GetPointerToFirstInvalidChar)


BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
  Job-XEOYJB : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Runtime=.NET Core 5.0  Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1
| Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GetByteCount | EnglishAllAscii | 109.2 μs | 0.46 μs | 0.41 μs | 109.0 μs | 108.8 μs | 110.0 μs | - | - | - | - |
| GetBytes | EnglishAllAscii | 260.3 μs | 2.08 μs | 1.74 μs | 259.7 μs | 259.0 μs | 265.2 μs | 48.9583 | 48.9583 | 48.9583 | 163854 B |
| GetString | EnglishAllAscii | 207.3 μs | 1.23 μs | 1.03 μs | 206.9 μs | 206.3 μs | 209.7 μs | 99.5066 | 99.5066 | 99.5066 | 327677 B |
| GetByteCount | EnglishMostlyAscii | 414.5 μs | 0.49 μs | 0.41 μs | 414.3 μs | 414.3 μs | 415.6 μs | - | - | - | - |
| GetBytes | EnglishMostlyAscii | 753.2 μs | 3.51 μs | 2.93 μs | 752.4 μs | 750.2 μs | 759.9 μs | 50.5952 | 50.5952 | 50.5952 | 169895 B |
| GetString | EnglishMostlyAscii | 574.0 μs | 8.49 μs | 7.94 μs | 570.8 μs | 566.9 μs | 590.0 μs | 98.2143 | 98.2143 | 98.2143 | 327685 B |
| GetByteCount | Chinese | 189.1 μs | 0.16 μs | 0.14 μs | 189.1 μs | 189.0 μs | 189.5 μs | - | - | - | - |
| GetBytes | Chinese | 675.3 μs | 1.07 μs | 0.89 μs | 675.1 μs | 673.9 μs | 677.0 μs | 54.3478 | 54.3478 | 54.3478 | 177768 B |
| GetString | Chinese | 895.7 μs | 6.66 μs | 6.23 μs | 892.0 μs | 889.4 μs | 904.7 μs | 45.1389 | 45.1389 | 45.1389 | 150126 B |
| GetByteCount | Cyrillic | 164.1 μs | 0.11 μs | 0.09 μs | 164.1 μs | 164.0 μs | 164.3 μs | - | - | - | - |
| GetBytes | Cyrillic | 527.7 μs | 2.71 μs | 2.40 μs | 527.0 μs | 525.4 μs | 533.3 μs | 29.1667 | 29.1667 | 29.1667 | 100889 B |
| GetString | Cyrillic | 624.7 μs | 2.56 μs | 2.00 μs | 624.6 μs | 622.2 μs | 630.4 μs | 38.4615 | 38.4615 | 38.4615 | 130868 B |
| GetByteCount | Greek | 206.9 μs | 0.44 μs | 0.39 μs | 206.7 μs | 206.6 μs | 208.0 μs | - | - | - | - |
| GetBytes | Greek | 764.2 μs | 3.49 μs | 3.27 μs | 762.9 μs | 760.4 μs | 769.6 μs | 38.6905 | 38.6905 | 38.6905 | 129260 B |
| GetString | Greek | 962.2 μs | 4.29 μs | 3.59 μs | 961.0 μs | 958.4 μs | 970.9 μs | 46.8750 | 46.8750 | 46.8750 | 164279 B |

The GetByteCount data makes it clear that the stalled cycles caused by PopCount are only one of potentially many causes of this performance regression. We need a thorough analysis to find them all.

I am moving this to .NET 6.0. I don't believe this is a JIT issue, so I am relabeling this back to area-System.Text.Encoding.

cc @JulieLeeMSFT @jeffhandley

ghost commented 4 years ago

Tagging subscribers to this area: @tarekgh, @krwq See info in area-owners.md if you want to be subscribed.

JulieLeeMSFT commented 4 years ago

@jeffhandley I am assigning this to you now.

tarekgh commented 4 years ago

Thanks @echesakovMSFT for your analysis.

CC @GrabYourPitchforks

echesakov commented 4 years ago

One more observation: if I remove the code under if ((AdvSimd.Arm64.IsSupported && BitConverter.IsLittleEndian) || Sse2.IsSupported) in Utf16Utility.GetPointerToFirstInvalidChar, keep the code under else if (Vector.IsHardwareAccelerated), and change that condition to if (Vector.IsHardwareAccelerated) (which I presume is true on Arm64), I get the following results.

(AdvSimd.Arm64.IsSupported && BitConverter.IsLittleEndian)


BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
  Job-IXQCIC : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Runtime=.NET Core 5.0  Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1
| Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GetByteCount | EnglishAllAscii | 109.7 μs | 0.65 μs | 0.60 μs | 109.5 μs | 109.1 μs | 111.0 μs | - | - | - | - |
| GetBytes | EnglishAllAscii | 263.1 μs | 4.69 μs | 4.39 μs | 260.7 μs | 259.0 μs | 273.0 μs | 48.9583 | 48.9583 | 48.9583 | 163854 B |
| GetString | EnglishAllAscii | 206.7 μs | 0.43 μs | 0.36 μs | 206.6 μs | 206.2 μs | 207.5 μs | 99.5066 | 99.5066 | 99.5066 | 327677 B |
| GetByteCount | EnglishMostlyAscii | 566.1 μs | 2.25 μs | 2.11 μs | 564.5 μs | 563.8 μs | 568.9 μs | - | - | - | 1 B |
| GetBytes | EnglishMostlyAscii | 909.5 μs | 2.34 μs | 1.96 μs | 909.3 μs | 907.1 μs | 914.0 μs | 52.0833 | 52.0833 | 52.0833 | 169896 B |
| GetString | EnglishMostlyAscii | 573.2 μs | 1.73 μs | 1.35 μs | 572.8 μs | 572.1 μs | 577.0 μs | 99.5370 | 99.5370 | 99.5370 | 327685 B |
| GetByteCount | Chinese | 257.7 μs | 0.37 μs | 0.33 μs | 257.7 μs | 257.0 μs | 258.1 μs | - | - | - | - |
| GetBytes | Chinese | 749.4 μs | 3.56 μs | 2.97 μs | 748.3 μs | 746.2 μs | 756.4 μs | 53.5714 | 53.5714 | 53.5714 | 177768 B |
| GetString | Chinese | 899.9 μs | 7.84 μs | 7.33 μs | 895.3 μs | 893.2 μs | 913.5 μs | 45.1389 | 45.1389 | 45.1389 | 150126 B |
| GetByteCount | Cyrillic | 223.9 μs | 0.20 μs | 0.18 μs | 223.9 μs | 223.5 μs | 224.0 μs | - | - | - | - |
| GetBytes | Cyrillic | 593.4 μs | 3.35 μs | 2.97 μs | 592.4 μs | 590.1 μs | 598.7 μs | 30.0926 | 30.0926 | 30.0926 | 100889 B |
| GetString | Cyrillic | 630.6 μs | 1.14 μs | 0.95 μs | 630.8 μs | 628.6 μs | 631.7 μs | 37.5000 | 37.5000 | 37.5000 | 130868 B |
| GetByteCount | Greek | 280.7 μs | 0.95 μs | 0.84 μs | 280.3 μs | 279.6 μs | 282.1 μs | - | - | - | - |
| GetBytes | Greek | 844.0 μs | 2.82 μs | 2.35 μs | 843.8 μs | 840.8 μs | 849.3 μs | 39.4737 | 39.4737 | 39.4737 | 129260 B |
| GetString | Greek | 963.8 μs | 1.75 μs | 1.46 μs | 964.2 μs | 960.2 μs | 965.7 μs | 47.7941 | 47.7941 | 47.7941 | 164279 B |

(Vector.IsHardwareAccelerated)


BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
  Job-ECUDLG : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Runtime=.NET Core 5.0  Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1
| Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GetByteCount | EnglishAllAscii | 109.2 μs | 0.19 μs | 0.16 μs | 109.2 μs | 109.0 μs | 109.6 μs | - | - | - | - |
| GetBytes | EnglishAllAscii | 263.9 μs | 0.79 μs | 0.66 μs | 263.6 μs | 263.3 μs | 265.2 μs | 49.5690 | 49.5690 | 49.5690 | 163855 B |
| GetString | EnglishAllAscii | 206.2 μs | 1.16 μs | 0.96 μs | 205.9 μs | 205.0 μs | 208.5 μs | 99.3151 | 99.3151 | 99.3151 | 327677 B |
| GetByteCount | EnglishMostlyAscii | 274.0 μs | 0.23 μs | 0.18 μs | 274.0 μs | 273.5 μs | 274.2 μs | - | - | - | - |
| GetBytes | EnglishMostlyAscii | 612.1 μs | 2.85 μs | 2.38 μs | 611.3 μs | 610.3 μs | 618.0 μs | 50.4808 | 50.4808 | 50.4808 | 169895 B |
| GetString | EnglishMostlyAscii | 573.9 μs | 4.85 μs | 4.53 μs | 572.0 μs | 569.4 μs | 583.5 μs | 98.2143 | 98.2143 | 98.2143 | 327685 B |
| GetByteCount | Chinese | 125.3 μs | 0.26 μs | 0.23 μs | 125.3 μs | 124.9 μs | 125.7 μs | - | - | - | - |
| GetBytes | Chinese | 615.4 μs | 2.08 μs | 1.84 μs | 615.8 μs | 610.7 μs | 617.8 μs | 55.2885 | 55.2885 | 55.2885 | 177769 B |
| GetString | Chinese | 900.0 μs | 6.04 μs | 5.65 μs | 896.8 μs | 894.4 μs | 912.4 μs | 45.1389 | 45.1389 | 45.1389 | 150126 B |
| GetByteCount | Cyrillic | 108.8 μs | 0.07 μs | 0.07 μs | 108.7 μs | 108.7 μs | 108.9 μs | - | - | - | - |
| GetBytes | Cyrillic | 466.2 μs | 0.97 μs | 0.76 μs | 466.3 μs | 464.7 μs | 467.1 μs | 29.4118 | 29.4118 | 29.4118 | 100889 B |
| GetString | Cyrillic | 623.3 μs | 1.09 μs | 0.91 μs | 623.3 μs | 621.5 μs | 624.8 μs | 37.5000 | 37.5000 | 37.5000 | 130868 B |
| GetByteCount | Greek | 136.8 μs | 0.15 μs | 0.14 μs | 136.7 μs | 136.6 μs | 137.1 μs | - | - | - | - |
| GetBytes | Greek | 679.6 μs | 2.22 μs | 1.86 μs | 680.1 μs | 675.1 μs | 682.0 μs | 38.0435 | 38.0435 | 38.0435 | 129260 B |
| GetString | Greek | 971.7 μs | 6.79 μs | 6.35 μs | 968.5 μs | 965.5 μs | 984.9 μs | 47.7941 | 47.7941 | 47.7941 | 164279 B |

kunalspathak commented 4 years ago

> One more observation - if I remove the code under if ((AdvSimd.Arm64.IsSupported && BitConverter.IsLittleEndian) || Sse2.IsSupported) in Utf16Utility.GetPointerToFirstInvalidChar but keep the code under else if (Vector.IsHardwareAccelerated) and replace it with if (Vector.IsHardwareAccelerated) which I presume would be true on Arm64 I will get the following results

If I understand correctly, you are saying that when you revert the changes in GetPointerToFirstInvalidChar() so that the code falls back to the Vector.IsHardwareAccelerated path (which is what happened before we optimized the method with ARM64 intrinsics), we see the improvements?

kunalspathak commented 4 years ago

From an offline conversation with @pgovind and @carlossanlop, I recall that the benchmarks touched the methods improved in https://github.com/dotnet/runtime/pull/39506, https://github.com/dotnet/runtime/pull/39508, https://github.com/dotnet/runtime/pull/39050, https://github.com/dotnet/runtime/pull/39041 and https://github.com/dotnet/runtime/pull/38653 (correct me if I am wrong). While you are there, can you make a similar change in the other places these PRs touched (especially the methods that mimic the SSE logic and could be improved by a different algorithm for ARM64) to see whether the vectorized implementation was fast enough?

echesakov commented 4 years ago

> If I understand it correctly, you are saying that you reverted the changes in GetPointerToFirstInvalidChar () so the code fall backs to Vector.IsHardwareAccelerated (which it was happening before we optimized the method with ARM64 intrinsics), we see the improvements?

@kunalspathak That's right
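
For readers following along, the branch structure under discussion can be probed on any machine with a small standalone program. This is a minimal sketch assuming .NET 5 or later; the class name and the returned labels are hypothetical, and only the two conditions are taken from the thread.

```csharp
using System;
using System.Numerics;
using System.Runtime.Intrinsics.Arm;
using System.Runtime.Intrinsics.X86;

public static class PathProbe
{
    // Mirrors the order of the two conditions quoted in the thread.
    public static string SelectedPath()
    {
        if ((AdvSimd.Arm64.IsSupported && BitConverter.IsLittleEndian) || Sse2.IsSupported)
        {
            return "Vector128 intrinsics (AdvSimd.Arm64 or Sse2)";
        }
        else if (Vector.IsHardwareAccelerated)
        {
            return "Vector<T>";
        }

        // Assumption: neither vectorized block applies on this machine.
        return "no vectorized block";
    }

    public static void Main()
    {
        Console.WriteLine($"AdvSimd.Arm64.IsSupported   = {AdvSimd.Arm64.IsSupported}");
        Console.WriteLine($"Sse2.IsSupported            = {Sse2.IsSupported}");
        Console.WriteLine($"Vector.IsHardwareAccelerated = {Vector.IsHardwareAccelerated}");
        Console.WriteLine($"Selected: {SelectedPath()}");
    }
}
```

Because the first condition wins whenever it is true, the Vector<T> block is unreachable on a little-endian Arm64 machine unless the first block is removed, which is the experiment described above.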

echesakov commented 4 years ago

> From offline conversation with @pgovind and @carlossanlop , I recall that the benchmarks touched the methods improved in #39506, #39508, #39050, #39041 and #38653 (correct me if I am wrong). While you are there, can you do similar change at other places these PRs touched (specially those methods that mimic SSE logic and can be improved by different algorithm for ARM64) to see if the vectorized implementation was fast enough?

@kunalspathak I didn't do exactly what you suggested, because I don't see a clear way to undo all the work and switch back to the vectorized implementations. Instead, I altered the JIT in the following way

diff --git a/src/coreclr/src/jit/hwintrinsic.cpp b/src/coreclr/src/jit/hwintrinsic.cpp
index 5723ac8f322..95f8988babc 100644
--- a/src/coreclr/src/jit/hwintrinsic.cpp
+++ b/src/coreclr/src/jit/hwintrinsic.cpp
@@ -277,7 +277,7 @@ NamedIntrinsic HWIntrinsicInfo::lookupId(Compiler*         comp,

     if (strcmp(methodName, "get_IsSupported") == 0)
     {
-        return isIsaSupported ? NI_IsSupported_True : NI_IsSupported_False;
+        return NI_IsSupported_False;
     }
     else if (!isIsaSupported)
     {

and measured before and after the change.

Before - get_IsSupported returns the "real" value


BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
  Job-ISREIK : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Runtime=.NET Core 5.0  Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1
| Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GetByteCount | EnglishAllAscii | 108.9 μs | 0.14 μs | 0.12 μs | 108.8 μs | 108.8 μs | 109.1 μs | - | - | - | - |
| GetBytes | EnglishAllAscii | 263.9 μs | 2.42 μs | 2.15 μs | 262.9 μs | 261.9 μs | 268.4 μs | 49.7881 | 49.7881 | 49.7881 | 163855 B |
| GetString | EnglishAllAscii | 205.8 μs | 0.36 μs | 0.30 μs | 205.8 μs | 205.3 μs | 206.4 μs | 99.5066 | 99.5066 | 99.5066 | 327677 B |
| GetByteCount | EnglishMostlyAscii | 564.6 μs | 0.92 μs | 0.77 μs | 564.3 μs | 563.8 μs | 566.6 μs | - | - | - | 1 B |
| GetBytes | EnglishMostlyAscii | 917.1 μs | 2.59 μs | 2.17 μs | 916.2 μs | 915.4 μs | 922.5 μs | 52.0833 | 52.0833 | 52.0833 | 169896 B |
| GetString | EnglishMostlyAscii | 569.7 μs | 3.87 μs | 3.43 μs | 568.0 μs | 567.1 μs | 577.1 μs | 98.2143 | 98.2143 | 98.2143 | 327685 B |
| GetByteCount | Chinese | 257.4 μs | 0.37 μs | 0.31 μs | 257.3 μs | 257.1 μs | 258.2 μs | - | - | - | - |
| GetBytes | Chinese | 743.4 μs | 5.77 μs | 5.40 μs | 740.3 μs | 738.1 μs | 751.1 μs | 53.5714 | 53.5714 | 53.5714 | 177768 B |
| GetString | Chinese | 895.0 μs | 1.42 μs | 1.11 μs | 895.0 μs | 893.5 μs | 897.9 μs | 45.1389 | 45.1389 | 45.1389 | 150126 B |
| GetByteCount | Cyrillic | 223.9 μs | 0.24 μs | 0.20 μs | 223.9 μs | 223.5 μs | 224.3 μs | - | - | - | - |
| GetBytes | Cyrillic | 591.2 μs | 2.98 μs | 2.64 μs | 590.0 μs | 588.3 μs | 597.4 μs | 30.0926 | 30.0926 | 30.0926 | 100889 B |
| GetString | Cyrillic | 631.7 μs | 4.12 μs | 3.86 μs | 629.9 μs | 627.6 μs | 638.2 μs | 37.5000 | 37.5000 | 37.5000 | 130868 B |
| GetByteCount | Greek | 281.5 μs | 0.23 μs | 0.19 μs | 281.5 μs | 281.3 μs | 282.0 μs | - | - | - | - |
| GetBytes | Greek | 839.5 μs | 5.01 μs | 3.91 μs | 840.5 μs | 827.3 μs | 841.6 μs | 39.4737 | 39.4737 | 39.4737 | 129260 B |
| GetString | Greek | 964.1 μs | 3.75 μs | 3.13 μs | 962.9 μs | 960.6 μs | 971.3 μs | 47.7941 | 47.7941 | 47.7941 | 164279 B |

After - get_IsSupported returns false


BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=6.0.100-alpha.1.20454.4
  [Host]     : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT
  Job-FSEUVY : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.41714), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Runtime=.NET Core 5.0  Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1
| Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GetByteCount | EnglishAllAscii | 109.3 μs | 0.13 μs | 0.11 μs | 109.2 μs | 109.2 μs | 109.5 μs | - | - | - | - |
| GetBytes | EnglishAllAscii | 260.4 μs | 0.70 μs | 0.54 μs | 260.2 μs | 259.8 μs | 261.4 μs | 48.9583 | 48.9583 | 48.9583 | 163854 B |
| GetString | EnglishAllAscii | 293.1 μs | 0.64 μs | 0.57 μs | 293.1 μs | 292.1 μs | 294.4 μs | 99.5370 | 99.5370 | 99.5370 | 327677 B |
| GetByteCount | EnglishMostlyAscii | 274.2 μs | 0.49 μs | 0.43 μs | 274.2 μs | 273.6 μs | 275.2 μs | - | - | - | - |
| GetBytes | EnglishMostlyAscii | 632.0 μs | 1.85 μs | 1.55 μs | 631.8 μs | 629.7 μs | 635.4 μs | 52.5000 | 52.5000 | 52.5000 | 169896 B |
| GetString | EnglishMostlyAscii | 599.5 μs | 2.67 μs | 2.23 μs | 599.6 μs | 597.1 μs | 604.7 μs | 99.5370 | 99.5370 | 99.5370 | 327685 B |
| GetByteCount | Chinese | 125.3 μs | 0.16 μs | 0.15 μs | 125.3 μs | 125.1 μs | 125.6 μs | - | - | - | - |
| GetBytes | Chinese | 610.7 μs | 3.15 μs | 2.63 μs | 609.5 μs | 608.0 μs | 617.2 μs | 55.2885 | 55.2885 | 55.2885 | 177769 B |
| GetString | Chinese | 907.8 μs | 4.83 μs | 4.28 μs | 906.2 μs | 904.2 μs | 917.9 μs | 45.1389 | 45.1389 | 45.1389 | 150126 B |
| GetByteCount | Cyrillic | 108.9 μs | 0.13 μs | 0.11 μs | 108.9 μs | 108.7 μs | 109.0 μs | - | - | - | - |
| GetBytes | Cyrillic | 478.7 μs | 1.72 μs | 1.43 μs | 478.4 μs | 476.2 μs | 481.5 μs | 30.3030 | 30.3030 | 30.3030 | 100889 B |
| GetString | Cyrillic | 659.2 μs | 4.27 μs | 4.00 μs | 657.3 μs | 655.1 μs | 666.1 μs | 39.0625 | 39.0625 | 39.0625 | 130868 B |
| GetByteCount | Greek | 137.0 μs | 0.20 μs | 0.19 μs | 137.0 μs | 136.6 μs | 137.3 μs | - | - | - | - |
| GetBytes | Greek | 692.5 μs | 4.49 μs | 4.20 μs | 690.8 μs | 686.4 μs | 699.9 μs | 38.0435 | 38.0435 | 38.0435 | 129260 B |
| GetString | Greek | 971.4 μs | 0.89 μs | 0.69 μs | 971.2 μs | 970.7 μs | 972.5 μs | 47.7941 | 47.7941 | 47.7941 | 164279 B |

echesakov commented 4 years ago

As @GrabYourPitchforks pointed out in a Teams chat, we don't need PopCount when computing the number of non-ASCII characters.
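
As a scalar illustration of why a population count is not inherently required (this is not one of the proposed changes, and the names are hypothetical), the UTF-8 byte count of well-formed UTF-16 text can be written as a sum of per-char comparisons plus a correction for surrogate pairs. The sketch assumes the input contains no unpaired surrogates.

```csharp
using System;
using System.Text;

public static class Utf8ByteCountSketch
{
    // Assumes well-formed UTF-16 input (no unpaired surrogates).
    public static int CountUtf8Bytes(string text)
    {
        // Every char contributes at least one byte.
        int count = text.Length;

        foreach (char c in text)
        {
            if (c >= 0x80) count++;   // two or more UTF-8 bytes
            if (c >= 0x800) count++;  // three UTF-8 bytes

            // A surrogate pair was counted as 3 + 3 bytes but encodes to 4,
            // so subtract 2 once per pair (on the high surrogate).
            if (char.IsHighSurrogate(c)) count -= 2;
        }

        return count;
    }

    public static void Main()
    {
        string[] samples = { "hello", "h\u00E9llo", "\u4E2D\u6587", "\uD83D\uDE00 ok" };

        foreach (string sample in samples)
        {
            int expected = Encoding.UTF8.GetByteCount(sample);
            int actual = CountUtf8Bytes(sample);
            Console.WriteLine($"length {sample.Length}: sketch = {actual}, Encoding.UTF8 = {expected}");
        }
    }
}
```

In a vectorized loop the same additions can, in principle, be accumulated from comparison results without first collapsing them into a scalar bit mask.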

@carlossanlop When working on this issue you can consider one of the changes below to eliminate PopCount. Both have almost the same performance characteristics based on this benchmark.

Baseline


BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=5.0.100-rc.1.20454.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.45114, CoreFX 5.0.20.45114), Arm64 RyuJIT
  Job-KKGTQA : .NET Core 5.0.0 (CoreCLR 5.0.20.45114, CoreFX 5.0.20.45114), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Runtime=.NET Core 5.0  Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1
| Method | Input | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---|---|---|---|---|---|---|---|---|---|---|---|
| GetByteCount | EnglishAllAscii | 109.4 μs | 0.17 μs | 0.15 μs | 109.4 μs | 109.3 μs | 109.7 μs | - | - | - | - |
| GetBytes | EnglishAllAscii | 261.9 μs | 3.65 μs | 3.42 μs | 259.7 μs | 259.1 μs | 268.4 μs | 49.1803 | 49.1803 | 49.1803 | 163854 B |
| GetString | EnglishAllAscii | 207.3 μs | 2.09 μs | 1.85 μs | 206.5 μs | 205.5 μs | 211.2 μs | 99.6622 | 99.6622 | 99.6622 | 327677 B |
| GetByteCount | EnglishMostlyAscii | 565.0 μs | 0.63 μs | 0.59 μs | 565.2 μs | 563.6 μs | 565.7 μs | - | - | - | 1 B |
| GetBytes | EnglishMostlyAscii | 916.0 μs | 6.50 μs | 6.08 μs | 913.0 μs | 910.4 μs | 927.8 μs | 52.0833 | 52.0833 | 52.0833 | 169896 B |
| GetString | EnglishMostlyAscii | 567.2 μs | 1.91 μs | 1.59 μs | 567.0 μs | 565.6 μs | 571.4 μs | 98.2143 | 98.2143 | 98.2143 | 327685 B |
| GetByteCount | Chinese | 257.5 μs | 0.42 μs | 0.37 μs | 257.4 μs | 256.9 μs | 258.2 μs | - | - | - | - |
| GetBytes | Chinese | 750.7 μs | 5.85 μs | 5.48 μs | 747.1 μs | 745.4 μs | 762.8 μs | 53.5714 | 53.5714 | 53.5714 | 177768 B |
| GetString | Chinese | 901.0 μs | 11.19 μs | 10.47 μs | 894.8 μs | 891.6 μs | 920.5 μs | 45.1389 | 45.1389 | 45.1389 | 150127 B |
| GetByteCount | Cyrillic | 224.1 μs | 0.16 μs | 0.14 μs | 224.2 μs | 223.8 μs | 224.3 μs | - | - | - | - |
| GetBytes | Cyrillic | 583.2 μs | 1.17 μs | 0.91 μs | 583.1 μs | 581.2 μs | 584.6 μs | 30.0926 | 30.0926 | 30.0926 | 100889 B |
| GetString | Cyrillic | 634.6 μs | 1.79 μs | 1.50 μs | 634.0 μs | 633.0 μs | 638.2 μs | 37.5000 | 37.5000 | 37.5000 | 130868 B |
| GetByteCount | Greek | 281.6 μs | 0.40 μs | 0.35 μs | 281.6 μs | 281.2 μs | 282.5 μs | - | - | - | - |
| GetBytes | Greek | 836.5 μs | 2.49 μs | 1.94 μs | 836.3 μs | 834.1 μs | 841.1 μs | 39.4737 | 39.4737 | 39.4737 | 129260 B |
| GetString | Greek | 951.8 μs | 6.22 μs | 5.51 μs | 951.1 μs | 944.0 μs | 965.1 μs | 47.7941 | 47.7941 | 47.7941 | 164279 B |

Vector ops

diff --git a/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs b/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
index f2df0ccdf53..d395252384a 100644
--- a/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
+++ b/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
@@ -146,16 +146,18 @@ static Utf16Utility()

                         Vector128<ushort> charIsThreeByteUtf8Encoded;
                         uint mask;
+                        uint popcnt;

                         if (AdvSimd.IsSupported)
                         {
                             charIsThreeByteUtf8Encoded = AdvSimd.Subtract(vectorZero, AdvSimd.ShiftRightLogical(utf16Data, 11));
-                            mask = GetNonAsciiBytes(AdvSimd.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte(), bitMask128);
+                            popcnt = CountNonAsciiBytes(AdvSimd.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte());
                         }
                         else
                         {
                             charIsThreeByteUtf8Encoded = Sse2.Subtract(vectorZero, Sse2.ShiftRightLogical(utf16Data, 11));
                             mask = (uint)Sse2.MoveMask(Sse2.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte());
+                            popcnt = (uint)BitOperations.PopCount(mask);
                         }

                         // Each even bit of mask will be 1 only if the char was >= 0x0080,
@@ -182,8 +184,6 @@ static Utf16Utility()
                         // unpaired surrogates in our data. (Unpaired surrogates would invalidate
                         // our computed result and we'd have to throw it away.)

-                        uint popcnt = (uint)BitOperations.PopCount(mask);
-
                         // Surrogates need to be special-cased for two reasons: (a) we need
                         // to account for the fact that we over-counted in the addition above;
                         // and (b) they require separate validation.
@@ -485,6 +485,17 @@ static Utf16Utility()
             return pInputBuffer;
         }

+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        private static uint CountNonAsciiBytes(Vector128<byte> vec)
+        {
+            vec = AdvSimd.ShiftRightLogical(vec, 7);
+            vec = AdvSimd.Arm64.AddPairwise(vec, vec);
+            vec = AdvSimd.Arm64.AddPairwise(vec, vec);
+            vec = AdvSimd.Arm64.AddPairwise(vec, vec);
+            vec = AdvSimd.Arm64.AddPairwise(vec, vec);
+            return vec.ToScalar();
+        }
+
         [MethodImpl(MethodImplOptions.AggressiveInlining)]
         private static uint GetNonAsciiBytes(Vector128<byte> value, Vector128<byte> bitMask128)
         {

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=5.0.100-rc.1.20454.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.45114), Arm64 RyuJIT
  Job-HJYIJI : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.45114), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Runtime=.NET Core 5.0  Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1
Method Input Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
GetByteCount EnglishAllAscii 109.4 μs 0.55 μs 0.48 μs 109.1 μs 108.9 μs 110.3 μs - - - -
GetBytes EnglishAllAscii 262.6 μs 3.82 μs 3.57 μs 260.4 μs 259.5 μs 269.4 μs 49.1803 49.1803 49.1803 163854 B
GetString EnglishAllAscii 205.8 μs 0.44 μs 0.37 μs 205.6 μs 205.4 μs 206.5 μs 99.5066 99.5066 99.5066 327677 B
GetByteCount EnglishMostlyAscii 332.3 μs 0.26 μs 0.23 μs 332.3 μs 332.0 μs 332.8 μs - - - -
GetBytes EnglishMostlyAscii 674.1 μs 1.54 μs 1.37 μs 673.9 μs 672.5 μs 676.7 μs 52.0833 52.0833 52.0833 169896 B
GetString EnglishMostlyAscii 573.1 μs 7.73 μs 6.85 μs 569.8 μs 567.9 μs 590.3 μs 98.2143 98.2143 98.2143 327685 B
GetByteCount Chinese 151.6 μs 0.11 μs 0.09 μs 151.6 μs 151.5 μs 151.8 μs - - - -
GetBytes Chinese 642.5 μs 2.60 μs 2.17 μs 641.8 μs 640.0 μs 646.9 μs 55.0000 55.0000 55.0000 177769 B
GetString Chinese 897.8 μs 9.84 μs 8.72 μs 892.7 μs 890.3 μs 916.3 μs 45.1389 45.1389 45.1389 150126 B
GetByteCount Cyrillic 131.9 μs 0.24 μs 0.20 μs 132.0 μs 131.6 μs 132.2 μs - - - -
GetBytes Cyrillic 485.1 μs 5.13 μs 4.55 μs 482.1 μs 481.5 μs 494.4 μs 30.3030 30.3030 30.3030 100890 B
GetString Cyrillic 626.3 μs 4.51 μs 4.00 μs 624.5 μs 622.5 μs 635.3 μs 37.5000 37.5000 37.5000 130868 B
GetByteCount Greek 165.7 μs 0.15 μs 0.12 μs 165.7 μs 165.6 μs 166.0 μs - - - -
GetBytes Greek 707.8 μs 6.85 μs 6.08 μs 707.8 μs 699.7 μs 721.7 μs 38.0435 38.0435 38.0435 129260 B
GetString Greek 967.0 μs 3.19 μs 2.49 μs 966.4 μs 965.1 μs 974.3 μs 47.7941 47.7941 47.7941 164279 B

Vector ops + Scalar ops

diff --git a/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs b/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
index f2df0ccdf53..c958d7ecee4 100644
--- a/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
+++ b/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs
@@ -146,16 +146,18 @@ static Utf16Utility()

                         Vector128<ushort> charIsThreeByteUtf8Encoded;
                         uint mask;
+                        uint popcnt;

                         if (AdvSimd.IsSupported)
                         {
                             charIsThreeByteUtf8Encoded = AdvSimd.Subtract(vectorZero, AdvSimd.ShiftRightLogical(utf16Data, 11));
-                            mask = GetNonAsciiBytes(AdvSimd.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte(), bitMask128);
+                            popcnt = CountNonAsciiBytes(AdvSimd.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte());
                         }
                         else
                         {
                             charIsThreeByteUtf8Encoded = Sse2.Subtract(vectorZero, Sse2.ShiftRightLogical(utf16Data, 11));
                             mask = (uint)Sse2.MoveMask(Sse2.Or(charIsNonAscii, charIsThreeByteUtf8Encoded).AsByte());
+                            popcnt = (uint)BitOperations.PopCount(mask);
                         }

                         // Each even bit of mask will be 1 only if the char was >= 0x0080,
@@ -182,8 +184,6 @@ static Utf16Utility()
                         // unpaired surrogates in our data. (Unpaired surrogates would invalidate
                         // our computed result and we'd have to throw it away.)

-                        uint popcnt = (uint)BitOperations.PopCount(mask);
-
                         // Surrogates need to be special-cased for two reasons: (a) we need
                         // to account for the fact that we over-counted in the addition above;
                         // and (b) they require separate validation.
@@ -485,6 +485,17 @@ static Utf16Utility()
             return pInputBuffer;
         }

+        [MethodImpl(MethodImplOptions.AggressiveInlining)]
+        private static uint CountNonAsciiBytes(Vector128<byte> vec)
+        {
+            vec = AdvSimd.ShiftRightLogical(vec, 7);
+            ulong temp = AdvSimd.Arm64.AddPairwiseScalar(vec.AsUInt64()).ToScalar();
+            temp += (temp >> 32);
+            temp += (temp >> 16);
+            temp += (temp >> 8);
+            return (byte)temp;
+        }
+
         [MethodImpl(MethodImplOptions.AggressiveInlining)]
         private static uint GetNonAsciiBytes(Vector128<byte> value, Vector128<byte> bitMask128)
         {

BenchmarkDotNet=v0.12.1.1405-nightly, OS=ubuntu 18.04
ARMv8 Processor rev 1 (v8l), 4 logical cores
.NET Core SDK=5.0.100-rc.1.20454.5
  [Host]     : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.45114), Arm64 RyuJIT
  Job-WVTENM : .NET Core 5.0.0 (CoreCLR 42.42.42.42424, CoreFX 5.0.20.45114), Arm64 RyuJIT

PowerPlanMode=00000000-0000-0000-0000-000000000000  Runtime=.NET Core 5.0  Arguments=/p:DebugType=portable
Toolchain=netcoreapp5.0  IterationTime=250.0000 ms  MaxIterationCount=20
MinIterationCount=15  WarmupCount=1
Method Input Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
GetByteCount EnglishAllAscii 109.3 μs 0.69 μs 0.61 μs 109.1 μs 108.7 μs 110.4 μs - - - -
GetBytes EnglishAllAscii 260.8 μs 1.28 μs 1.07 μs 260.6 μs 259.7 μs 263.5 μs 48.9583 48.9583 48.9583 163854 B
GetString EnglishAllAscii 207.2 μs 2.16 μs 1.80 μs 206.4 μs 205.6 μs 212.1 μs 99.6622 99.6622 99.6622 327677 B
GetByteCount EnglishMostlyAscii 329.4 μs 0.15 μs 0.12 μs 329.4 μs 329.2 μs 329.6 μs - - - -
GetBytes EnglishMostlyAscii 673.6 μs 5.31 μs 4.96 μs 672.2 μs 667.7 μs 681.8 μs 52.0833 52.0833 52.0833 169896 B
GetString EnglishMostlyAscii 566.7 μs 1.20 μs 1.00 μs 566.2 μs 565.7 μs 569.0 μs 98.2143 98.2143 98.2143 327685 B
GetByteCount Chinese 150.4 μs 0.11 μs 0.10 μs 150.4 μs 150.3 μs 150.7 μs - - - -
GetBytes Chinese 636.4 μs 2.00 μs 1.67 μs 635.9 μs 634.3 μs 640.7 μs 55.0000 55.0000 55.0000 177769 B
GetString Chinese 899.4 μs 4.00 μs 3.75 μs 898.8 μs 895.3 μs 908.0 μs 45.1389 45.1389 45.1389 150126 B
GetByteCount Cyrillic 130.6 μs 0.13 μs 0.11 μs 130.6 μs 130.4 μs 130.8 μs - - - -
GetBytes Cyrillic 483.9 μs 2.54 μs 2.25 μs 483.9 μs 480.8 μs 488.9 μs 30.3030 30.3030 30.3030 100889 B
GetString Cyrillic 637.4 μs 4.27 μs 3.99 μs 635.4 μs 633.5 μs 646.2 μs 37.5000 37.5000 37.5000 130868 B
GetByteCount Greek 164.6 μs 0.23 μs 0.21 μs 164.6 μs 164.2 μs 164.9 μs - - - -
GetBytes Greek 704.7 μs 6.24 μs 5.83 μs 702.4 μs 698.1 μs 715.8 μs 38.0435 38.0435 38.0435 129260 B
GetString Greek 948.6 μs 7.42 μs 6.94 μs 947.5 μs 940.8 μs 964.7 μs 46.8750 46.8750 46.8750 164279 B

cc @TamarChristinaArm who can suggest what would be the best choice here.
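To make the two reductions above easier to compare, here is a minimal scalar sketch (no hardware intrinsics, so it runs on any platform). It models the x86-style "MoveMask then PopCount" path next to the shift-and-fold arithmetic of the "Vector ops + Scalar ops" variant. The class and method names are hypothetical and do not exist in the runtime; the sketch only illustrates that both approaches count the bytes whose high bit is set.

```csharp
using System;
using System.Numerics;

// Hypothetical scalar model; not part of dotnet/runtime.
public static class NonAsciiCountModel
{
    // Models the x86 path: build a 16-bit mask from the high bit of each byte
    // (what Sse2.MoveMask produces), then count the set bits.
    public static uint CountViaMaskPopCount(ReadOnlySpan<byte> block16)
    {
        uint mask = 0;
        for (int i = 0; i < 16; i++)
        {
            mask |= (uint)(block16[i] >> 7) << i;
        }
        return (uint)BitOperations.PopCount(mask);
    }

    // Models the "Vector ops + Scalar ops" variant: shift every byte right by 7,
    // add the two 64-bit halves, then fold the byte lanes with shifts and adds.
    public static uint CountViaFolding(ReadOnlySpan<byte> block16)
    {
        ulong lo = 0, hi = 0;
        for (int i = 0; i < 8; i++)
        {
            lo |= (ulong)(block16[i] >> 7) << (8 * i);
            hi |= (ulong)(block16[i + 8] >> 7) << (8 * i);
        }

        ulong temp = lo + hi; // every byte lane holds 0..2, so no carry crosses lanes
        temp += temp >> 32;   // low four lanes now hold 0..4
        temp += temp >> 16;   // low two lanes now hold 0..8
        temp += temp >> 8;    // lowest lane now holds 0..16
        return (byte)temp;    // keep only the lowest lane
    }
}
```

Because the per-lane sums never exceed 16, the folding never overflows a byte lane, which is why the final truncation to `byte` is safe in this model.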

TamarChristinaArm commented 4 years ago

@echesakovMSFT I'll need to take a closer look, but as an initial assessment, the temp += (temp >> 32); should be slightly better if you are generating an ADD with a shifted register (as in, a single instruction rather than a separate add + shift).

That said, looking at the algorithm, do you really need the reduction inside the loop? The value seems to really only be a counter. So instead, can't you keep the value as a Vector128<uint> during the loop, perform the final addp and the move to the general-register side after the loop, and add it to tempUtf8CodeUnitCountAdjustment?

I think we should look at the function as a whole instead of piecewise.

For instance, since the only things done on popcnt are add and sub, there's no need to transfer between register files in the loop.

+        private static Vector64<uint> CountNonAsciiBytes(Vector128<byte> vec)

and using AddScalar instead during the loop avoids the transfer as we can do scalar arithmetic on the SIMD side.

GrabYourPitchforks commented 4 years ago

You're right. You could avoid doing it piecemeal inside the loop, but you'd need to use caution to avoid integer overflow. If you assume that the accumulator vector is a Vector128<byte>, then you could run at most 255 loop iterations without risking overflow. Then, before every 256th iteration, you'd horizontal add the vector accumulator elements together and add the result to a running scalar accumulator.
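As a rough illustration of that overflow caution, here is a scalar sketch of a blocked accumulator. It assumes each loop iteration adds at most 1 to each of 16 byte-wide lanes, so the lanes are flushed into a wide scalar total before a 256th addition could wrap them. All names are hypothetical, and the real loop would use vector registers rather than a stack buffer.

```csharp
using System;

// Hypothetical scalar model of byte-wide lane counters with periodic flushing.
public static class BlockedAccumulatorModel
{
    private const int Lanes = 16;                  // models a Vector128<byte>
    private const int MaxIterationsPerFlush = 255; // a byte lane can hold at most 255

    public static ulong CountNonAsciiBytes(ReadOnlySpan<byte> data)
    {
        Span<byte> laneCounters = stackalloc byte[Lanes];
        laneCounters.Clear();

        ulong total = 0;
        int iterationsSinceFlush = 0;
        int i = 0;

        for (; i + Lanes <= data.Length; i += Lanes)
        {
            // Flush before the 256th addition so no lane can wrap around.
            if (iterationsSinceFlush == MaxIterationsPerFlush)
            {
                total += Flush(laneCounters);
                iterationsSinceFlush = 0;
            }

            for (int lane = 0; lane < Lanes; lane++)
            {
                laneCounters[lane] += (byte)(data[i + lane] >> 7); // adds 0 or 1
            }
            iterationsSinceFlush++;
        }

        total += Flush(laneCounters);

        // Scalar tail for the bytes that do not fill a whole block.
        for (; i < data.Length; i++)
        {
            total += (ulong)(data[i] >> 7);
        }
        return total;
    }

    // Models the horizontal add: sum all lanes, then reset them.
    private static ulong Flush(Span<byte> laneCounters)
    {
        ulong sum = 0;
        foreach (byte lane in laneCounters)
        {
            sum += lane;
        }
        laneCounters.Clear();
        return sum;
    }
}
```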

TamarChristinaArm commented 4 years ago

Then, before every 256th iteration, you'd horizontal add the vector accumulator elements together and add the result to a running scalar accumulator.

You can avoid that by using a widening pairwise addition UADDLP (instead of a normal pairwise add) until you get a Vector128<uint>, and then use a widening addition when accumulating into your counter, which can be a Vector128<ulong>. You'd need to accumulate into two Vector128<ulong> using UADDW{2} and add those two up outside the loop, but that's just a cheap loop epilogue.

I think you can also avoid the extra register pressure by using widening pairwise additions (UADDLP) instead of the normal ADDP to get a single Vector128<ulong>. It requires one less register but requires an additional VADD into the counter.
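A scalar sketch of the second idea (widening pairwise adds all the way to two 64-bit lanes, then a plain add into the counter) might look like the following. The helper names are hypothetical; each `PairwiseWiden` overload only models what a UADDLP step does to the lanes. The corresponding managed intrinsic is probably `AdvSimd.AddPairwiseWidening`, but treat that mapping as an assumption to verify rather than something this thread confirms.

```csharp
using System;

// Hypothetical scalar model of a UADDLP chain: 16 x byte -> 8 x ushort -> 4 x uint -> 2 x ulong.
public static class WideningPairwiseModel
{
    // Each overload sums adjacent pairs into lanes that are twice as wide.
    private static ushort[] PairwiseWiden(byte[] v)
    {
        var result = new ushort[v.Length / 2];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = (ushort)(v[2 * i] + v[2 * i + 1]);
        }
        return result;
    }

    private static uint[] PairwiseWiden(ushort[] v)
    {
        var result = new uint[v.Length / 2];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = (uint)v[2 * i] + v[2 * i + 1];
        }
        return result;
    }

    private static ulong[] PairwiseWiden(uint[] v)
    {
        var result = new ulong[v.Length / 2];
        for (int i = 0; i < result.Length; i++)
        {
            result[i] = (ulong)v[2 * i] + v[2 * i + 1];
        }
        return result;
    }

    public static ulong CountNonAsciiBytes(ReadOnlySpan<byte> data)
    {
        var accumulator = new ulong[2]; // models a single Vector128<ulong> counter
        var block = new byte[16];
        int i = 0;

        for (; i + 16 <= data.Length; i += 16)
        {
            for (int lane = 0; lane < 16; lane++)
            {
                block[lane] = (byte)(data[i + lane] >> 7); // 0 or 1 per byte
            }

            ulong[] widened = PairwiseWiden(PairwiseWiden(PairwiseWiden(block)));
            accumulator[0] += widened[0]; // the extra add into the counter
            accumulator[1] += widened[1];
        }

        // Loop epilogue: one horizontal add of the two 64-bit lanes.
        ulong total = accumulator[0] + accumulator[1];

        for (; i < data.Length; i++)
        {
            total += (ulong)(data[i] >> 7);
        }
        return total;
    }
}
```

With 64-bit lanes the counter cannot realistically overflow, so no periodic flush is needed inside the loop in this model.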

TamarChristinaArm commented 4 years ago

Also, on architectures that support it, you could usually use a UDOT to get a fast widening accumulation from 16b to 4s by using a vector of ones as the multiplicand. However, this would only be beneficial if you needed to accumulate the results as Int. In this case, since you'd want a long, you can't do the accumulation itself in the dot product, so you'd have to use a vector of zeros as the initial value and have a movi before each call, which makes it not really a faster sequence.

I also think it's better to use an AND or BIC here

vec = AdvSimd.ShiftRightLogical(vec, 7);

as USHR is restricted to one NEON pipe where an AND can go in any. You just have to hoist the constant out of the loop.

jeffhandley commented 4 years ago

Based on the recent data, we want to try to at least work around this regression in 5.0.0. I'm not ready to consider it release-blocking, yet, but let's see what the workaround/fix would be.

TamarChristinaArm commented 4 years ago

@jeffhandley Sorry, I'm slightly confused: this ticket is about Utf8Encoding, but so far @echesakovMSFT, @GrabYourPitchforks, and I have been discussing, I believe, Utf16Encoding. Which one has the big regression?

Both of them have some reasonably simple things you can do, without changing the entire algorithm, to speed up the common cases, EnglishAllAscii and EnglishMostlyAscii. Utf8 should be the simplest, as that just finds the first non-ASCII character.

In the initial test you use to check if you have any non-ascii characters you can optimize the

ulong mask = GetNonAsciiBytes(AdvSimd.LoadVector128(pInputBuffer), bitMask128);
if (mask != 0)

to

smaxp   pInputBuffer.16b, pInputBuffer.16b, pInputBuffer.16b
fmov    synd, pInputBuffer
tst     synd, 0x8080808080808080

To find the index, you use https://github.com/dotnet/runtime/pull/39507#discussion_r468097953, which has you restart the calculation, but that's fine since you are exiting the loop anyway.

For Utf8, that's a simple modification within the current algorithm that should allow you to more than recover the performance. Utf16 is a bit more complicated but can use the same trick to avoid doing the more expensive operation until needed.
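The "pairwise max, move 64 bits, test the high bits" check can be modeled in scalar code as below. This is only a sketch with hypothetical names. It also assumes an unsigned pairwise maximum, which is a choice made for this sketch: in a signed comparison a byte at or above 0x80 is negative and would lose against any ASCII neighbor, so its high bit would not survive the reduction.

```csharp
using System;

// Hypothetical scalar model of a cheap "does this 16-byte block contain non-ASCII?" test.
public static class AnyNonAsciiModel
{
    public static bool ContainsNonAscii(ReadOnlySpan<byte> block16)
    {
        // Pairwise unsigned max reduces 16 bytes to 8: if either byte of a pair
        // has its high bit set, the maximum of the pair has it set too.
        ulong syndrome = 0;
        for (int i = 0; i < 8; i++)
        {
            byte max = Math.Max(block16[2 * i], block16[2 * i + 1]);
            syndrome |= (ulong)max << (8 * i);
        }

        // A single 64-bit test of all eight high bits (the TST in the sequence above).
        return (syndrome & 0x8080808080808080UL) != 0;
    }
}
```

The check only answers yes or no; locating the first non-ASCII byte still needs the separate index computation mentioned above.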

GrabYourPitchforks commented 4 years ago

If we really need to do something in 5.0 and we're running out of runway then the absolute safest thing to do would be to change the one line:

https://github.com/dotnet/runtime/blob/7d0d37001c5a0ac1f343cd35b27fea0ebf7e8101/src/libraries/System.Private.CoreLib/src/System/Text/Unicode/Utf16Utility.Validation.cs#L82

To:

if (Sse2.IsSupported)

This will cause the UTF8Encoding.GetByteCount(string) method to fall back to the existing Vector<T>-based code paths as they existed in 3.1 instead of using the new intrinsics that were introduced in 5.0.

I assume that if we want to do this then we'd schedule a "proper" fix to come in during 5.0.1.

GrabYourPitchforks commented 4 years ago

Also, here's a magic decoder ring for the perf tests.

The test method GetByteCount calls Utf16Utility.GetPointerToFirstInvalidChar.

The test method GetBytes calls Utf16Utility.GetPointerToFirstInvalidChar, then calls Utf8Utility.TranscodeToUtf8.

The test method GetString calls Utf8Utility.GetPointerToFirstInvalidByte, then calls Utf8Utility.TranscodeToUtf16.
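For orientation, the public calls those three benchmarks exercise look roughly like this. The wrapper class and method are hypothetical; only the `Encoding.UTF8` calls are real API, and the internal helpers each one reaches are the ones listed in the comment above.

```csharp
using System;
using System.Text;

// Hypothetical wrapper around the three public calls the benchmarks measure.
public static class Utf8EncodingCalls
{
    public static (int ByteCount, byte[] Bytes, string RoundTripped) Run(string text)
    {
        // GetByteCount: validates the UTF-16 input and computes the UTF-8 length.
        int byteCount = Encoding.UTF8.GetByteCount(text);

        // GetBytes: validates the UTF-16 input, then transcodes it to UTF-8.
        byte[] bytes = Encoding.UTF8.GetBytes(text);

        // GetString: validates the UTF-8 input, then transcodes it to UTF-16.
        string roundTripped = Encoding.UTF8.GetString(bytes);

        return (byteCount, bytes, roundTripped);
    }
}
```

Because GetBytes and GetString each combine a validation step with a transcoding step, a slowdown in either benchmark cannot be attributed to a single worker function from these numbers alone.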

jeffhandley commented 4 years ago

On which one is the big regression?

Yeah, the GetByteCount regression is the one that was noteworthy to me, @adamsitnik, and @tannergooding--that one regressed across the board and we should consider addressing that in RC2.

Falling back to the non-vectorized path, as @GrabYourPitchforks suggested, has a low enough risk that it could be considered for Ask Mode (given the existing test coverage of that). And then as @GrabYourPitchforks stated, we could pursue a more complete fix after 5.0.0.

kunalspathak commented 4 years ago

There are regressions in GetBytes() too, and it also calls Utf8Utility.TranscodeToUtf8, but I assume we don't want to revert that because the benchmark also calls Utf16Utility.GetPointerToFirstInvalidChar, which we are reverting anyway?

GrabYourPitchforks commented 4 years ago

I opened https://github.com/dotnet/performance/issues/1512 to track changing the benchmarks so that each benchmark is testing exactly one worker function. But the best evidence we have right now suggests that GetPointerToFirstInvalidChar is the bulk of the regression, so that's where the efforts / reversions are currently being focused.

jeffhandley commented 2 years ago

Re-opening this issue as #42052 was meant to be a temporary workaround and the underlying issue is still open

akoeplinger commented 2 years ago

@jeffhandley should this still be in the 5.0.0 milestone or be moved to 6.0/7.0?

JulieLeeMSFT commented 1 year ago

Assigning to @TIHan.

TIHan commented 1 year ago

I have some data comparing .NET 5, 7, 8:

BenchmarkDotNet=v0.13.2.2052-nightly, OS=Windows 11 (10.0.22000.1574/21H2)
Snapdragon 7c 2.40 GHz, 1 CPU, 8 logical and 8 physical cores
.NET SDK=8.0.100-preview.2.23157.25
  [Host]     : .NET 7.0.4 (7.0.423.11508), Arm64 RyuJIT AdvSIMD
  Job-YXVEBW : .NET 5.0.17 (5.0.1722.21314), Arm64 RyuJIT AdvSIMD
  Job-HYZVQW : .NET 7.0.4 (7.0.423.11508), Arm64 RyuJIT AdvSIMD
  Job-MMQVIX : .NET 8.0.0 (8.0.23.12803), Arm64 RyuJIT AdvSIMD

PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms MaxIterationCount=20 MinIterationCount=15 WarmupCount=1

Method Runtime Input Mean Error StdDev Median Min Max Ratio RatioSD Gen0 Gen1 Gen2 Allocated Alloc Ratio
GetByteCount .NET 5.0 EnglishAllAscii 36.34 us 0.592 us 0.554 us 36.09 us 35.78 us 37.33 us 1.00 0.00 - - - - NA
GetByteCount .NET 7.0 EnglishAllAscii 12.90 us 0.248 us 0.243 us 12.84 us 12.66 us 13.53 us 0.36 0.01 - - - - NA
GetByteCount .NET 8.0 EnglishAllAscii 12.13 us 0.380 us 0.373 us 12.02 us 11.73 us 13.21 us 0.33 0.01 - - - - NA
GetBytes .NET 5.0 EnglishAllAscii 98.71 us 1.625 us 1.440 us 98.52 us 97.00 us 101.97 us 1.00 0.00 52.4691 52.4691 52.4691 167576 B 1.00
GetBytes .NET 7.0 EnglishAllAscii 82.24 us 1.497 us 1.401 us 82.08 us 79.84 us 84.30 us 0.84 0.02 52.4611 52.4611 52.4611 167594 B 1.00
GetBytes .NET 8.0 EnglishAllAscii 78.99 us 1.570 us 1.468 us 78.70 us 76.72 us 81.75 us 0.80 0.02 52.3897 52.3897 52.3897 167594 B 1.00
GetString .NET 5.0 EnglishAllAscii 89.11 us 1.685 us 1.803 us 89.50 us 86.48 us 91.65 us 1.00 0.00 99.9313 99.9313 99.9313 335120 B 1.00
GetString .NET 7.0 EnglishAllAscii 91.48 us 1.823 us 1.951 us 91.26 us 88.83 us 96.25 us 1.03 0.03 99.6429 99.6429 99.6429 335154 B 1.00
GetString .NET 8.0 EnglishAllAscii 88.26 us 1.677 us 1.722 us 88.63 us 85.71 us 90.54 us 0.99 0.02 99.7268 99.7268 99.7268 335154 B 1.00
GetByteCount .NET 5.0 EnglishMostlyAscii 89.67 us 1.470 us 1.375 us 89.46 us 87.98 us 92.34 us 1.00 0.00 - - - - NA
GetByteCount .NET 7.0 EnglishMostlyAscii 71.91 us 0.320 us 0.299 us 71.75 us 71.60 us 72.40 us 0.80 0.01 - - - - NA
GetByteCount .NET 8.0 EnglishMostlyAscii 68.23 us 0.530 us 0.469 us 68.00 us 67.77 us 69.29 us 0.76 0.01 - - - - NA
GetBytes .NET 5.0 EnglishMostlyAscii 221.88 us 1.747 us 1.634 us 221.87 us 219.81 us 225.12 us 1.00 0.00 52.0833 52.0833 52.0833 173616 B 1.00
GetBytes .NET 7.0 EnglishMostlyAscii 207.47 us 1.624 us 1.519 us 207.27 us 205.11 us 210.26 us 0.94 0.01 52.5000 52.5000 52.5000 173634 B 1.00
GetBytes .NET 8.0 EnglishMostlyAscii 206.32 us 4.211 us 4.850 us 204.08 us 201.10 us 216.76 us 0.94 0.02 51.9481 51.9481 51.9481 173634 B 1.00
GetString .NET 5.0 EnglishMostlyAscii 265.13 us 18.047 us 20.783 us 255.43 us 242.75 us 314.89 us 1.00 0.00 87.8906 87.8906 87.8906 335126 B 1.00
GetString .NET 7.0 EnglishMostlyAscii 289.28 us 16.755 us 19.295 us 280.65 us 267.11 us 333.52 us 1.10 0.09 80.0000 80.0000 80.0000 335152 B 1.00
GetString .NET 8.0 EnglishMostlyAscii 408.62 us 117.412 us 135.211 us 405.00 us 264.96 us 668.61 us 1.55 0.52 80.5288 80.5288 80.5288 335152 B 1.00
GetByteCount .NET 5.0 Chinese 40.90 us 0.258 us 0.215 us 40.83 us 40.60 us 41.33 us 1.00 0.00 - - - - NA
GetByteCount .NET 7.0 Chinese 33.29 us 0.187 us 0.166 us 33.29 us 33.09 us 33.64 us 0.81 0.01 - - - - NA
GetByteCount .NET 8.0 Chinese 31.50 us 0.222 us 0.197 us 31.40 us 31.33 us 31.91 us 0.77 0.01 - - - - NA
GetBytes .NET 5.0 Chinese 228.18 us 1.467 us 1.225 us 228.20 us 226.42 us 229.93 us 1.00 0.00 55.2536 55.2536 55.2536 180680 B 1.00
GetBytes .NET 7.0 Chinese 242.76 us 1.883 us 1.761 us 242.37 us 240.12 us 246.59 us 1.06 0.01 54.8077 54.8077 54.8077 180699 B 1.00
GetBytes .NET 8.0 Chinese 235.66 us 2.260 us 1.887 us 235.10 us 234.21 us 240.67 us 1.03 0.01 55.0373 55.0373 55.0373 180699 B 1.00
GetString .NET 5.0 Chinese 366.05 us 2.145 us 1.901 us 365.51 us 363.91 us 370.45 us 1.00 0.00 46.5116 46.5116 46.5116 155960 B 1.00
GetString .NET 7.0 Chinese 384.52 us 1.771 us 1.570 us 383.93 us 382.36 us 387.29 us 1.05 0.01 47.2561 47.2561 47.2561 155977 B 1.00
GetString .NET 8.0 Chinese 394.77 us 3.329 us 3.114 us 394.23 us 391.20 us 401.45 us 1.08 0.01 46.8750 46.8750 46.8750 155976 B 1.00
GetByteCount .NET 5.0 Cyrillic 29.29 us 0.322 us 0.285 us 29.15 us 29.08 us 29.97 us 1.00 0.00 - - - - NA
GetByteCount .NET 7.0 Cyrillic 27.10 us 0.191 us 0.169 us 27.06 us 26.88 us 27.44 us 0.93 0.01 - - - - NA
GetByteCount .NET 8.0 Cyrillic 23.12 us 0.127 us 0.113 us 23.09 us 22.98 us 23.30 us 0.79 0.01 - - - - NA
GetBytes .NET 5.0 Cyrillic 176.68 us 1.383 us 1.226 us 176.72 us 174.90 us 178.73 us 1.00 0.00 31.9767 31.9767 31.9767 102272 B 1.00
GetBytes .NET 7.0 Cyrillic 173.70 us 1.335 us 1.115 us 173.50 us 172.31 us 176.47 us 0.98 0.01 31.9444 31.9444 31.9444 102283 B 1.00
GetBytes .NET 8.0 Cyrillic 169.95 us 1.239 us 1.098 us 169.83 us 167.53 us 172.34 us 0.96 0.01 31.9293 31.9293 31.9293 102283 B 1.00
GetString .NET 5.0 Cyrillic 264.45 us 1.351 us 1.128 us 264.30 us 262.80 us 266.54 us 1.00 0.00 41.3136 41.3136 41.3136 133640 B 1.00
GetString .NET 7.0 Cyrillic 271.45 us 1.796 us 1.680 us 271.21 us 268.76 us 274.51 us 1.03 0.01 40.9483 40.9483 40.9483 133654 B 1.00
GetString .NET 8.0 Cyrillic 268.52 us 1.359 us 1.135 us 268.45 us 266.14 us 270.14 us 1.02 0.01 40.9483 40.9483 40.9483 133654 B 1.00
GetByteCount .NET 5.0 Greek 44.41 us 0.245 us 0.204 us 44.34 us 44.22 us 44.97 us 1.00 0.00 - - - - NA
GetByteCount .NET 7.0 Greek 36.18 us 0.175 us 0.137 us 36.14 us 36.01 us 36.49 us 0.81 0.01 - - - - NA
GetByteCount .NET 8.0 Greek 34.11 us 0.081 us 0.072 us 34.11 us 34.01 us 34.25 us 0.77 0.00 - - - - NA
GetBytes .NET 5.0 Greek 272.78 us 1.707 us 1.513 us 272.91 us 270.64 us 275.45 us 1.00 0.00 40.9483 40.9483 40.9483 131792 B 1.00
GetBytes .NET 7.0 Greek 268.80 us 2.116 us 1.979 us 268.40 us 265.05 us 271.83 us 0.99 0.01 41.3136 41.3136 41.3136 131807 B 1.00
GetBytes .NET 8.0 Greek 264.43 us 1.487 us 1.319 us 264.35 us 261.96 us 266.61 us 0.97 0.01 40.9483 40.9483 40.9483 131806 B 1.00
GetString .NET 5.0 Greek 422.14 us 4.472 us 4.183 us 421.95 us 416.77 us 429.72 us 1.00 0.00 52.3649 52.3649 52.3649 169352 B 1.00
GetString .NET 7.0 Greek 427.54 us 3.342 us 2.963 us 426.73 us 424.04 us 433.39 us 1.01 0.01 52.3649 52.3649 52.3649 169370 B 1.00
GetString .NET 8.0 Greek 423.67 us 2.330 us 2.179 us 423.65 us 420.46 us 428.13 us 1.00 0.01 50.9868 50.9868 50.9868 169370 B 1.00

It looks like GetByteCount has improved but some of the GetString cases have regressed.

kunalspathak commented 1 year ago

The original issue reported a regression compared with .NET 3.1 (@adamsitnik, do you remember if that is accurate?). If so, we might need to compare with .NET 3.1. At that time we only had Linux arm64, though, so you will have to test it on a Linux arm64 box.

adamsitnik commented 1 year ago

@adamsitnik, do you remember if that is accurate?

I don't remember the details, but looking at my old description of the issue I am sure that you are right, it was a 3.1 vs 5.0 regression found on Ubuntu machines (the ones owned by the JIT Team, as back then I had no access to any other arm machines).

TIHan commented 1 year ago

These results are from a Linux ARM64 box.

.NET 3.1 results:
Method Input Mean Error StdDev Median Min Max Gen0 Gen1 Gen2 Allocated
GetByteCount EnglishAllAscii 34.04 us 0.067 us 0.062 us 34.04 us 33.94 us 34.16 us - - - -
GetBytes EnglishAllAscii 82.25 us 0.189 us 0.168 us 82.25 us 82.07 us 82.65 us 49.7382 49.7382 49.7382 163840 B
GetString EnglishAllAscii 71.68 us 0.313 us 0.261 us 71.72 us 71.35 us 72.08 us 99.7159 99.7159 99.7159 327648 B
GetByteCount EnglishMostlyAscii 93.30 us 0.840 us 0.785 us 93.60 us 91.36 us 93.65 us - - - -
GetBytes EnglishMostlyAscii 237.55 us 0.298 us 0.279 us 237.53 us 237.14 us 237.99 us 51.7857 51.7857 51.7857 169880 B
GetString EnglishMostlyAscii 217.55 us 0.410 us 0.384 us 217.46 us 216.93 us 218.19 us 99.8264 99.8264 99.8264 327656 B
GetByteCount Chinese 41.65 us 0.364 us 0.340 us 41.42 us 41.41 us 42.13 us - - - -
GetBytes Chinese 199.19 us 0.366 us 0.306 us 199.07 us 198.84 us 199.82 us 55.3797 55.3797 55.3797 177752 B
GetString Chinese 325.68 us 0.185 us 0.173 us 325.67 us 325.31 us 326.04 us 46.8750 46.8750 46.8750 150112 B
GetByteCount Cyrillic 36.38 us 0.005 us 0.004 us 36.38 us 36.37 us 36.39 us - - - -
GetBytes Cyrillic 163.33 us 0.059 us 0.046 us 163.33 us 163.25 us 163.43 us 30.6122 30.6122 30.6122 100880 B
GetString Cyrillic 225.84 us 0.401 us 0.313 us 225.91 us 225.26 us 226.32 us 39.8551 39.8551 39.8551 130856 B
GetByteCount Greek 46.46 us 0.005 us 0.005 us 46.46 us 46.46 us 46.47 us - - - -
GetBytes Greek 240.81 us 0.787 us 0.736 us 240.88 us 239.89 us 242.25 us 39.4231 39.4231 39.4231 129248 B
GetString Greek 346.43 us 0.396 us 0.331 us 346.44 us 345.90 us 347.13 us 48.6111 48.6111 48.6111 164264 B

.NET 7 results:
Method Input Mean Error StdDev Median Min Max Gen0 Gen1 Gen2 Allocated
GetByteCount EnglishAllAscii 10.60 us 0.002 us 0.002 us 10.60 us 10.59 us 10.60 us - - - -
GetBytes EnglishAllAscii 44.57 us 0.058 us 0.054 us 44.58 us 44.47 us 44.65 us 49.9302 49.9302 49.9302 163874 B
GetString EnglishAllAscii 59.10 us 0.193 us 0.181 us 59.02 us 58.86 us 59.48 us 99.8134 99.8134 99.8134 327715 B
GetByteCount EnglishMostlyAscii 73.94 us 0.096 us 0.090 us 73.90 us 73.81 us 74.12 us - - - -
GetBytes EnglishMostlyAscii 179.52 us 0.194 us 0.172 us 179.52 us 179.20 us 179.80 us 52.5568 52.5568 52.5568 169916 B
GetString EnglishMostlyAscii 184.24 us 0.204 us 0.159 us 184.25 us 183.97 us 184.44 us 99.2647 99.2647 99.2647 327723 B
GetByteCount Chinese 34.35 us 0.035 us 0.033 us 34.33 us 34.32 us 34.40 us - - - -
GetBytes Chinese 192.66 us 0.160 us 0.134 us 192.66 us 192.45 us 192.88 us 54.8780 54.8780 54.8780 177790 B
GetString Chinese 309.72 us 0.156 us 0.139 us 309.71 us 309.50 us 309.94 us 46.5686 46.5686 46.5686 150144 B
GetByteCount Cyrillic 21.09 us 0.033 us 0.027 us 21.08 us 21.08 us 21.17 us - - - -
GetBytes Cyrillic 132.44 us 0.189 us 0.168 us 132.41 us 132.24 us 132.79 us 30.9874 30.9874 30.9874 100901 B
GetString Cyrillic 215.88 us 0.380 us 0.337 us 215.85 us 215.41 us 216.58 us 39.9306 39.9306 39.9306 130884 B
GetByteCount Greek 38.45 us 0.292 us 0.273 us 38.60 us 37.94 us 38.70 us - - - -
GetBytes Greek 212.63 us 0.644 us 0.602 us 212.42 us 211.87 us 214.07 us 39.3836 39.3836 39.3836 129275 B
GetString Greek 331.28 us 0.825 us 0.731 us 331.00 us 330.44 us 333.17 us 49.2021 49.2021 49.2021 164298 B

The .NET 7 results are an across-the-board improvement over the .NET 3.1 results. cc @kunalspathak

TIHan commented 1 year ago

Closing as these are not regressions anymore.

kunalspathak commented 1 year ago

Thanks @TIHan for checking this.