[Perf -17%] System.Collections.ContainsKeyFalse<Int32, Int32>.ImmutableDictionary

performanceautofiler[bot] commented 4 years ago

Run Information

Architecture	x64
OS	ubuntu 18.04
Changes	diff

Regressions in System.Collections.ContainsKeyFalse<Int32, Int32>

Benchmark	Baseline	Test	Test/Base	Modality	Baseline Outlier	Baseline ETL	Compare ETL
[ImmutableDictionary](<https://pvscmdupload.blob.core.windows.net/reports/allTestHistory/refs/heads/master_x64_ubuntu 18.04/System.Collections.ContainsKeyFalse(Int32%2c%20Int32).ImmutableDictionary(Size%3a%20512).html>)	14.89 μs	17.39 μs	1.17		True

graph Historical Data in Reporting System

Repro

git clone https://github.com/dotnet/performance.git
python3 .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter 'System.Collections.ContainsKeyFalse<Int32, Int32>*'

### Histogram

#### System.Collections.ContainsKeyFalse.ImmutableDictionary(Size: 512)

```log
[13782.968 ; 14477.519) | @@@@@@@@@@@@@
[14477.519 ; 15154.104) | @@@@@@@@@@@@@@@@@@@@@
[15154.104 ; 15543.818) |
[15543.818 ; 16298.754) | @@@@@@
[16298.754 ; 17000.577) | @@@@@@@@@@@@@@@@
[17000.577 ; 17677.162) | @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
```

### Docs

[Profiling workflow for dotnet/runtime repository](https://github.com/dotnet/performance/blob/master/docs/profiling-workflow-dotnet-runtime.md)
[Benchmarking workflow for dotnet/runtime repository](https://github.com/dotnet/performance/blob/master/docs/benchmarking-workflow-dotnet-runtime.md)

kunalspathak commented 3 years ago

Before loop alignment changes:

Method	Size	Mean	Error	StdDev	Median	Min	Max	Gen 0	Gen 1	Gen 2	Allocated
ImmutableDictionary	512	17.33 us	0.063 us	0.056 us	17.34 us	17.21 us	17.43 us	-	-	-	-
ImmutableDictionary	512	17.61 us	0.151 us	0.141 us	17.58 us	17.35 us	17.84 us	-	-	-	-
ImmutableDictionary	512	17.58 us	0.121 us	0.107 us	17.57 us	17.42 us	17.81 us	-	-	-	-

After loop alignment changes:

Method	Size	Mean	Error	StdDev	Median	Min	Max	Gen 0	Gen 1	Gen 2	Allocated
ImmutableDictionary	512	18.56 us	0.159 us	0.141 us	18.52 us	18.38 us	18.84 us	-	-	-	-
ImmutableDictionary	512	18.49 us	0.240 us	0.224 us	18.50 us	18.19 us	19.03 us	-	-	-	-
ImmutableDictionary	512	18.44 us	0.117 us	0.104 us	18.45 us	18.30 us	18.66 us	-	-	-	-
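To put a number on the gap between the two tables, here is a minimal sketch that averages the reported means and takes their ratio. The helper name `slowdown` is made up for illustration; the inputs are just the Mean column values copied from the tables above. On these numbers the "after" runs come out roughly 5 to 6% slower.

```python
from statistics import mean

# Mean column (in microseconds) copied from the two tables above.
before_us = [17.33, 17.61, 17.58]
after_us = [18.56, 18.49, 18.44]


def slowdown(baseline, test):
    """Return average(test) / average(baseline); values above 1.0 mean slower."""
    return mean(test) / mean(baseline)


ratio = slowdown(before_us, after_us)
print(f"before: {mean(before_us):.2f} us  after: {mean(after_us):.2f} us  ratio: {ratio:.3f}")
```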

The regression might be coming from the extra padding (24 bytes of NOPs before the loop) that we added in TryGetValue().

Assembly code of TryGetValue()

```asm
G_M1624_IG02:              ;; offset=0005H
       00007ffb`9af1e3a5        488BF1               mov      rsi, rcx
       00007ffb`9af1e3a8        0F1F8400000000000F1F80000000000F1F84000000000090 align
; =========================== 32B boundary ===========================
       00007ffb`9af1e3c0                             align
       00007ffb`9af1e3c0                             align
                                                ;; bbWeight=1    PerfScore 1.00
G_M1624_IG03:              ;; offset=0020H
       00007ffb`9af1e3c0        48837E0800           cmp      gword ptr [rsi+8], 0
       00007ffb`9af1e3c5        7417                 je       SHORT G_M1624_IG07
                                                ;; bbWeight=8    PerfScore 24.00
G_M1624_IG04:              ;; offset=0027H
       00007ffb`9af1e3c7        8B4618               mov      eax, dword ptr [rsi+24]
       00007ffb`9af1e3ca        3BD0                 cmp      edx, eax
       00007ffb`9af1e3cc        741F                 je       SHORT G_M1624_IG09
                                                ;; bbWeight=4    PerfScore 13.00
G_M1624_IG05:              ;; offset=002EH
       00007ffb`9af1e3ce        3BD0                 cmp      edx, eax
       00007ffb`9af1e3d0        7E06                 jle      SHORT G_M1624_IG06
       00007ffb`9af1e3d2        488B7610             mov      rsi, gword ptr [rsi+16]
       00007ffb`9af1e3d6        EBE8                 jmp      SHORT G_M1624_IG03
                                                ;; bbWeight=4    PerfScore 21.00
G_M1624_IG06:              ;; offset=0038H
       00007ffb`9af1e3d8        488B7608             mov      rsi, gword ptr [rsi+8]
       00007ffb`9af1e3dc        EBE2                 jmp      SHORT G_M1624_IG03
```

cc: @adamsitnik , @AndyAyersMS

AndyAyersMS commented 3 years ago

Do we have an explanation for the regression we see here?

As for padding: for dictionaries, we actually don't expect lookups to iterate much, as that means there are hash collisions. So it's certainly possible that the cost of the padding (especially such a large amount as we see here) matters.

This might be a good case for the "minimal number of bundles" experiment; presumably the loop would still fit in two bundles without padding.

kunalspathak commented 3 years ago

I already had "minimum number of bundles" on, but the check that decides whether padding helps the loop used the wrong comparison (it needed <= instead of <). With that fixed, we see slightly better performance.

Method	Size	Mean	Error	StdDev	Median	Min	Max	Gen 0	Gen 1	Gen 2	Allocated
ImmutableDictionary	512	18.06 us	0.203 us	0.190 us	18.07 us	17.78 us	18.46 us	-	-	-	-
ImmutableDictionary	512	17.92 us	0.038 us	0.029 us	17.92 us	17.86 us	17.95 us	-	-	-	-
ImmutableDictionary	512	18.02 us	0.221 us	0.207 us	17.94 us	17.77 us	18.43 us	-	-	-	-
ImmutableDictionary	512	18.09 us	0.198 us	0.176 us	18.04 us	17.89 us	18.45 us	-	-	-	-

Assembly code of TryGetValue()

```asm
G_M1624_IG02:              ;; offset=0005H
       00007ffb`9844e385        488BF1               mov      rsi, rcx
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset < extraBytesNotInLoop) ~~~~~~~~~~~~~~~~~~~~~~
       00007ffb`9844e388                             align
       00007ffb`9844e388                             align
       00007ffb`9844e388                             align
                                                ;; bbWeight=1    PerfScore 1.00
G_M1624_IG03:              ;; offset=0008H
       00007ffb`9844e388        48837E0800           cmp      gword ptr [rsi+8], 0
       00007ffb`9844e38d        7417                 je       SHORT G_M1624_IG07
                                                ;; bbWeight=8    PerfScore 24.00
G_M1624_IG04:              ;; offset=000FH
       00007ffb`9844e38f        8B4618               mov      eax, dword ptr [rsi+24]
       00007ffb`9844e392        3BD0                 cmp      edx, eax
       00007ffb`9844e394        741F                 je       SHORT G_M1624_IG09
                                                ;; bbWeight=4    PerfScore 13.00
G_M1624_IG05:              ;; offset=0016H
       00007ffb`9844e396        3BD0                 cmp      edx, eax
       00007ffb`9844e398        7E06                 jle      SHORT G_M1624_IG06
       00007ffb`9844e39a        488B7610             mov      rsi, gword ptr [rsi+16]
       00007ffb`9844e39e        EBE8                 jmp      SHORT G_M1624_IG03
; =========================== 32B boundary ===========================
```
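As a rough cross-check of the "two bundles" remark, the sketch below counts how many 32-byte chunks the loop overlaps with and without padding. The start addresses are the IG03 addresses from the two listings in this thread. The 30-byte loop size is an assumption: it is measured from IG03 through the final `jmp` of IG06 in the earlier padded listing, and the JIT's own notion of loop size may differ. `chunks_touched` is a hypothetical helper, not a JIT function.

```python
CHUNK = 32  # size in bytes of the windows discussed in this thread


def chunks_touched(start, size, chunk=CHUNK):
    """Count the chunk-sized windows overlapped by the byte range [start, start + size)."""
    first = start // chunk
    last = (start + size - 1) // chunk
    return last - first + 1


# Assumption: loop runs from IG03 through the last jmp in IG06 (0x...3c0 to 0x...3dd).
LOOP_SIZE = 30

padded_start = 0x7FFB9AF1E3C0    # IG03 in the listing with padding
unpadded_start = 0x7FFB9844E388  # IG03 in the listing without padding

print(chunks_touched(padded_start, LOOP_SIZE))    # 1
print(chunks_touched(unpadded_start, LOOP_SIZE))  # 2
```

Under that assumption the padded loop sits entirely in one chunk, while the unpadded loop starts 8 bytes into a chunk and spills into the next one.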

Throughout the benchmark run, I logged the places where we insert larger amounts of alignment padding, since those might be causing some regressions. We can talk offline, but I am dumping the places that added alignment here.

alignment: Number of bytes of padding we added
loopsize: Size of the loop that we aligned
minBlocksNeeded: Minimum number of 32B chunks needed to fit the loop
extraBytesNotInLoop: Maximum offset from a 32B boundary at which a loop can start without us adding padding. If the loop starts beyond this offset, we add padding.
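The relationship between these four values can be sketched as follows. This is not the JIT's actual code: the function names are hypothetical, and the formulas are inferred from the definitions above and checked against the logged values in this comment (for example, loopsize 63 gives minBlocksNeeded 2 and extraBytesNotInLoop 1). The padding amount, "pad up to the next 32B boundary", is also an assumption, though it is consistent with every logged entry.

```python
CHUNK = 32  # bytes per alignment chunk


def min_blocks_needed(loop_size):
    """Fewest 32B chunks that can hold the loop (ceiling division)."""
    return -(-loop_size // CHUNK)


def extra_bytes_not_in_loop(loop_size):
    """Slack left over when the loop sits in its minimum number of chunks."""
    return min_blocks_needed(loop_size) * CHUNK - loop_size


def padding_for(loop_start, loop_size):
    """Bytes of padding to insert before a loop starting at loop_start.

    Assumption: if the loop starts within the slack, it already fits in the
    minimum number of chunks, so no padding is added; otherwise pad up to
    the next 32B boundary.
    """
    offset_in_chunk = loop_start % CHUNK
    if offset_in_chunk <= extra_bytes_not_in_loop(loop_size):
        return 0
    return CHUNK - offset_in_chunk


# Values taken from the first logged entry: loopsize=63, alignment=16.
print(min_blocks_needed(63), extra_bytes_not_in_loop(63), padding_for(16, 63))
```

Note the `<=` in the skip condition, which matches the corrected comparison mentioned at the top of this comment.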

Places that add >= 10 bytes of alignment in the benchmark run

```cmd
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 16 bytes, loopsize= 63 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 1 in (IndexOf)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 13 bytes, loopsize= 53 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 11 in (FindValue)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 16 bytes, loopsize= 30 bytes, minBlocksNeeded= 1, extraBytesNotInLoop= 2 in (IndexOf)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 29 bytes, loopsize= 95 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 1 in (CopyTo)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 12 bytes, loopsize= 83 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 13 in (TryGetValueInternal)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 14 bytes, loopsize= 81 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 15 in (UnionWith)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 11 bytes, loopsize= 54 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 10 in (Resize)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 15 bytes, loopsize= 48 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 16 in (ComputeKeys)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 13 bytes, loopsize= 47 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 17 in (ComputeKeys)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 19 bytes, loopsize= 90 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 6 in (ReportIfAny)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 10 bytes, loopsize= 49 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 15 in (ArrayOfUniqueValues)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 25 bytes, loopsize= 59 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 5 in (FindItemIndex)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 12 bytes, loopsize= 78 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 18 in (FindItemIndex)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 16 bytes, loopsize= 64 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 0 in (AddIfNotPresent)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 10 bytes, loopsize= 89 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 7 in (TryInsert)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 14 bytes, loopsize= 84 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 12 in (.ctor)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 10 bytes, loopsize= 54 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 10 in (CopyTo)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 13 bytes, loopsize= 54 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 10 in (CopyTo)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 22 bytes, loopsize= 94 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 2 in (AcquireLocks)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 12 bytes, loopsize= 28 bytes, minBlocksNeeded= 1, extraBytesNotInLoop= 4 in (ReleaseLocks)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 20 bytes, loopsize= 26 bytes, minBlocksNeeded= 1, extraBytesNotInLoop= 6 in (GetValueOrDefault)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 11 bytes, loopsize= 32 bytes, minBlocksNeeded= 1, extraBytesNotInLoop= 0 in (OverheadActionNoUnroll)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 11 bytes, loopsize= 32 bytes, minBlocksNeeded= 1, extraBytesNotInLoop= 0 in (WorkloadActionNoUnroll)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 13 bytes, loopsize= 55 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 9 in (Sum)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 29 bytes, loopsize= 95 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 1 in (SumWithoutOutliers)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 17 bytes, loopsize= 95 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 1 in (StudentTwoTail)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 18 bytes, loopsize= 91 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 5 in (Print)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 29 bytes, loopsize= 95 bytes, minBlocksNeeded= 3, extraBytesNotInLoop= 1 in (CopyTo)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 23 bytes, loopsize= 56 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 8 in (ToArray)
; ~~~~~~~~~~~~~~~~~~~~~~ alignment= 12 bytes, loopsize= 52 bytes, minBlocksNeeded= 2, extraBytesNotInLoop= 12 in (ComputeKeys)
```

Places where we skipped adding padding

```cmd
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (GetPointerToFirstInvalidChar)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (Resize)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (ToDictionary)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (TryInsert)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (SkipAndCount)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (MoveNext)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (Run)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (IndexOfAny)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (IndexOfAny)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (IndexOfAny)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (Contains)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (GetRootLength)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (AddRange)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (CalculateAllocationQuantumSize)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (AttachToOwner)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (.ctor)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (TryGetValueInternal)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (FillAllCharacteristicsCore)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (FillAllCharacteristicsCore)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (MoveNext)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (Add)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (ToArray)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (ComputeMap)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (ComputeKeys)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (TryAddInternal)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (ApplyCore)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (Remove)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (GetBestTimeUnit)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (AddIfNotPresent)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (Resize)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (CopyTo)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (ToArray)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (ToDictionary)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (Resize)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (.ctor)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (.ctor)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (MoveNext)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (AddIfNotPresent)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (.ctor)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (InitializeFromCollection)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (TryAddInternal)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (GrowTable)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (GrowTable)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (GrowTable)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (GrowTable)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (AddRange)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (AddRange)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (InOrderTreeWalk)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (InOrderTreeWalk)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (Log2)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (TryGetFirst)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (TryGetValue)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (RunSpecific)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (Run)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (CreateValue)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (RunAuto)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (VarianceWithoutOutliers)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (IntroSort)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (ToDictionary)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (totalCodeSize <= emitComp->compJitAlignLoopMaxCodeSize) in (ToDictionary)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (FindValue)
; ~~~~~~~~~~~~~~~~~~~~~~ Skipping because (currentOffset <= extraBytesNotInLoop) in (FindValue)
```

AndyAyersMS commented 3 years ago

I wonder if you're seeing the impact of the "branch splitting or ending at 32 byte boundary" issue in some of these (e.g. dotnet/runtime#13795). For instance, this jump now ends at 0x...A0 and so presumably gets penalized.

 00007ffb`9844e39e        EBE8                 jmp      SHORT G_M1624_IG03

Is that something you can track?
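One way to track this from a disassembly dump is plain address arithmetic. The sketch below is illustrative only: the helper names are hypothetical, and it checks just the two conditions named above (an instruction crossing a 32-byte boundary, or ending exactly on one). Which branch forms are actually penalized on a given CPU is a separate question covered by the linked issue.

```python
CHUNK = 32  # boundary size in bytes


def crosses_boundary(addr, size):
    """True if the instruction's first and last bytes fall in different 32B chunks."""
    return addr // CHUNK != (addr + size - 1) // CHUNK


def ends_on_boundary(addr, size):
    """True if the byte right after the instruction starts a new 32B chunk."""
    return (addr + size) % CHUNK == 0


# The 2-byte `EBE8 jmp SHORT G_M1624_IG03` from the unpadded listing.
unpadded_jmp = (0x7FFB9844E39E, 2)
# The same jump in the padded listing.
padded_jmp = (0x7FFB9AF1E3D6, 2)

for name, (addr, size) in (("unpadded", unpadded_jmp), ("padded", padded_jmp)):
    print(name, crosses_boundary(addr, size), ends_on_boundary(addr, size))
```

For the unpadded jump this reports that it does not cross a boundary but does end on one (0x...39e + 2 = 0x...3a0); the padded jump does neither.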

kunalspathak commented 3 years ago

Thanks for pointing me to it. I tried adding a check for JCC, but that leads the logic to add the extra padding that I showed earlier. So I think we need to evaluate whether avoiding the split branch is worth the extra padding. Do you know any way of measuring that apart from experimenting?

DrewScoggins / performance-2