Closed jakobbotsch closed 2 weeks ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
TP regressions seem a little higher than expected (though the larger ones are MinOpts in collections that don't really have any MinOpts contexts). For libraries_tests.run
the detailed TP breakdown looks as follows for win-arm64:
Base: 687590947913, Diff: 688252896507, +0.0963%
372360965 : +12.49% : 20.23% : +0.0542% : public: bool __cdecl GenTree::gtHasReg(class Compiler *) const
239930066 : NA : 13.04% : +0.0349% : public: bool __cdecl GenTree::IsMultiRegCall(void) const
110847788 : +13.25% : 6.02% : +0.0161% : public: unsigned __int64 __cdecl GenTree::gtGetRegMask(void) const
101054636 : +3.16% : 5.49% : +0.0147% : protected: void __cdecl CodeGen::genProduceReg(struct GenTree *)
100502213 : NA : 5.46% : +0.0146% : public: void __cdecl NodeInternalRegisters::Add(struct GenTree *, unsigned __int64)
89077935 : +8.05% : 4.84% : +0.0130% : public: bool __cdecl GenTree::IsMultiRegNode(void) const
56210000 : NA : 3.05% : +0.0082% : class CodeGenInterface * __cdecl getCodeGenerator(class Compiler *)
42525916 : +12.28% : 2.31% : +0.0062% : private: void __cdecl LinearScan::BuildDefs(struct GenTree *, int, unsigned __int64)
22625186 : NA : 1.23% : +0.0033% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::Extract(struct GenTree *, unsigned __int64)
18567447 : +2.96% : 1.01% : +0.0027% : private: void __cdecl JitHashTable<struct CORINFO_FIELD_STRUCT_*, struct JitPtrKeyFuncs<struct CORINFO_FIELD_STRUCT_>, class FieldSeq, class CompAllocator, class JitHashTableBehavior>::Grow(void)
15514925 : +2.29% : 0.84% : +0.0023% : protected: void __cdecl CodeGen::genCallInstruction(struct GenTreeCall *)
13875031 : +2.47% : 0.75% : +0.0020% : private: void __cdecl emitter::emitInsLoadStoreOp(enum instruction, enum emitAttr, enum _regNumber_enum, struct GenTreeIndir *)
12892530 : +1.03% : 0.70% : +0.0019% : private: int __cdecl LinearScan::BuildCall(struct GenTreeCall *)
11865522 : +0.07% : 0.64% : +0.0017% : public: void * __cdecl ArenaAllocator::allocateMemory(unsigned __int64)
7973960 : NA : 0.43% : +0.0012% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::GetSingle(struct GenTree *, unsigned __int64)
6961843 : NA : 0.38% : +0.0010% : public: unsigned int __cdecl NodeInternalRegisters::Count(struct GenTree *, unsigned __int64)
6072251 : +55.03% : 0.33% : +0.0009% : public: unsigned int __cdecl GenTree::GetMultiRegCount(class Compiler *) const
4151149 : +15.57% : 0.23% : +0.0006% : protected: void __cdecl CodeGen::genMultiRegStoreToLocal(struct GenTreeLclVar *)
3725060 : +127.26% : 0.20% : +0.0005% : public: enum var_types __cdecl GenTree::GetRegTypeByIndex(int) const
3091240 : +0.60% : 0.17% : +0.0004% : private: void __cdecl Compiler::fgInvokeInlineeCompiler(struct GenTreeCall *, class InlineResult *, class InlineContext **)
2319037 : +0.61% : 0.13% : +0.0003% : public: struct Range __cdecl RangeCheck::ComputeRange(struct BasicBlock *, struct GenTree *, bool)
-3220492 : -10.27% : 0.17% : -0.0005% : protected: void __cdecl CodeGen::genCodeForCpBlkUnroll(struct GenTreeBlk *)
-13773686 : -0.38% : 0.75% : -0.0020% : public: void __cdecl LinearScan::resolveRegisters<1>(void)
-14291659 : -0.25% : 0.78% : -0.0021% : memset
-17480730 : -2.61% : 0.95% : -0.0025% : public: struct GenTree * __cdecl Compiler::gtCloneExpr(struct GenTree *)
-27597648 : -0.52% : 1.50% : -0.0040% : public: void __cdecl LinearScan::resolveRegisters<0>(void)
-38963750 : -4.31% : 2.12% : -0.0057% : public: void __cdecl Compiler::compInit(class ArenaAllocator *, struct CORINFO_METHOD_STRUCT_*, class ICorJitInfo *, struct CORINFO_METHOD_INFO *, struct InlineInfo *)
-47352683 : -2.20% : 2.57% : -0.0069% : public: void __cdecl LinearScan::buildIntervals<1>(void)
-50608682 : -5.57% : 2.75% : -0.0074% : public: void __cdecl LinearScan::buildIntervals<0>(void)
-56483652 : -1.51% : 3.07% : -0.0082% : private: class RefPosition * __cdecl LinearScan::BuildDef(struct GenTree *, unsigned __int64, int)
-315652337 : -99.64% : 17.15% : -0.0459% : public: bool __cdecl GenTreeCall::HasMultiRegRetVal(void) const
Seems mostly inlining related.
For libraries.crossgen2.windows.arm64
it's definitely more noticeable:
Base: 177447753443, Diff: 177891827092, +0.2503%
174306936 : NA : 24.22% : +0.0982% : public: void __cdecl NodeInternalRegisters::Add(struct GenTree *, unsigned __int64)
80864611 : +12.49% : 11.24% : +0.0456% : public: bool __cdecl GenTree::gtHasReg(class Compiler *) const
49816419 : NA : 6.92% : +0.0281% : public: bool __cdecl GenTree::IsMultiRegCall(void) const
47938235 : NA : 6.66% : +0.0270% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::GetSingle(struct GenTree *, unsigned __int64)
43693038 : +17.18% : 6.07% : +0.0246% : private: void __cdecl JitHashTable<struct CORINFO_FIELD_STRUCT_*, struct JitPtrKeyFuncs<struct CORINFO_FIELD_STRUCT_>, class FieldSeq, class CompAllocator, class JitHashTableBehavior>::Grow(void)
38175511 : +18.28% : 5.31% : +0.0215% : public: bool __cdecl GenTree::IsMultiRegNode(void) const
28826741 : +0.67% : 4.01% : +0.0162% : public: void * __cdecl ArenaAllocator::allocateMemory(unsigned __int64)
24704416 : NA : 3.43% : +0.0139% : class CodeGenInterface * __cdecl getCodeGenerator(class Compiler *)
21789774 : +13.51% : 3.03% : +0.0123% : public: unsigned __int64 __cdecl GenTree::gtGetRegMask(void) const
20644334 : +3.20% : 2.87% : +0.0116% : protected: void __cdecl CodeGen::genProduceReg(struct GenTree *)
12455570 : +1.02% : 1.73% : +0.0070% : memset
9069883 : +12.10% : 1.26% : +0.0051% : private: void __cdecl LinearScan::BuildDefs(struct GenTree *, int, unsigned __int64)
8769472 : +6.27% : 1.22% : +0.0049% : protected: void __cdecl CodeGen::genCallInstruction(struct GenTreeCall *)
3195549 : NA : 0.44% : +0.0018% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::Extract(struct GenTree *, unsigned __int64)
2973029 : +1.02% : 0.41% : +0.0017% : private: int __cdecl LinearScan::BuildCall(struct GenTreeCall *)
2551146 : +0.11% : 0.35% : +0.0014% : public: void __cdecl LinearScan::resolveRegisters<1>(void)
2071684 : +1087.25% : 0.29% : +0.0012% : public: unsigned int __cdecl GenTreeLclVarCommon::GetLclNum(void) const
1984988 : +2.00% : 0.28% : +0.0011% : private: void __cdecl emitter::emitInsLoadStoreOp(enum instruction, enum emitAttr, enum _regNumber_enum, struct GenTreeIndir *)
1177851 : +55.13% : 0.16% : +0.0007% : public: unsigned int __cdecl GenTree::GetMultiRegCount(class Compiler *) const
1036110 : +0.67% : 0.14% : +0.0006% : private: void __cdecl Compiler::fgInvokeInlineeCompiler(struct GenTreeCall *, class InlineResult *, class InlineContext **)
1024481 : NA : 0.14% : +0.0006% : public: unsigned int __cdecl NodeInternalRegisters::Count(struct GenTree *, unsigned __int64)
893343 : +15.27% : 0.12% : +0.0005% : protected: void __cdecl CodeGen::genMultiRegStoreToLocal(struct GenTreeLclVar *)
791266 : +127.25% : 0.11% : +0.0004% : public: enum var_types __cdecl GenTree::GetRegTypeByIndex(int) const
-2071684 : -100.00% : 0.29% : -0.0012% : public: virtual unsigned int __cdecl DefaultPolicy::EstimatedTotalILSize(void) const
-2965180 : -2.67% : 0.41% : -0.0017% : public: struct GenTree * __cdecl Compiler::gtCloneExpr(struct GenTree *)
-17124652 : -5.25% : 2.38% : -0.0097% : public: void __cdecl Compiler::compInit(class ArenaAllocator *, struct CORINFO_METHOD_STRUCT_*, class ICorJitInfo *, struct CORINFO_METHOD_INFO *, struct InlineInfo *)
-21372341 : -3.01% : 2.97% : -0.0120% : public: void __cdecl LinearScan::buildIntervals<1>(void)
-22153697 : -2.99% : 3.08% : -0.0125% : private: class RefPosition * __cdecl LinearScan::BuildDef(struct GenTree *, unsigned __int64, int)
-70576480 : -99.23% : 9.81% : -0.0398% : public: bool __cdecl GenTreeCall::HasMultiRegRetVal(void) const
I'll see if we can do something.
I got rid of an unnecessary hash map lookup in NodeInternalRegisters::Add
, but cost is still +0.22%. The node types we most often see with internal registers on arm64 are:
Node types with internal regs
CALL : 2953364
STORE_BLK : 237732
LEA : 8792
SWITCH_TABLE : 6838
CNS_DBL : 4548
PUTARG_STK : 4472
CNS_VEC : 2838
MUL : 608
LCLHEAP : 336
RETURN : 110
STOREIND : 78
IND : 58
XAND : 8
STORE_LCL_VAR : 6
INDEX_ADDR : 4
LCL_FLD : 2
GT_CALL
is because we always reserve an extra internal register for R2R calls on arm64 to load the call target:
https://github.com/dotnet/runtime/blob/55d2adab3aa7e3fb764a44bf5661cffd21936596/src/coreclr/jit/lsraarmarch.cpp#L193
I'm inclined to just have the backend always use LR
for this purpose since that register is being trashed by the call anyway.
Edit: Avoided this internal register by using LR instead, which had large TP improvements for crossgen2 arm64 compilations.
The remaining one is libraries_tests_no_tiered_compilation.run.windows.arm64.Release.mch and looks to be the same as above, i.e. primarily because of some different inlining decisions:
Base: 13520450455, Diff: 13548034810, +0.2040%
12877010 : +12.50% : 17.65% : +0.0952% : public: bool __cdecl GenTree::gtHasReg(class Compiler *) const
8785441 : NA : 12.04% : +0.0650% : public: bool __cdecl GenTree::IsMultiRegCall(void) const
5540860 : NA : 7.59% : +0.0410% : public: void __cdecl NodeInternalRegisters::Add(struct GenTree *, unsigned __int64)
3869963 : +12.68% : 5.30% : +0.0286% : public: unsigned __int64 __cdecl GenTree::gtGetRegMask(void) const
3404030 : +3.09% : 4.66% : +0.0252% : protected: void __cdecl CodeGen::genProduceReg(struct GenTree *)
3342806 : +7.92% : 4.58% : +0.0247% : public: bool __cdecl GenTree::IsMultiRegNode(void) const
2274456 : NA : 3.12% : +0.0168% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::Extract(struct GenTree *, unsigned __int64)
1903792 : NA : 2.61% : +0.0141% : class CodeGenInterface * __cdecl getCodeGenerator(class Compiler *)
1542740 : +12.58% : 2.11% : +0.0114% : private: void __cdecl LinearScan::BuildDefs(struct GenTree *, int, unsigned __int64)
1257943 : +27.27% : 1.72% : +0.0093% : private: void __cdecl JitHashTable<struct CORINFO_FIELD_STRUCT_*, struct JitPtrKeyFuncs<struct CORINFO_FIELD_STRUCT_>, class FieldSeq, class CompAllocator, class JitHashTableBehavior>::Grow(void)
948336 : +55.17% : 1.30% : +0.0070% : public: unsigned int __cdecl GenTree::GetMultiRegCount(class Compiler *) const
890189 : NA : 1.22% : +0.0066% : public: unsigned int __cdecl NodeInternalRegisters::Count(struct GenTree *, unsigned __int64)
739107 : +1.84% : 1.01% : +0.0055% : private: int __cdecl LinearScan::BuildCall(struct GenTreeCall *)
637912 : +0.17% : 0.87% : +0.0047% : public: void * __cdecl ArenaAllocator::allocateMemory(unsigned __int64)
612467 : +15.66% : 0.84% : +0.0045% : protected: void __cdecl CodeGen::genMultiRegStoreToLocal(struct GenTreeLclVar *)
553196 : +127.27% : 0.76% : +0.0041% : public: enum var_types __cdecl GenTree::GetRegTypeByIndex(int) const
349211 : +0.39% : 0.48% : +0.0026% : memset
332442 : +2.69% : 0.46% : +0.0025% : private: void __cdecl emitter::emitInsLoadStoreOp(enum instruction, enum emitAttr, enum _regNumber_enum, struct GenTreeIndir *)
141203 : +0.55% : 0.19% : +0.0010% : protected: void __cdecl CodeGen::genCallInstruction(struct GenTreeCall *)
134540 : NA : 0.18% : +0.0010% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::GetSingle(struct GenTree *, unsigned __int64)
-74232 : -2.79% : 0.10% : -0.0005% : public: struct GenTree * __cdecl Compiler::gtCloneExpr(struct GenTree *)
-77517 : -0.01% : 0.11% : -0.0006% : private: enum _regNumber_enum __cdecl LinearScan::allocateRegMinimal(class Interval *, class RefPosition *)
-79028 : -10.00% : 0.11% : -0.0006% : protected: void __cdecl CodeGen::genUnspillRegIfNeeded(struct GenTree *, unsigned int)
-158056 : -16.67% : 0.22% : -0.0012% : public: enum _regNumber_enum __cdecl GenTree::GetRegByIndex(int) const
-420583 : -10.34% : 0.58% : -0.0031% : protected: void __cdecl CodeGen::genCodeForCpBlkUnroll(struct GenTreeBlk *)
-1232798 : -0.44% : 1.69% : -0.0091% : public: void __cdecl LinearScan::resolveRegisters<0>(void)
-1319674 : -17.38% : 1.81% : -0.0098% : public: void __cdecl Compiler::compInit(class ArenaAllocator *, struct CORINFO_METHOD_STRUCT_*, class ICorJitInfo *, struct CORINFO_METHOD_INFO *, struct InlineInfo *)
-1519495 : -1.10% : 2.08% : -0.0112% : private: class RefPosition * __cdecl LinearScan::BuildDef(struct GenTree *, unsigned __int64, int)
-3536016 : -5.38% : 4.85% : -0.0262% : public: void __cdecl LinearScan::buildIntervals<0>(void)
-14124781 : -99.98% : 19.36% : -0.1045% : public: bool __cdecl GenTreeCall::HasMultiRegRetVal(void) const
cc @dotnet/jit-contrib PTAL @kunalspathak
This gets rid of GenTree::gtRsvdRegs
by moving internal registers to a side table. We generally use internal registers very rarely, so I think making the lookup more costly is worth the trade off (especially to make it easier to expand regMaskTP
to 16 bytes).
There was one exception where we used internal registers a lot, which was GT_CALL
for R2R codegen on arm64/arm32. For those nodes we always allocate an internal register to load the target into (the target is obtained by loading the R2R indirection cell that is passed in an argument register).
For arm64 it was simple to avoid this internal register: we can simply use LR always, since that register is going to be overwritten by the call anyway. This results in -2% TP for crossgen2 arm64 just from avoiding building this extra interval. This is also the cause of the asm diffs.
For arm32 the same strategy doesn't work as well because loading into LR is a 4 byte instruction while loading into other registers is a 2 byte instruction. So for arm32 I left it with the internal register and I propose we just take the throughput hit there (arm32 crossgen2 is probably not the most important target).
For the TP hit in some of the other collections see my comments above. Most of the hit comes from different inlining decisions that we expect to be significantly altered by native PGO anyway, or MinOpts in collections with very few MinOpts contexts that thus aren't very representative.
Paying a bit more to access these seems worth it when it leads to a 10% reduction in size of
GenTree
. It's very rare for an IR node to have any internal registers allocated.For arm64 this is the distribution of number of internal registers allocated for every
GenTree
after LSRA:Memory diff for arm64: https://www.diffchecker.com/fGt6Abyp/ (around 1.5% less memory used overall and makes it cheaper to expand to more than 64 registers in the future)
Contributes to #98258