dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.23k stars 4.45k forks source link

JIT: Move internal reserved registers to a side table #101647

Closed jakobbotsch closed 2 weeks ago

jakobbotsch commented 2 weeks ago

Paying a bit more to access these seems worth it when it leads to a 10% reduction in size of GenTree. It's very rare for an IR node to have any internal registers allocated.

For arm64 this is the distribution of number of internal registers allocated for every GenTree after LSRA:

Num internal temps
     <=          0 ===> 16256371 count ( 99% of total)
      1 ..       1 ===>   40149 count ( 99% of total)
      2 ..       2 ===>   11118 count ( 99% of total)
      3 ..       3 ===>   10886 count ( 99% of total)
      4 ..       4 ===>    2201 count ( 99% of total)
      5 ..       5 ===>    1194 count (100% of total)
      6 ..       6 ===>       0 count (100% of total)
      7 ..       7 ===>       0 count (100% of total)
      >          8 ===>       0 count (100% of total)

Memory diff for arm64: https://www.diffchecker.com/fGt6Abyp/ (around 1.5% less memory used overall and makes it cheaper to expand to more than 64 registers in the future)

Contributes to #98258

dotnet-policy-service[bot] commented 2 weeks ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

jakobbotsch commented 2 weeks ago

TP regressions seem a little higher than expected (though the larger ones are MinOpts in collections that don't really have any MinOpts contexts). For libraries_tests.run the detailed TP breakdown looks as follows for win-arm64:

Base: 687590947913, Diff: 688252896507, +0.0963%

372360965  : +12.49%  : 20.23% : +0.0542% : public: bool __cdecl GenTree::gtHasReg(class Compiler *) const                                                                                                                                     
239930066  : NA       : 13.04% : +0.0349% : public: bool __cdecl GenTree::IsMultiRegCall(void) const                                                                                                                                           
110847788  : +13.25%  : 6.02%  : +0.0161% : public: unsigned __int64 __cdecl GenTree::gtGetRegMask(void) const                                                                                                                                 
101054636  : +3.16%   : 5.49%  : +0.0147% : protected: void __cdecl CodeGen::genProduceReg(struct GenTree *)                                                                                                                                   
100502213  : NA       : 5.46%  : +0.0146% : public: void __cdecl NodeInternalRegisters::Add(struct GenTree *, unsigned __int64)                                                                                                                
89077935   : +8.05%   : 4.84%  : +0.0130% : public: bool __cdecl GenTree::IsMultiRegNode(void) const                                                                                                                                           
56210000   : NA       : 3.05%  : +0.0082% : class CodeGenInterface * __cdecl getCodeGenerator(class Compiler *)                                                                                                                                
42525916   : +12.28%  : 2.31%  : +0.0062% : private: void __cdecl LinearScan::BuildDefs(struct GenTree *, int, unsigned __int64)                                                                                                               
22625186   : NA       : 1.23%  : +0.0033% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::Extract(struct GenTree *, unsigned __int64)                                                                                            
18567447   : +2.96%   : 1.01%  : +0.0027% : private: void __cdecl JitHashTable<struct CORINFO_FIELD_STRUCT_*, struct JitPtrKeyFuncs<struct CORINFO_FIELD_STRUCT_>, class FieldSeq, class CompAllocator, class JitHashTableBehavior>::Grow(void)
15514925   : +2.29%   : 0.84%  : +0.0023% : protected: void __cdecl CodeGen::genCallInstruction(struct GenTreeCall *)                                                                                                                          
13875031   : +2.47%   : 0.75%  : +0.0020% : private: void __cdecl emitter::emitInsLoadStoreOp(enum instruction, enum emitAttr, enum _regNumber_enum, struct GenTreeIndir *)                                                                    
12892530   : +1.03%   : 0.70%  : +0.0019% : private: int __cdecl LinearScan::BuildCall(struct GenTreeCall *)                                                                                                                                   
11865522   : +0.07%   : 0.64%  : +0.0017% : public: void * __cdecl ArenaAllocator::allocateMemory(unsigned __int64)                                                                                                                            
7973960    : NA       : 0.43%  : +0.0012% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::GetSingle(struct GenTree *, unsigned __int64)                                                                                          
6961843    : NA       : 0.38%  : +0.0010% : public: unsigned int __cdecl NodeInternalRegisters::Count(struct GenTree *, unsigned __int64)                                                                                                      
6072251    : +55.03%  : 0.33%  : +0.0009% : public: unsigned int __cdecl GenTree::GetMultiRegCount(class Compiler *) const                                                                                                                     
4151149    : +15.57%  : 0.23%  : +0.0006% : protected: void __cdecl CodeGen::genMultiRegStoreToLocal(struct GenTreeLclVar *)                                                                                                                   
3725060    : +127.26% : 0.20%  : +0.0005% : public: enum var_types __cdecl GenTree::GetRegTypeByIndex(int) const                                                                                                                               
3091240    : +0.60%   : 0.17%  : +0.0004% : private: void __cdecl Compiler::fgInvokeInlineeCompiler(struct GenTreeCall *, class InlineResult *, class InlineContext **)                                                                        
2319037    : +0.61%   : 0.13%  : +0.0003% : public: struct Range __cdecl RangeCheck::ComputeRange(struct BasicBlock *, struct GenTree *, bool)                                                                                                 
-3220492   : -10.27%  : 0.17%  : -0.0005% : protected: void __cdecl CodeGen::genCodeForCpBlkUnroll(struct GenTreeBlk *)                                                                                                                        
-13773686  : -0.38%   : 0.75%  : -0.0020% : public: void __cdecl LinearScan::resolveRegisters<1>(void)                                                                                                                                         
-14291659  : -0.25%   : 0.78%  : -0.0021% : memset                                                                                                                                                                                             
-17480730  : -2.61%   : 0.95%  : -0.0025% : public: struct GenTree * __cdecl Compiler::gtCloneExpr(struct GenTree *)                                                                                                                           
-27597648  : -0.52%   : 1.50%  : -0.0040% : public: void __cdecl LinearScan::resolveRegisters<0>(void)                                                                                                                                         
-38963750  : -4.31%   : 2.12%  : -0.0057% : public: void __cdecl Compiler::compInit(class ArenaAllocator *, struct CORINFO_METHOD_STRUCT_*, class ICorJitInfo *, struct CORINFO_METHOD_INFO *, struct InlineInfo *)                            
-47352683  : -2.20%   : 2.57%  : -0.0069% : public: void __cdecl LinearScan::buildIntervals<1>(void)                                                                                                                                           
-50608682  : -5.57%   : 2.75%  : -0.0074% : public: void __cdecl LinearScan::buildIntervals<0>(void)                                                                                                                                           
-56483652  : -1.51%   : 3.07%  : -0.0082% : private: class RefPosition * __cdecl LinearScan::BuildDef(struct GenTree *, unsigned __int64, int)                                                                                                 
-315652337 : -99.64%  : 17.15% : -0.0459% : public: bool __cdecl GenTreeCall::HasMultiRegRetVal(void) const                                                                                                                                    

Seems mostly inlining related.

jakobbotsch commented 2 weeks ago

For libraries.crossgen2.windows.arm64 it's definitely more noticeable:

Base: 177447753443, Diff: 177891827092, +0.2503%

174306936 : NA        : 24.22% : +0.0982% : public: void __cdecl NodeInternalRegisters::Add(struct GenTree *, unsigned __int64)                                                                                                                
80864611  : +12.49%   : 11.24% : +0.0456% : public: bool __cdecl GenTree::gtHasReg(class Compiler *) const                                                                                                                                     
49816419  : NA        : 6.92%  : +0.0281% : public: bool __cdecl GenTree::IsMultiRegCall(void) const                                                                                                                                           
47938235  : NA        : 6.66%  : +0.0270% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::GetSingle(struct GenTree *, unsigned __int64)                                                                                          
43693038  : +17.18%   : 6.07%  : +0.0246% : private: void __cdecl JitHashTable<struct CORINFO_FIELD_STRUCT_*, struct JitPtrKeyFuncs<struct CORINFO_FIELD_STRUCT_>, class FieldSeq, class CompAllocator, class JitHashTableBehavior>::Grow(void)
38175511  : +18.28%   : 5.31%  : +0.0215% : public: bool __cdecl GenTree::IsMultiRegNode(void) const                                                                                                                                           
28826741  : +0.67%    : 4.01%  : +0.0162% : public: void * __cdecl ArenaAllocator::allocateMemory(unsigned __int64)                                                                                                                            
24704416  : NA        : 3.43%  : +0.0139% : class CodeGenInterface * __cdecl getCodeGenerator(class Compiler *)                                                                                                                                
21789774  : +13.51%   : 3.03%  : +0.0123% : public: unsigned __int64 __cdecl GenTree::gtGetRegMask(void) const                                                                                                                                 
20644334  : +3.20%    : 2.87%  : +0.0116% : protected: void __cdecl CodeGen::genProduceReg(struct GenTree *)                                                                                                                                   
12455570  : +1.02%    : 1.73%  : +0.0070% : memset                                                                                                                                                                                             
9069883   : +12.10%   : 1.26%  : +0.0051% : private: void __cdecl LinearScan::BuildDefs(struct GenTree *, int, unsigned __int64)                                                                                                               
8769472   : +6.27%    : 1.22%  : +0.0049% : protected: void __cdecl CodeGen::genCallInstruction(struct GenTreeCall *)                                                                                                                          
3195549   : NA        : 0.44%  : +0.0018% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::Extract(struct GenTree *, unsigned __int64)                                                                                            
2973029   : +1.02%    : 0.41%  : +0.0017% : private: int __cdecl LinearScan::BuildCall(struct GenTreeCall *)                                                                                                                                   
2551146   : +0.11%    : 0.35%  : +0.0014% : public: void __cdecl LinearScan::resolveRegisters<1>(void)                                                                                                                                         
2071684   : +1087.25% : 0.29%  : +0.0012% : public: unsigned int __cdecl GenTreeLclVarCommon::GetLclNum(void) const                                                                                                                            
1984988   : +2.00%    : 0.28%  : +0.0011% : private: void __cdecl emitter::emitInsLoadStoreOp(enum instruction, enum emitAttr, enum _regNumber_enum, struct GenTreeIndir *)                                                                    
1177851   : +55.13%   : 0.16%  : +0.0007% : public: unsigned int __cdecl GenTree::GetMultiRegCount(class Compiler *) const                                                                                                                     
1036110   : +0.67%    : 0.14%  : +0.0006% : private: void __cdecl Compiler::fgInvokeInlineeCompiler(struct GenTreeCall *, class InlineResult *, class InlineContext **)                                                                        
1024481   : NA        : 0.14%  : +0.0006% : public: unsigned int __cdecl NodeInternalRegisters::Count(struct GenTree *, unsigned __int64)                                                                                                      
893343    : +15.27%   : 0.12%  : +0.0005% : protected: void __cdecl CodeGen::genMultiRegStoreToLocal(struct GenTreeLclVar *)                                                                                                                   
791266    : +127.25%  : 0.11%  : +0.0004% : public: enum var_types __cdecl GenTree::GetRegTypeByIndex(int) const                                                                                                                               
-2071684  : -100.00%  : 0.29%  : -0.0012% : public: virtual unsigned int __cdecl DefaultPolicy::EstimatedTotalILSize(void) const                                                                                                               
-2965180  : -2.67%    : 0.41%  : -0.0017% : public: struct GenTree * __cdecl Compiler::gtCloneExpr(struct GenTree *)                                                                                                                           
-17124652 : -5.25%    : 2.38%  : -0.0097% : public: void __cdecl Compiler::compInit(class ArenaAllocator *, struct CORINFO_METHOD_STRUCT_*, class ICorJitInfo *, struct CORINFO_METHOD_INFO *, struct InlineInfo *)                            
-21372341 : -3.01%    : 2.97%  : -0.0120% : public: void __cdecl LinearScan::buildIntervals<1>(void)                                                                                                                                           
-22153697 : -2.99%    : 3.08%  : -0.0125% : private: class RefPosition * __cdecl LinearScan::BuildDef(struct GenTree *, unsigned __int64, int)                                                                                                 
-70576480 : -99.23%   : 9.81%  : -0.0398% : public: bool __cdecl GenTreeCall::HasMultiRegRetVal(void) const                                                                                                                                    

I'll see if we can do something.

jakobbotsch commented 2 weeks ago

I got rid of an unnecessary hash map lookup in NodeInternalRegisters::Add, but cost is still +0.22%. The node types we most often see with internal registers on arm64 are:

Node types with internal regs
CALL                 : 2953364
STORE_BLK            :  237732
LEA                  :    8792
SWITCH_TABLE         :    6838
CNS_DBL              :    4548
PUTARG_STK           :    4472
CNS_VEC              :    2838
MUL                  :     608
LCLHEAP              :     336
RETURN               :     110
STOREIND             :      78
IND                  :      58
XAND                 :       8
STORE_LCL_VAR        :       6
INDEX_ADDR           :       4
LCL_FLD              :       2

GT_CALL is because we always reserve an extra internal register for R2R calls on arm64 to load the call target: https://github.com/dotnet/runtime/blob/55d2adab3aa7e3fb764a44bf5661cffd21936596/src/coreclr/jit/lsraarmarch.cpp#L193

I'm inclined to just have the backend always use LR for this purpose since that register is being trashed by the call anyway.

Edit: Avoided this internal register by using LR instead, which had large TP improvements for crossgen2 arm64 compilations.

jakobbotsch commented 2 weeks ago

The remaining one is libraries_tests_no_tiered_compilation.run.windows.arm64.Release.mch and looks to be the same as above, i.e. primarily because of some different inlining decisions:

Base: 13520450455, Diff: 13548034810, +0.2040%

12877010  : +12.50%  : 17.65% : +0.0952% : public: bool __cdecl GenTree::gtHasReg(class Compiler *) const                                                                                                                                     
8785441   : NA       : 12.04% : +0.0650% : public: bool __cdecl GenTree::IsMultiRegCall(void) const                                                                                                                                           
5540860   : NA       : 7.59%  : +0.0410% : public: void __cdecl NodeInternalRegisters::Add(struct GenTree *, unsigned __int64)                                                                                                                
3869963   : +12.68%  : 5.30%  : +0.0286% : public: unsigned __int64 __cdecl GenTree::gtGetRegMask(void) const                                                                                                                                 
3404030   : +3.09%   : 4.66%  : +0.0252% : protected: void __cdecl CodeGen::genProduceReg(struct GenTree *)                                                                                                                                   
3342806   : +7.92%   : 4.58%  : +0.0247% : public: bool __cdecl GenTree::IsMultiRegNode(void) const                                                                                                                                           
2274456   : NA       : 3.12%  : +0.0168% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::Extract(struct GenTree *, unsigned __int64)                                                                                            
1903792   : NA       : 2.61%  : +0.0141% : class CodeGenInterface * __cdecl getCodeGenerator(class Compiler *)                                                                                                                                
1542740   : +12.58%  : 2.11%  : +0.0114% : private: void __cdecl LinearScan::BuildDefs(struct GenTree *, int, unsigned __int64)                                                                                                               
1257943   : +27.27%  : 1.72%  : +0.0093% : private: void __cdecl JitHashTable<struct CORINFO_FIELD_STRUCT_*, struct JitPtrKeyFuncs<struct CORINFO_FIELD_STRUCT_>, class FieldSeq, class CompAllocator, class JitHashTableBehavior>::Grow(void)
948336    : +55.17%  : 1.30%  : +0.0070% : public: unsigned int __cdecl GenTree::GetMultiRegCount(class Compiler *) const                                                                                                                     
890189    : NA       : 1.22%  : +0.0066% : public: unsigned int __cdecl NodeInternalRegisters::Count(struct GenTree *, unsigned __int64)                                                                                                      
739107    : +1.84%   : 1.01%  : +0.0055% : private: int __cdecl LinearScan::BuildCall(struct GenTreeCall *)                                                                                                                                   
637912    : +0.17%   : 0.87%  : +0.0047% : public: void * __cdecl ArenaAllocator::allocateMemory(unsigned __int64)                                                                                                                            
612467    : +15.66%  : 0.84%  : +0.0045% : protected: void __cdecl CodeGen::genMultiRegStoreToLocal(struct GenTreeLclVar *)                                                                                                                   
553196    : +127.27% : 0.76%  : +0.0041% : public: enum var_types __cdecl GenTree::GetRegTypeByIndex(int) const                                                                                                                               
349211    : +0.39%   : 0.48%  : +0.0026% : memset                                                                                                                                                                                             
332442    : +2.69%   : 0.46%  : +0.0025% : private: void __cdecl emitter::emitInsLoadStoreOp(enum instruction, enum emitAttr, enum _regNumber_enum, struct GenTreeIndir *)                                                                    
141203    : +0.55%   : 0.19%  : +0.0010% : protected: void __cdecl CodeGen::genCallInstruction(struct GenTreeCall *)                                                                                                                          
134540    : NA       : 0.18%  : +0.0010% : public: enum _regNumber_enum __cdecl NodeInternalRegisters::GetSingle(struct GenTree *, unsigned __int64)                                                                                          
-74232    : -2.79%   : 0.10%  : -0.0005% : public: struct GenTree * __cdecl Compiler::gtCloneExpr(struct GenTree *)                                                                                                                           
-77517    : -0.01%   : 0.11%  : -0.0006% : private: enum _regNumber_enum __cdecl LinearScan::allocateRegMinimal(class Interval *, class RefPosition *)                                                                                        
-79028    : -10.00%  : 0.11%  : -0.0006% : protected: void __cdecl CodeGen::genUnspillRegIfNeeded(struct GenTree *, unsigned int)                                                                                                             
-158056   : -16.67%  : 0.22%  : -0.0012% : public: enum _regNumber_enum __cdecl GenTree::GetRegByIndex(int) const                                                                                                                             
-420583   : -10.34%  : 0.58%  : -0.0031% : protected: void __cdecl CodeGen::genCodeForCpBlkUnroll(struct GenTreeBlk *)                                                                                                                        
-1232798  : -0.44%   : 1.69%  : -0.0091% : public: void __cdecl LinearScan::resolveRegisters<0>(void)                                                                                                                                         
-1319674  : -17.38%  : 1.81%  : -0.0098% : public: void __cdecl Compiler::compInit(class ArenaAllocator *, struct CORINFO_METHOD_STRUCT_*, class ICorJitInfo *, struct CORINFO_METHOD_INFO *, struct InlineInfo *)                            
-1519495  : -1.10%   : 2.08%  : -0.0112% : private: class RefPosition * __cdecl LinearScan::BuildDef(struct GenTree *, unsigned __int64, int)                                                                                                 
-3536016  : -5.38%   : 4.85%  : -0.0262% : public: void __cdecl LinearScan::buildIntervals<0>(void)                                                                                                                                           
-14124781 : -99.98%  : 19.36% : -0.1045% : public: bool __cdecl GenTreeCall::HasMultiRegRetVal(void) const                                                                                                                                    
jakobbotsch commented 2 weeks ago

cc @dotnet/jit-contrib PTAL @kunalspathak

Diffs

This gets rid of GenTree::gtRsvdRegs by moving internal registers to a side table. We generally use internal registers very rarely, so I think making the lookup more costly is worth the trade off (especially to make it easier to expand regMaskTP to 16 bytes). There was one exception where we used internal registers a lot, which was GT_CALL for R2R codegen on arm64/arm32. For those nodes we always allocate an internal register to load the target into (the target is obtained by loading the R2R indirection cell that is passed in an argument register). For arm64 it was simple to avoid this internal register: we can simply use LR always, since that register is going to be overwritten by the call anyway. This results in -2% TP for crossgen2 arm64 just from avoiding building this extra interval. This is also the cause of the asm diffs. For arm32 the same strategy doesn't work as well because loading into LR is a 4 byte instruction while loading into other registers is a 2 byte instruction. So for arm32 I left it with the internal register and I propose we just take the throughput hit there (arm32 crossgen2 is probably not the most important target).

For the TP hit in some of the other collections see my comments above. Most of the hit comes from different inlining decisions that we expect to be significantly altered by native PGO anyway, or MinOpts in collections with very few MinOpts contexts that thus aren't very representative.