Closed: mpirvu closed this issue 4 years ago
Statistics of small block sizes allocated by the persistent allocator at JITServer
Statistics on: Bytes requested per small alloc Num samples = 4360649 SUM=133188648.000000
MAX=863176.000000 MIN=8.000000 Mean=30.543309 StdDev=800.888920
--Bin-- --Value--
<8.000000 0.00% |
8.000000 0.05% |
16.000000 92.55% |*************************************
24.000000 2.07% |
32.000000 0.53% |
40.000000 0.31% |
48.000000 0.29% |
56.000000 1.26% |
64.000000 2.01% |
72.000000 0.00% |
80.000000 0.20% |
88.000000 0.00% |
96.000000 0.00% |
104.000000 0.00% |
112.000000 0.15% |
120.000000 0.00% |
128.000000 0.58% |
The vast majority of allocations are 16 bytes in size. Having dedicated linked lists for sizes up to 112 bytes (inclusive) captures 99.42% of all allocations. Still, 0.58% of the 4360649 allocations (~25K) are large allocations that will be placed on a single linked list.
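The dedicated-lists idea can be sketched as follows. This is an illustrative mock-up, not the actual JITServer persistent allocator: class names and the malloc fallback are assumptions, but it shows why small-block operations become O(1) pushes/pops on a per-size-class LIFO free list.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical sketch: one LIFO free list per 8-byte size class up to
// 112 bytes (capturing ~99.4% of allocations per the histogram above);
// anything larger would go to the single variable-size list.
class SmallBlockLists
   {
   public:
   static const size_t kGranule = 8;
   static const size_t kMaxSmall = 112; // 14 size classes: 8..112

   // Map a requested size to its size-class index, or -1 if "large".
   static int classIndex(size_t bytes)
      {
      if (bytes == 0 || bytes > kMaxSmall)
         return -1;
      return (int)((bytes + kGranule - 1) / kGranule) - 1;
      }

   void deallocate(void *block, size_t bytes)
      {
      int idx = classIndex(bytes);
      if (idx >= 0)
         _freeLists[idx].push_back(block); // O(1) push onto the free list
      // large blocks omitted in this sketch
      }

   void *allocate(size_t bytes)
      {
      int idx = classIndex(bytes);
      if (idx >= 0 && !_freeLists[idx].empty())
         {
         void *b = _freeLists[idx].back(); // O(1) pop from the free list
         _freeLists[idx].pop_back();
         return b;
         }
      return malloc(bytes); // stand-in for carving fresh persistent memory
      }

   private:
   std::vector<void*> _freeLists[kMaxSmall / kGranule];
   };
```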
Statistics of large block sizes allocated by the persistent allocator at JITServer
Statistics on: Bytes requested per large alloc Num samples = 4360649 SUM=133188648.000000
MAX=863176.000000 MIN=8.000000 Mean=30.543309 StdDev=800.888920
--Bin-- --Value--
<256.000000 99.55% |***************************************
256.000000 0.03% |
512.000000 0.02% |
768.000000 0.11% |
1024.000000 0.22% |
1280.000000 0.01% |
1536.000000 0.01% |
(all bins from 1792.000000 to 17664.000000: 0.00%)
17920.000000 0.01% |
(all bins from 18176.000000 to 25600.000000: 0.00%)
25856.000000 0.01% |
Most of the larger blocks are smaller than 1.5K, but a few of them are quite large.
I implemented solutions (1) and (2) from the issue description, but the scalability problem is still there. Since using malloc for the list of big blocks (the variable-block-size list) improves the situation considerably, we should transform that linked list of freed blocks into a search tree. Red-black trees are probably better than other balanced trees because they require fewer rebalancing operations for insert/delete (and insert/delete are the only operations we perform). A hash table may consume more memory because of the spine, which needs to grow dynamically.
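A tree-based replacement for the sorted list could look like the sketch below. It uses std::multimap, which is red-black-tree backed in common standard-library implementations, so insert, best-fit lookup, and erase are all O(log n) instead of the O(n) list traversal. The class and method names are illustrative, not the actual allocator API.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

// Sketch: large freed blocks keyed by size in a balanced tree.
class LargeBlockTree
   {
   public:
   void insertFreedBlock(size_t bytes, void *block)
      {
      _blocksBySize.insert(std::make_pair(bytes, block)); // O(log n)
      }

   // Best-fit: remove and return the smallest block whose size is
   // >= bytes, or NULL if no block is large enough.
   void *removeBestFit(size_t bytes)
      {
      std::multimap<size_t, void*>::iterator it = _blocksBySize.lower_bound(bytes);
      if (it == _blocksBySize.end())
         return NULL;
      void *block = it->second;
      _blocksBySize.erase(it); // O(log n)
      return block;
      }

   size_t numFreeBlocks() const { return _blocksBySize.size(); }

   private:
   std::multimap<size_t, void*> _blocksBySize;
   };
```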
Another idea to try before going to a BST is to group all blocks of the same size into the same node of the variable-size-block linked list. This could shorten that list substantially.
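The grouping idea can be sketched like this (illustrative names only, assuming one node per distinct size): the sorted list now has one node per distinct block size, and each node carries all freed blocks of that size, so traversal length is bounded by the number of distinct sizes rather than the number of freed blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// One node per distinct size; each node holds every freed block of that size.
struct SizeGroup
   {
   size_t size;
   std::vector<void*> blocks;
   };

class GroupedFreeList
   {
   public:
   // Insert a freed block, keeping the node list sorted by size.
   void deallocate(void *block, size_t bytes)
      {
      std::list<SizeGroup>::iterator it = _groups.begin();
      while (it != _groups.end() && it->size < bytes)
         ++it; // this traversal is what the histogram below measures
      if (it == _groups.end() || it->size != bytes)
         it = _groups.insert(it, SizeGroup{bytes, {}});
      it->blocks.push_back(block);
      }

   // First-fit over the (much shorter) list of size groups.
   void *allocate(size_t bytes)
      {
      for (std::list<SizeGroup>::iterator it = _groups.begin(); it != _groups.end(); ++it)
         {
         if (it->size >= bytes)
            {
            void *b = it->blocks.back();
            it->blocks.pop_back();
            if (it->blocks.empty())
               _groups.erase(it); // drop empty group nodes
            return b;
            }
         }
      return NULL; // caller falls back to fresh memory
      }

   size_t numGroups() const { return _groups.size(); }

   private:
   std::list<SizeGroup> _groups;
   };
```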
Implementing the idea above, I get the following stats for deallocations at JITServer after a full AcmeAir run. The bins represent the number of nodes we had to traverse in the list of variable-size blocks to find the proper insertion point:
Num samples = 30581 MAX=1223.000000 MIN=0.000000 Mean=48.843171 StdDev=121.005691
--Bin-- --Value--
<64.000000 75.57% |******************************
64.000000 16.54% |******
128.000000 1.89% |
192.000000 1.28% |
256.000000 0.95% |
320.000000 0.74% |
384.000000 0.67% |
448.000000 0.50% |
512.000000 0.41% |
576.000000 0.34% |
640.000000 0.25% |
704.000000 0.19% |
768.000000 0.21% |
832.000000 0.14% |
896.000000 0.11% |
960.000000 0.08% |
1024.000000 0.04% |
1088.000000 0.05% |
1152.000000 0.03% |
1216.000000 0.01% |
So sometimes we have to traverse more than 1000 elements of that list. I feel that a BST is needed here.
Instead of a BST I implemented an indexing scheme where we keep pointers to list nodes at power-of-two size values. Thus, if I search for a block of size 1500, I can start from the 1024 entry point rather than from the very beginning of the list. The following shows the same statistic for the number of nodes visited during deallocations. Note that the maximum value has dropped from 1223 to 317, so this is an improvement, though not as good as a BST would have achieved.
Num samples = 19061 MAX=317.000000 MIN=0.000000 Mean=24.488957 StdDev=35.868741
--Bin-- --Value--
<8.000000 57.50% |**********************
8.000000 3.53% |*
16.000000 2.91% |*
24.000000 2.02% |
32.000000 1.61% |
40.000000 1.39% |
48.000000 23.47% |*********
56.000000 1.15% |
64.000000 0.79% |
72.000000 0.64% |
80.000000 0.61% |
88.000000 0.60% |
96.000000 0.54% |
104.000000 0.47% |
112.000000 0.45% |
120.000000 0.35% |
128.000000 0.23% |
136.000000 0.14% |
144.000000 0.22% |
152.000000 0.19% |
160.000000 0.16% |
168.000000 0.14% |
176.000000 0.13% |
184.000000 0.10% |
192.000000 0.12% |
200.000000 0.11% |
208.000000 0.09% |
216.000000 0.09% |
224.000000 0.04% |
232.000000 0.04% |
240.000000 0.04% |
248.000000 0.15% |
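The power-of-two indexing scheme described above can be sketched as follows. Rather than literally keeping index pointers into one big sorted list, this mock-up (names are assumptions) partitions blocks into power-of-two buckets, each holding a size-sorted sub-structure, which is equivalent: a search for size 1500 starts at the 1024 bucket and never revisits smaller blocks.

```cpp
#include <cassert>
#include <cstddef>
#include <map>

class IndexedFreeList
   {
   public:
   static int bucketOf(size_t bytes) // floor(log2(bytes))
      {
      int b = 0;
      while (bytes >>= 1)
         ++b;
      return b;
      }

   void deallocate(void *block, size_t bytes)
      {
      _buckets[bucketOf(bytes)].insert(std::make_pair(bytes, block));
      }

   // Best-fit search starting at the bucket for `bytes`, then larger
   // buckets; blocks below the previous power of two are never visited.
   void *removeBestFit(size_t bytes)
      {
      for (int b = bucketOf(bytes); b < kNumBuckets; ++b)
         {
         std::multimap<size_t, void*> &bucket = _buckets[b];
         std::multimap<size_t, void*>::iterator it = bucket.lower_bound(bytes);
         if (it != bucket.end())
            {
            void *block = it->second;
            bucket.erase(it);
            return block;
            }
         }
      return NULL;
      }

   private:
   static const int kNumBuckets = 32;
   std::multimap<size_t, void*> _buckets[kNumBuckets];
   };
```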
A perf profile of the JITServer process with 8 JVM clients attached. This JVM uses the indexing scheme described above, so scalability is not too bad.
3.19% [.] omrthread_spinlock_acquire
1.37% [.] TR_RegionStructure::checkForInternalCycles
1.29% [.] TR::Region::allocate
0.68% [.] OMR::Node::getBlock
0.62% [.] TR_RegisterCandidate::setWeight
0.62% [.] JITServerPersistentCHTable::commitModifications
0.59% [.] OMR::Block::getLastRealTreeTop
0.58% [.] TR_RegionAnalysis::simpleIterator
0.58% [.] TR_RegisterCandidates::computeAvailableRegisters
0.55% [.] OMR::Node::getExtendedChild
0.55% [.] TR_ResolvedMethod::getRecognizedMethod
0.54% [.] OMR::CFG::findReachableBlocks
0.51% [.] OMR::ValuePropagation::mergeStoreRelationships
0.51% [.] __tls_get_addr
0.50% [.] TR_RegisterCandidate::processLiveOnEntryBlocks
0.49% [.] std::deque<std::pair<TR_StructureSubGraphNode*, bool>, TR::typed_allocator<std::pair<TR_StructureSubGraphNode*, bool>, TR::Region&> >::emplace_front<std::pair<TR_StructureSubGraphNode*, bool> >
0.49% [k] do_syscall_64
0.49% [.] OMR::Block::getNextBlock
0.46% [.] OMR::Node::initializeFutureUseCounts
0.44% [.] OMR::Node::hasRegLoadStoreSymbolReference
0.43% [.] OMR::LocalCSE::examineNode
0.43% [.] OMR::X86::Machine::findBestFreeGPRegister
0.42% [.] TR_HedgeTreeHandler<OMR::ValuePropagation::ValueConstraint>::findOrCreate
0.41% [.] TR_ValueNumberInfo::allocateValueNumber
0.41% [.] JITServerHelpers::getRemoteROMClassIfCached
0.40% [.] OMR::ValuePropagation::mergeEdgeConstraints
0.38% [.] __memset_avx2_unaligned_erms
0.37% [.] TR_UseDefInfo::findTrivialSymbolsToExclude
0.37% [.] TR_Dominators::dominates
0.36% [.] malloc
0.36% [.] OMR::ValuePropagation::mergeRelationships
0.34% [.] OMR::Node::getSymbol
0.34% [.] TR_UseDefInfo::indexSymbolsAndNodes
0.33% [.] OMR::Simplifier::cleanupFlags
0.33% [.] TR_BackwardDFSetAnalysis<TR_BitVector*>::analyzeNodeIfSuccessorsAnalyzed
0.32% [.] TR_UseDefInfo::buildUseDefs
0.32% [.] OMR::Node::getSymbolReference
0.31% [.] TR_UseDefInfo::prepareUseDefInfo
0.31% [.] OMR::Node::getStoreNode
0.30% [.] TR_IProfiler::getValueProfileInfo
0.30% [.] TR_ForwardDFSetAnalysis<TR_BitVector*>::analyzeNodeIfPredecessorsAnalyzed
0.30% [.] std::_Hashtable<ClassLoaderStringPair, std::pair<ClassLoaderStringPair const, TR_OpaqueClassBlock*>, TR::typed_allocator<std::pair<ClassLoaderStringPair const, TR_OpaqueClassBlock*>, J9::PersistentAllo
0.30% [.] OMR::ValuePropagation::addConstraintToList
0.29% [.] OMR::CodeGenerator::simulationPrePass
0.29% [.] TR_ValueNumberInfo::initializeNode
0.29% [.] JITServerHelpers::getAndCacheRAMClassInfo
0.29% [.] TR_X86RegisterDependencyGroup::assignRegisters
0.29% [.] OMR::Compilation::isPotentialOSRPoint
0.28% [.] __memmove_avx_unaligned_erms
0.27% [.] ClientSessionData::getCachedIProfilerInfo
0.27% [.] OMR::CFG::removeUnreachableBlocks
0.27% [.] TR::DeadTreesElimination::prePerformOnBlocks
0.26% [.] TR_UseDefInfo::findUseDefNodes
0.26% [.] TR_BasicDFSetAnalysis<TR_BitVector*>::initializeAnalysisInfo
0.26% [.] TR_BitVectorIterator::getNextBit
0.26% [.] TR_ReachingDefinitions::initializeGenAndKillSetInfoForNode
0.26% [.] TR_UseDefInfo::buildUseDefs
0.25% [.] TR::NodeChecklist::contains
Notable symbols:
3.19% [.] omrthread_spinlock_acquire
0.62% [.] JITServerPersistentCHTable::commitModifications
0.51% [.] __tls_get_addr
0.41% [.] JITServerHelpers::getRemoteROMClassIfCached
0.36% [.] malloc
This shows that we may have a lock contention problem, which could be due to locking the CHTable and/or locking the hashtable with j9classes.
Same profile sorted by dso:
82.41% libj9jit29.so
8.74% [kernel.kallsyms]
3.89% libj9thr29.so
2.15% libc-2.27.so
0.58% libpthread-2.27.so
0.51% ld-2.27.so
Experiments have shown that JITServer scalability is affected by the implementation of the persistent allocator. The current implementation uses 12 linked lists of freed blocks: 11 of them for small fixed-size blocks (8, 16, 24, ..., 88 bytes) and one for all larger blocks, sorted by size. When a client terminates, all its corresponding data structures (most of which are allocated with persistent memory) are freed, which creates huge linked lists of freed blocks. The problem is the list of larger blocks, because allocations and deallocations need to traverse it to find the appropriate spot for delete/add operations. Moreover, all allocations and deallocations are performed under a single lock, which can be held for a long time while traversing this long linked list. Experiments have also shown that replacing our persistent allocator with a simple malloc makes the scalability issue go away to a large extent.
There are a few possible solutions to alleviate this problem:
1) Use more dedicated lists for small, fixed-size blocks. These lists are fast to access because we always allocate/deallocate at the head of the list.
2) Use a dedicated lock per linked list. This will speed up allocations of small blocks because they will no longer be held up by an expensive large-block allocation.
3) Use a balanced binary search tree instead of a linked list for the set of large blocks.
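Solution (2) can be sketched like this, assuming std::mutex stands in for whatever monitor the real allocator uses (names are illustrative): each small-block free list gets its own lock, so a thread freeing a 16-byte block never waits behind a thread walking the large-block list.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

class LockedFreeLists
   {
   public:
   static const int kNumSmallLists = 11; // sizes 8, 16, ..., 88 as in the text

   void deallocateSmall(void *block, size_t bytes)
      {
      int idx = (int)(bytes / 8) - 1; // 8 -> 0, ..., 88 -> 10
      std::lock_guard<std::mutex> guard(_smallLocks[idx]);
      _smallLists[idx].push_back(block);
      }

   void *allocateSmall(size_t bytes)
      {
      int idx = (int)(bytes / 8) - 1;
      std::lock_guard<std::mutex> guard(_smallLocks[idx]); // only this size class is locked
      if (_smallLists[idx].empty())
         return NULL; // caller falls back to carving fresh memory
      void *b = _smallLists[idx].back();
      _smallLists[idx].pop_back();
      return b;
      }

   private:
   std::mutex _smallLocks[kNumSmallLists]; // one lock per list, not one global lock
   std::vector<void*> _smallLists[kNumSmallLists];
   };
```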