Should we centralize the discussion here, instead of scattering it around the many PRs that have accumulated over the weeks?
Benchmark on patatrack02 T4 (without MPS), No Transfer, No Conversion:
- default configuration (which is not really apples to apples due to #369, and is optimized for Ideal conditions)
- no cluster shape cut
- optimized realistic (again not really apples to apples due to #369, see below)

For this last case the statistics are:
| Counters | nEvents | nHits | nCells | nTuples | nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells |
|---|---|---|---|---|---|---|---|---|---|---|
| reference | 5000 | 16482.7 | 94935.7 | 1105.0 | 337.7 | 1383.5 | 4.7 | 225.2 | 0.968 | 0.982 |
| development | 5000 | 16482.7 | 88107.2 | 650.3 | 327.9 | 1340.9 | 3.8 | 324.0 | 0.947 | 0.986 |
For the case with NO cluster shape cut (reference reports several overflows, development only 3 minor ones):
| Counters | nEvents | nHits | nCells | nTuples | nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells |
|---|---|---|---|---|---|---|---|---|---|---|
| reference | 5000 | 16482.7 | 108218.1 | 1285.0 | 349.1 | 1430.5 | 5.0 | 245.5 | 0.967 | 0.983 |
| development | 5000 | 16482.7 | 108284.3 | 782.9 | 356.8 | 1461.2 | 5.1 | 361.0 | 0.944 | 0.987 |
So yes, in development there are fewer doublets with the cluster shape cut (expected) and fewer tracks (due to the early fishbone?). The number of good tracks is similar (see MTV for the physics interpretation).
Running with `lateFishbone`, development's counters look like:

| Counters | nEvents | nHits | nCells | nTuples | nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells |
|---|---|---|---|---|---|---|---|---|---|---|
| development (lateFishbone) | 5000 | 16482.7 | 88147.5 | 1141.4 | 329.2 | 1346.1 | 3.8 | 216.1 | 0.947 | 0.981 |
and throughput is 819.9 ± 1.7 ev/s
OK.
I've rebased #362 on top of the current CMSSW_10_6_X_Patatrack branch so I can test the individual commits and check their individual impact on the throughput; I'm also running on patatrack02 in case the performance there is different from that on hilton.
@fwyzard I would not be surprised if the throughput regression came entirely from `theUsed`.
I would like to note that this is a "multi-purpose bit-field flag" that in the future can be used to mark a doublet passing certain cuts, to be applied later in the workflow (for instance, if one wants to optimize cuts depending on the number of hits and/or the topology, of course at some cost in throughput).
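As a rough illustration of the idea (names and bit widths below are my own, not the actual GPUCACell members):

```cpp
// Hypothetical sketch of a "multi-purpose bit-field flag" on a cell: one bit plays
// the role of theUsed, the others can record which optional cuts the doublet has
// passed, so later kernels can apply different requirements depending on nHits
// and/or topology.
#include <cstdint>

struct CellFlags {
  uint16_t used : 1;          // doublet already consumed by an ntuplet (as theUsed today)
  uint16_t passPtCut : 1;     // illustrative: a pT-dependent compatibility cut
  uint16_t passShapeCut : 1;  // illustrative: the cluster shape cut
  uint16_t reserved : 13;     // room for future per-topology cuts
};

__device__ inline bool acceptForQuads(CellFlags f) {
  // a downstream kernel could require a different combination of bits for triplets
  return !f.used && f.passPtCut && f.passShapeCut;
}
```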
Also on the to-do list is to test whether an SoA-like Cell would be more efficient (I started with the infamous memory optimization that caused random crashes before Christmas).
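For the SoA-like Cell idea, the layout would roughly change from the current array-of-structures to something like this (a sketch only; member names and sizes do not match the real GPUCACell):

```cpp
// Sketch of a structure-of-arrays cell container: every per-cell field becomes its
// own array, so threads reading the same field of consecutive cells get coalesced
// global-memory accesses instead of strided ones.
#include <cstdint>

struct CellsSoA {
  static constexpr uint32_t kMaxCells = 128 * 1024;  // illustrative capacity
  uint32_t innerHitId[kMaxCells];
  uint32_t outerHitId[kMaxCells];
  int16_t  layerPairId[kMaxCells];
  int32_t  doubletId[kMaxCells];
  uint8_t  flags[kMaxCells];  // e.g. the bit field sketched above
};
```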
I have removed `theUsed` and all references to it in #375. In this case `lateFishbone` is mandatory, as `earlyFishbone` is based on `theUsed`.
Throughput becomes 824.9 ± 3.1 ev/s (counters unaffected, as expected).
`theUsed` + `earlyFishbone` is faster.
OK, so NOT all of the regression with respect to reference comes from `theUsed`.
There are additional kernels due to triplets.
One can remove the last 4 kernels as well (see below); throughput then becomes 832.2 ± 2.9 ev/s (counters unaffected):
```diff
--- a/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu
+++ b/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu
@@ -264,6 +264,7 @@ void CAHitNtupletGeneratorKernelsGPU::classifyTuples(HitsOnCPU const &hh,
device_theCells_.get(), device_nCells_, tuples_d, tracks_d);
cudaCheck(cudaGetLastError());
+/*
// fill hit->track "map"
numberOfBlocks = (CAConstants::maxNumberOfQuadruplets() + blockSize - 1) / blockSize;
kernel_countHitInTracks<<<numberOfBlocks, blockSize, 0, cudaStream>>>(
@@ -280,6 +281,7 @@ void CAHitNtupletGeneratorKernelsGPU::classifyTuples(HitsOnCPU const &hh,
kernel_tripletCleaner<<<numberOfBlocks, blockSize, 0, cudaStream>>>(
hh.view(), tuples_d, tracks_d, quality_d, device_hitToTuple_.get());
cudaCheck(cudaGetLastError());
+*/
if (m_params.doStats_) {
// counters (add flag???)
```
If we do not run `earlyFishbone`, we can also remove the full check in `find_ntuplets`; throughput then becomes 838.8 ± 4.4 ev/s:
```diff
@@ -276,8 +276,8 @@ __global__ void kernel_find_ntuplets(GPUCACell::Hits const *__restrict__ hhp,
auto first = threadIdx.x + blockIdx.x * blockDim.x;
for (int idx = first, nt = (*nCells); idx<nt; idx += gridDim.x * blockDim.x) {
auto const &thisCell = cells[idx];
- if (thisCell.theDoubletId < 0 || thisCell.theUsed>1)
- continue;
+ // if (thisCell.theDoubletId < 0) // || thisCell.theUsed>1)
+ // continue;
// this stuff may need further optimization....
bool myStart[3];
```
And if I remove the `myStart` machinery needed for triplets and iterations, we arrive at a throughput of 856.7 ± 5.9 ev/s, which is very close to the reference.
So if we give up iterations and simplify the identification of the starting layer, we can indeed gain some speed without touching the rest of the logic.
the full "roll-back" patch is in /afs/cern.ch/user/i/innocent/public/quadOnly.patch it requires
```python
process.caHitNtupletCUDA.earlyFishbone = False
process.caHitNtupletCUDA.lateFishbone = True
```
and of course it works for quadruplets only.
OK, this will speed things up by 3%:
```diff
+ auto pid = thisCell.theLayerPairId;
+ auto doit = minHitsPerNtuplet>3 ? pid <3 : pid <8 || pid > 12;
```
BUT it will reconstruct only quads starting from BPIX1 (as in reference): the previous version was also reconstructing quadruplets from layer 2 in the pentuplet region... (either bug or feature: choose).
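For context, this is roughly where those two lines would sit inside the loop of `kernel_find_ntuplets` shown earlier (a sketch, assuming layer-pair ids below 3 are the doublets starting on BPIX1 and 8..12 are the pentuplet-region pairs):

```cpp
// Sketch: restrict which cells may start an ntuplet.
// For quadruplets (minHitsPerNtuplet > 3) only BPIX1-seeded pairs (pid < 3) start;
// for triplets the pentuplet-region pairs (8..12) are excluded as starters.
for (int idx = first, nt = (*nCells); idx < nt; idx += gridDim.x * blockDim.x) {
  auto const &thisCell = cells[idx];
  if (thisCell.theDoubletId < 0)
    continue;
  auto pid = thisCell.theLayerPairId;
  auto doit = minHitsPerNtuplet > 3 ? pid < 3 : pid < 8 || pid > 12;
  if (!doit)
    continue;
  // ... start the depth-first ntuplet search from this cell, as before ...
}
```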
With this configuration, the counters of #375 are:

| Counters | nEvents | nHits | nCells | nTuples | nFitTracks | nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 5000 | 16482.7 | 88147.5 | 595.8 | 514.3 | 316.6 | 1295.7 | 3.9 | 324.1 | 0.947 | 0.986 |
The following alternative patch, which preserves iterations, speeds up quads-only by ~1%:
```diff
diff --git a/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu b/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu
index f30ba9581f6..ac6a0762ca2 100644
--- a/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu
+++ b/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu
@@ -94,7 +94,7 @@ void CAHitNtupletGeneratorKernelsGPU::launchKernels(
device_hitTuple_apc_,
quality_d,
m_params.minHitsPerNtuplet_,
- m_params.doIterations_ ? startLayer : -1);
+ m_params.doIterations_ ? startLayer : ((m_params.minHitsPerNtuplet_>3) ? 0 : -1));
cudaCheck(cudaGetLastError());
if (m_params.doIterations_ || m_params.doStats_)
```
OK, so we have
I would like to do a comparison of their throughput vs the original one on hilton. Unfortunately something has been making most of my tests fail for the last couple of days, and I'm still investigating...
I have created a new branch, SOATV8, on top of #375, where the iterations support has been removed, quads are limited to start on BPIX1, and the last four kernels (hit-to-track association + triplet cleaner) are run only for triplets or if stats are required.
No regression observed for triplets.
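In practice this means the block that was commented out in the patch above is launched only when needed; a sketch of the guard (my wording of the condition, not the literal code in SOATV8):

```cpp
// Sketch: run the hit->track "map" filling and the triplet cleaner only when
// reconstructing triplets (minHitsPerNtuplet < 4) or when statistics are requested,
// instead of commenting the block out unconditionally.
bool needHitToTuple = (m_params.minHitsPerNtuplet_ < 4) || m_params.doStats_;
if (needHitToTuple) {
  // fill hit->track "map"
  numberOfBlocks = (CAConstants::maxNumberOfQuadruplets() + blockSize - 1) / blockSize;
  kernel_countHitInTracks<<<numberOfBlocks, blockSize, 0, cudaStream>>>(/* as before */);
  cudaCheck(cudaGetLastError());
  // ... fill the map, then run kernel_tripletCleaner as in the original block ...
}
```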
Please let me know if you prefer that I add those commits to #375 or open a new PR.
I would prefer to review these changes directly on top of #362, to address the performance issue there, before moving on to merge the following PRs.
However, if applying the changes to #362 is too complicated, just update #375: I can leave #362 on hold, review the following PRs, and merge them all after the full review.
@fwyzard apparently #362 merges; what I can do is apply the changes to #362 AND #375, and we can then play with merge and revert to move forward with the other PRs.
OK!
If I remove `earlyDuplicateRemover` from #375, I gain 20 ev/s for quads and 10 ev/s for triplets.
Final physics performance should not be affected (to be confirmed).
Let's see how to make it optional...
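One simple way would be a boolean in the kernel parameters guarding the launch, along these lines (`doEarlyDuplicateRemover_` is a hypothetical flag name, and the kernel call is only illustrative of the `earlyDuplicateRemover` step):

```cpp
// Sketch: make the early duplicate removal optional behind a hypothetical flag,
// so the ~20 ev/s (quads) / ~10 ev/s (triplets) can be recovered when it is off.
if (m_params.doEarlyDuplicateRemover_) {  // hypothetical parameter
  kernel_earlyDuplicateRemover<<<numberOfBlocks, blockSize, 0, cudaStream>>>(
      /* ... existing arguments ... */);
  cudaCheck(cudaGetLastError());
}
```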
To give a metric: using "less strict" cuts for jumping doublets, one loses 10% of the throughput.
- Implement full pixel triplets (#362) [included in and replaced by #375]
  - enable "jumping doublets" in the forward (just FPIX1, FPIX3): `process.pixelTracksHitQuadruplets.includeJumpingForwardDoublets = True`
  - (only for triplets) enable three iterations of `kernel_find_ntuplets`, starting from different layers and skipping doublets already used in the previous ones: `process.pixelTracksHitQuadruplets.doIterations = True`
- Implement a LegacyCluster to SoA@CPU converter for PixelRecHits (#371) [included in #372]
- Make the vertex producer run on both GPU and CPU (#372)
  - The GPU modules now do not `produce` any non-transient collection, so they cannot be scheduled by consuming their products from an `OutputModule`. So, the customisation to run only the reconstruction must be applied after the conversion to unscheduled:
- Implement the full Pixel SoA workflow on CPU (#375)