Should we centralize the discussion here, instead of scattering it around the many PRs that have accumulated over the weeks?
Benchmark on patatrack02 T4 (without MPS), No Transfer, No Conversion:
- default configuration (which is not really apples to apples due to #369, and is optimized for Ideal conditions)
- no cluster shape cut
- optimized realistic (again not really apples to apples due to #369, see below)

For this last case the statistics are:
| Counters | nEvents | nHits | nCells | nTuples | nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells |
|---|---|---|---|---|---|---|---|---|---|---|
| reference | 5000 | 16482.7 | 94935.7 | 1105.0 | 337.7 | 1383.5 | 4.7 | 225.2 | 0.968 | 0.982 |
| development | 5000 | 16482.7 | 88107.2 | 650.3 | 327.9 | 1340.9 | 3.8 | 324.0 | 0.947 | 0.986 |
For the case with NO cluster shape cut (reference reports several overflows, development only 3 minor ones):
| Counters | nEvents | nHits | nCells | nTuples | nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells |
|---|---|---|---|---|---|---|---|---|---|---|
| reference | 5000 | 16482.7 | 108218.1 | 1285.0 | 349.1 | 1430.5 | 5.0 | 245.5 | 0.967 | 0.983 |
| development | 5000 | 16482.7 | 108284.3 | 782.9 | 356.8 | 1461.2 | 5.1 | 361.0 | 0.944 | 0.987 |
So yes, in development there are fewer doublets with the cluster shape cut (expected) and fewer tracks (due to the early fishbone?). The number of good tracks is similar (see MTV for the physics interpretation).
Running with `lateFishbone`, development's counters look like:

| Counters | nEvents | nHits | nCells | nTuples | nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells |
|---|---|---|---|---|---|---|---|---|---|---|
| development (lateFishbone) | 5000 | 16482.7 | 88147.5 | 1141.4 | 329.2 | 1346.1 | 3.8 | 216.1 | 0.947 | 0.981 |
and throughput is 819.9 ± 1.7 ev/s
OK.
I've rebased #362 on top of the current CMSSW_10_6_X_Patatrack branch so I can test the individual commits and check their individual impact on the throughput; I'm also running on patatrack02 in case the performance there is different from that on hilton.
@fwyzard I would not be surprised if the throughput regression came entirely from `theUsed`.
I would like to note that this is a "multi-purpose bit-field flag" that in the future can be used to mark a doublet passing certain cuts, to be applied later in the workflow (for instance, if one wants to optimize cuts depending on the number of hits and/or the topology, of course at some cost in throughput).
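As a rough illustration of the idea (names and bit widths below are my own, not the actual GPUCACell members):

```cpp
// Hypothetical sketch of a "multi-purpose bit-field flag" on a cell: one bit plays
// the role of theUsed, the others can record which optional cuts the doublet has
// passed, so later kernels can apply different requirements depending on nHits
// and/or topology.
#include <cstdint>

struct CellFlags {
  uint16_t used : 1;          // doublet already consumed by an ntuplet (as theUsed today)
  uint16_t passPtCut : 1;     // illustrative: a pT-dependent compatibility cut
  uint16_t passShapeCut : 1;  // illustrative: the cluster shape cut
  uint16_t reserved : 13;     // room for future per-topology cuts
};

__device__ inline bool acceptForQuads(CellFlags f) {
  // a downstream kernel could require a different combination of bits for triplets
  return !f.used && f.passPtCut && f.passShapeCut;
}
```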
Also on the to-do list is to test whether an SoA-like Cell would be more efficient (I started with the infamous memory optimization that caused random crashes before Christmas).
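For the SoA-like Cell idea, the layout would roughly change from the current array-of-structures to something like this (a sketch only; member names and sizes do not match the real GPUCACell):

```cpp
// Sketch of a structure-of-arrays cell container: every per-cell field becomes its
// own array, so threads reading the same field of consecutive cells get coalesced
// global-memory accesses instead of strided ones.
#include <cstdint>

struct CellsSoA {
  static constexpr uint32_t kMaxCells = 128 * 1024;  // illustrative capacity
  uint32_t innerHitId[kMaxCells];
  uint32_t outerHitId[kMaxCells];
  int16_t  layerPairId[kMaxCells];
  int32_t  doubletId[kMaxCells];
  uint8_t  flags[kMaxCells];  // e.g. the bit field sketched above
};
```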
I have removed `theUsed` and all references to it in #375. In this case `lateFishbone` is mandatory, as `earlyFishbone` is based on `theUsed`.
Throughput becomes 824.9 ± 3.1 ev/s (counters unaffected, as expected).
`theUsed` + `earlyFishbone` is faster.
OK, so NOT all of the regression with respect to reference comes from `theUsed`.
There are additional kernels due to triplets.
One can remove the last 4 kernels as well (see below); throughput then becomes 832.2 ± 2.9 ev/s (counters unaffected):
```diff
--- a/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu
+++ b/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu
@@ -264,6 +264,7 @@ void CAHitNtupletGeneratorKernelsGPU::classifyTuples(HitsOnCPU const &hh,
device_theCells_.get(), device_nCells_, tuples_d, tracks_d);
cudaCheck(cudaGetLastError());
+/*
// fill hit->track "map"
numberOfBlocks = (CAConstants::maxNumberOfQuadruplets() + blockSize - 1) / blockSize;
kernel_countHitInTracks<<<numberOfBlocks, blockSize, 0, cudaStream>>>(
@@ -280,6 +281,7 @@ void CAHitNtupletGeneratorKernelsGPU::classifyTuples(HitsOnCPU const &hh,
kernel_tripletCleaner<<<numberOfBlocks, blockSize, 0, cudaStream>>>(
hh.view(), tuples_d, tracks_d, quality_d, device_hitToTuple_.get());
cudaCheck(cudaGetLastError());
+*/
if (m_params.doStats_) {
// counters (add flag???)
```
If we do not run `earlyFishbone`, we can also remove the full check in `find_ntuplets`; throughput then becomes 838.8 ± 4.4 ev/s:
```diff
@@ -276,8 +276,8 @@ __global__ void kernel_find_ntuplets(GPUCACell::Hits const *__restrict__ hhp,
auto first = threadIdx.x + blockIdx.x * blockDim.x;
for (int idx = first, nt = (*nCells); idx<nt; idx += gridDim.x * blockDim.x) {
auto const &thisCell = cells[idx];
- if (thisCell.theDoubletId < 0 || thisCell.theUsed>1)
- continue;
+ // if (thisCell.theDoubletId < 0) // || thisCell.theUsed>1)
+ // continue;
// this stuff may need further optimization....
bool myStart[3];
```
And if I remove the `myStart` machinery needed for triplets and iterations, we arrive at a throughput of 856.7 ± 5.9 ev/s, which is very close to the reference.
So if we give up iterations and simplify the identification of the starting layer, we can indeed gain some speed without touching the rest of the logic.
the full "roll-back" patch is in /afs/cern.ch/user/i/innocent/public/quadOnly.patch it requires
```python
process.caHitNtupletCUDA.earlyFishbone = False
process.caHitNtupletCUDA.lateFishbone = True
```
and of course it works for quadruplets only.
OK, this will speed things up by 3%:
```diff
+ auto pid = thisCell.theLayerPairId;
+ auto doit = minHitsPerNtuplet>3 ? pid <3 : pid <8 || pid > 12;
```
BUT it will reconstruct only quads starting from BPIX1 (as in reference): the previous version was also reconstructing quadruplets from layer 2 in the pentuplet region... (either bug or feature: choose).
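For context, this is roughly where those two lines would sit inside the loop of `kernel_find_ntuplets` shown earlier (a sketch, assuming layer-pair ids below 3 are the doublets starting on BPIX1 and 8..12 are the pentuplet-region pairs):

```cpp
// Sketch: restrict which cells may start an ntuplet.
// For quadruplets (minHitsPerNtuplet > 3) only BPIX1-seeded pairs (pid < 3) start;
// for triplets the pentuplet-region pairs (8..12) are excluded as starters.
for (int idx = first, nt = (*nCells); idx < nt; idx += gridDim.x * blockDim.x) {
  auto const &thisCell = cells[idx];
  if (thisCell.theDoubletId < 0)
    continue;
  auto pid = thisCell.theLayerPairId;
  auto doit = minHitsPerNtuplet > 3 ? pid < 3 : pid < 8 || pid > 12;
  if (!doit)
    continue;
  // ... start the depth-first ntuplet search from this cell, as before ...
}
```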
With this configuration, the counters of #375 are:

| Counters | nEvents | nHits | nCells | nTuples | nFitTracks | nGoodTracks | nUsedHits | nDupHits | nKilledCells | nEmptyCells | nZeroTrackCells |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | 5000 | 16482.7 | 88147.5 | 595.8 | 514.3 | 316.6 | 1295.7 | 3.9 | 324.1 | 0.947 | 0.986 |
The following alternative patch, which preserves iterations, speeds up quads-only by ~1%:
```diff
diff --git a/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu b/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu
index f30ba9581f6..ac6a0762ca2 100644
--- a/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu
+++ b/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cu
@@ -94,7 +94,7 @@ void CAHitNtupletGeneratorKernelsGPU::launchKernels(
device_hitTuple_apc_,
quality_d,
m_params.minHitsPerNtuplet_,
- m_params.doIterations_ ? startLayer : -1);
+ m_params.doIterations_ ? startLayer : ((m_params.minHitsPerNtuplet_>3) ? 0 : -1));
cudaCheck(cudaGetLastError());
if (m_params.doIterations_ || m_params.doStats_)
```
OK, so we have
I would like to do a comparison of their throughput vs the original one on hilton. Unfortunately something has been making most of my tests fail for the last couple of days, and I'm still investigating...
I have created a new branch, SOATV8, on top of #375, where the iterations support has been removed, quads are limited to start on BPIX1, and the last four kernels (hit-to-track association + triplet cleaner) are run only for triplets or if stats are required.
No regression observed for triplets.
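In practice this means the block that was commented out in the patch above is launched only when needed; a sketch of the guard (my wording of the condition, not the literal code in SOATV8):

```cpp
// Sketch: run the hit->track "map" filling and the triplet cleaner only when
// reconstructing triplets (minHitsPerNtuplet < 4) or when statistics are requested,
// instead of commenting the block out unconditionally.
bool needHitToTuple = (m_params.minHitsPerNtuplet_ < 4) || m_params.doStats_;
if (needHitToTuple) {
  // fill hit->track "map"
  numberOfBlocks = (CAConstants::maxNumberOfQuadruplets() + blockSize - 1) / blockSize;
  kernel_countHitInTracks<<<numberOfBlocks, blockSize, 0, cudaStream>>>(/* as before */);
  cudaCheck(cudaGetLastError());
  // ... fill the map, then run kernel_tripletCleaner as in the original block ...
}
```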
Please let me know if you prefer that I add those commits to #375 or open a new PR.
I would prefer to review these changes directly on top of #362, to address the performance issue there, before moving on to merge the following PRs.
However, if applying the changes to #362 is too complicated, just update #375: I can leave #362 on hold, review the following PRs, and merge them all after the full review.
@fwyzard apparently #362 merges; what I can do is apply the changes to #362 AND #375, and we can then play with merge and revert to move forward with the other PRs.
OK!
If I remove `earlyDuplicateRemover` from #375, I gain 20 ev/s for quads and 10 ev/s for triplets.
Final physics performance should not be affected (to be confirmed).
Let's see how to make it optional...
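One simple way would be a boolean in the kernel parameters guarding the launch, along these lines (`doEarlyDuplicateRemover_` is a hypothetical flag name, and the kernel call is only illustrative of the `earlyDuplicateRemover` step):

```cpp
// Sketch: make the early duplicate removal optional behind a hypothetical flag,
// so the ~20 ev/s (quads) / ~10 ev/s (triplets) can be recovered when it is off.
if (m_params.doEarlyDuplicateRemover_) {  // hypothetical parameter
  kernel_earlyDuplicateRemover<<<numberOfBlocks, blockSize, 0, cudaStream>>>(
      /* ... existing arguments ... */);
  cudaCheck(cudaGetLastError());
}
```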
To give a metric: using "less strict" cuts for jumping doublets, one loses 10% of the throughput.
- Implement full pixel triplets (#362) [included in and replaced by #375]
  - enable "jumping doublets" in the forward (just FPIX1, FPIX3): `process.pixelTracksHitQuadruplets.includeJumpingForwardDoublets = True`
  - (only for triplets) enable three iterations of `kernel_find_ntuplets`, starting from different layers and skipping doublets already used in the previous ones: `process.pixelTracksHitQuadruplets.doIterations = True`
- Implement a LegacyCluster to SoA@CPU converter for PixelRecHits (#371) [included in #372]
- Make the vertex producer run on both GPU and CPU (#372)
  - The GPU modules now do not `produce` any non-transient collection, so they cannot be scheduled by consuming their products from an `OutputModule`. So, the customisation to run only the reconstruction must be applied after the conversion to unscheduled:
- Implement the full Pixel SoA workflow on CPU (#375)