@VinInn could you have a look ?
With pre7 I cannot reproduce on patatrack02. Btw: the job is fully sequential, the threads are sitting on futex wait all the time. I got this from time to time:
[2020-10-25 12:08:17.011886 +0100][Error ][PostMaster ] [cmsxrootd-site1.fnal.gov:1093 #0] Forcing error on disconnect: [ERROR] Operation interrupted.
and it takes forever.
With pre8 it doesn't crash either...
I do not understand: the file is served from Italy (root://xrootd-cms.infn.it//store/relval/CMSSW_11_2_0_pre7/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/PU_112X_mcRun3_2021_realistic_v8-v1/20000/8BD20F29-96F9-7C44-9078-E641186F0B19.root), so why a fully reproducible error from Fermilab?
[2020-10-25 15:34:58.032207 +0100][Error ][PostMaster ] [cmsxrootd-site2.fnal.gov:1093 #0] Forcing error on disconnect: [ERROR] Operation interrupted.
This seems frequently reproducible with pre10:
cmsrel CMSSW_11_2_0_pre10_Patatrack
cd CMSSW_11_2_0_pre10_Patatrack
cmsenv
xrdcp root://cmsxrootd.fnal.gov//store/relval/CMSSW_11_2_0_pre9/RelValTTbar_14TeV/GEN-SIM-DIGI-RAW/PU_112X_mcRun3_2021_realistic_v11-v1/00000/f6888c6e-1fe6-413e-b2b5-e54ff3a4fe2b.root .
cmsDriver.py step3 \
--geometry DB:Extended \
--era Run3 \
--conditions auto:phase1_2021_realistic \
-s RAW2DIGI:RawToDigi_pixelOnly,RECO:reconstruction_pixelTrackingOnly,VALIDATION:@pixelTrackingOnlyValidation,DQM:@pixelTrackingOnlyDQM \
-n 100 \
--filein file:f6888c6e-1fe6-413e-b2b5-e54ff3a4fe2b.root \
--eventcontent RECOSIM,DQM \
--datatier GEN-SIM-RECO,DQMIO \
--customise RecoPixelVertexing/Configuration/customizePixelTracksSoAonCPU.customizePixelTracksSoAonCPU,RecoPixelVertexing/Configuration/customizePixelTracksSoAonCPU.customizePixelTracksForTriplets \
--nThreads 8 \
--no_exec
cmsRun step3_RAW2DIGI_RECO_VALIDATION_DQM.py
results in
%MSG-i ThreadStreamSetup: (NoModuleName) 28-Nov-2020 17:40:25 CET pre-events
setting # threads 8
setting # streams 8
%MSG
28-Nov-2020 17:40:34 CET Initiating request to open file file:f6888c6e-1fe6-413e-b2b5-e54ff3a4fe2b.root
28-Nov-2020 17:40:40 CET Successfully opened file file:f6888c6e-1fe6-413e-b2b5-e54ff3a4fe2b.root
...
Begin processing the 1st record. Run 1, Event 7405, LumiSection 75 on stream 4 at 28-Nov-2020 17:41:01.137 CET
Begin processing the 2nd record. Run 1, Event 7402, LumiSection 75 on stream 3 at 28-Nov-2020 17:41:01.160 CET
Begin processing the 3rd record. Run 1, Event 7404, LumiSection 75 on stream 2 at 28-Nov-2020 17:41:01.162 CET
Begin processing the 4th record. Run 1, Event 7407, LumiSection 75 on stream 5 at 28-Nov-2020 17:41:01.164 CET
Begin processing the 5th record. Run 1, Event 7403, LumiSection 75 on stream 7 at 28-Nov-2020 17:41:01.167 CET
Begin processing the 6th record. Run 1, Event 7408, LumiSection 75 on stream 1 at 28-Nov-2020 17:41:01.169 CET
Begin processing the 7th record. Run 1, Event 7401, LumiSection 75 on stream 0 at 28-Nov-2020 17:41:01.172 CET
Begin processing the 8th record. Run 1, Event 7406, LumiSection 75 on stream 6 at 28-Nov-2020 17:41:01.174 CET
A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.
The full report is attached: crash.log.
This is suspicious (TLS):
#4 0x00007f9b5301d947 in _dl_update_slotinfo () from /lib64/ld-linux-x86-64.so.2
#5 0x00007f9b5300c098 in update_get_addr () from /lib64/ld-linux-x86-64.so.2
#6 0x00007f9b530229f8 in __tls_get_addr () from /lib64/ld-linux-x86-64.so.2
Need to verify if it shows up consistently...
Confirmed (only when running from a local file; reading from xrootd does not crash). In my case no TLS in the stack.
Minimal reproducer recompiled with -g: ran it 5 times under gdb, no crash; managed to crash it only once w/o gdb...
eventually crashed
Thread 3 "cmsRun" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fff879ff700 (LWP 219860)]
GPUCACell::init (outerHitId=4014, innerHitId=32, doubletId=0, layerPairId=0, hh=..., cellTracks=..., cellNeighbors=..., this=0x7fff35b02780)
at /home/innocent/ORI/CMSSW_11_2_0_pre10_Patatrack/src/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h:107
107 __device__ __forceinline__ CellNeighbors& outerNeighbors() { return *theOuterNeighbors; }
(gdb) where
#0 GPUCACell::init (outerHitId=4014, innerHitId=32, doubletId=0, layerPairId=0, hh=..., cellTracks=..., cellNeighbors=..., this=0x7fff35b02780)
at /home/innocent/ORI/CMSSW_11_2_0_pre10_Patatrack/src/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h:107
#1 gpuPixelDoublets::doubletsFromHisto (maxNumOfDoublets=524288, doPtCut=true, doZ0Cut=true, doClusterCut=true, ideal_cond=true, maxr=0x7fff7921fce0 <gpuPixelDoublets::maxr>,
maxz=0x7fff7921fd40 <gpuPixelDoublets::maxz>, minz=0x7fff7921fda0 <gpuPixelDoublets::minz>, phicuts=0x7fff7921fe00 <gpuPixelDoublets::phicuts>, isOuterHitOfCell=0x7fff9e603880, hh=...,
cellTracks=0x7fff85f49880, cellNeighbors=0x7fff85f49870, nCells=0x7fff85ff9950, cells=0x7fff35b02780, nPairs=<optimized out>, layerPairs=0x7fff7921fe40 <gpuPixelDoublets::layerPairs> "")
at /home/innocent/ORI/CMSSW_11_2_0_pre10_Patatrack/src/RecoPixelVertexing/PixelTriplets/plugins/gpuPixelDoubletsAlgos.h:226
#2 gpuPixelDoublets::getDoubletsFromHisto (maxNumOfDoublets=524288, doPtCut=true, doZ0Cut=true, doClusterCut=true, ideal_cond=true, nActualPairs=<optimized out>, isOuterHitOfCell=0x7fff9e603880,
hhp=0x7fff5d09d360, cellTracks=0x7fff85f49880, cellNeighbors=0x7fff85f49870, nCells=0x7fff85ff9950, cells=0x7fff35b02780)
at /home/innocent/ORI/CMSSW_11_2_0_pre10_Patatrack/src/RecoPixelVertexing/PixelTriplets/plugins/gpuPixelDoublets.h:109
#3 CAHitNtupletGeneratorKernels<cms::cudacompat::CPUTraits>::buildDoublets (this=this@entry=0x7fff879f7900, hh=..., stream=stream@entry=0x0)
at /home/innocent/ORI/CMSSW_11_2_0_pre10_Patatrack/src/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cc:56
#4 0x00007fff791c6682 in CAHitNtupletGeneratorOnGPU::makeTuples (this=this@entry=0x7fffb9b9b020, hits_d=..., bfield=0.0114256972)
at /home/innocent/ORI/CMSSW_11_2_0_pre10_Patatrack/src/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorOnGPU.cc:209
#5 0x00007fff791b3c7e in CAHitNtupletCUDA::produce (this=0x7fffb9b9b000, streamID=..., iEvent=..., es=...)
Will add an assert.
cmsRun: /home/innocent/ORI/CMSSW_11_2_0_pre10_Patatrack/src/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h:61: void GPUCACell::init(GPUCACell::CellNeighborsVector&, GPUCACell::CellTracksVector&, const Hits&, int, int, GPUCACell::hindex_type, GPUCACell::hindex_type): Assertion `theOuterNeighbors' failed.
the arrays are not initialized on CPU?
It is a cudaCompat issue: Assertion `0==blockIdx.x*blockDim.x + threadIdx.x' failed, i.e. the emulated global thread index is not 0 when a CPU "kernel" starts.
Fixed. In principle one needs to "resetGrid" before each kernel; in practice only where needed (if one knows what one is doing and in which order the kernels are called). The latter is more difficult to control....
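To make the failure mode concrete, a small sketch (driverA, kernelA and kernelB are hypothetical; blockIdx/gridDim are the thread_local emulation from cudaCompat):

#include <cassert>
// Hypothetical sketch of the failure mode with the thread_local grid emulation.
using namespace cms::cudacompat;

void kernelA();  // some per-module "kernel" that indexes its data by blockIdx.x

void driverA(int nModules) {
  gridDim.x = nModules;  // emulate one block per module
  for (blockIdx.x = 0; blockIdx.x < gridDim.x; ++blockIdx.x)
    kernelA();
  // missing resetGrid() here: blockIdx.x is left equal to nModules
}

void kernelB() {
  // the next kernel run on the same thread then fails exactly this check:
  assert(0 == blockIdx.x * blockDim.x + threadIdx.x);
}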
This is the patch with all the asserts in place:
[innocent@patatrack02 src]$ git diff
diff --git a/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cc b/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cc
index 1646cb503ff..7a55e73ecd1 100644
--- a/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cc
+++ b/RecoPixelVertexing/PixelTriplets/plugins/CAHitNtupletGeneratorKernels.cc
@@ -12,6 +12,8 @@ void CAHitNtupletGeneratorKernelsCPU::fillHitDetIndices(HitsView const *hv, TkSo
template <>
void CAHitNtupletGeneratorKernelsCPU::buildDoublets(HitsOnCPU const &hh, cudaStream_t stream) {
+
+ resetGrid();
auto nhits = hh.nHits();
#ifdef NTUPLE_DEBUG
@@ -31,7 +33,7 @@ void CAHitNtupletGeneratorKernelsCPU::buildDoublets(HitsOnCPU const &hh, cudaStr
device_theCellTracksContainer_ =
(GPUCACell::CellTracks *)(cellStorage_.get() +
CAConstants::maxNumOfActiveDoublets() * sizeof(GPUCACell::CellNeighbors));
-
+ assert(0==blockIdx.x*blockDim.x + threadIdx.x);
gpuPixelDoublets::initDoublets(device_isOuterHitOfCell_.get(),
nhits,
device_theCellNeighbors_.get(),
@@ -39,6 +41,7 @@ void CAHitNtupletGeneratorKernelsCPU::buildDoublets(HitsOnCPU const &hh, cudaStr
device_theCellTracks_.get(),
device_theCellTracksContainer_);
+ assert(!(*device_theCellNeighbors_).empty());
// device_theCells_ = Traits:: template make_unique<GPUCACell[]>(cs, m_params.maxNumberOfDoublets_, stream);
device_theCells_.reset((GPUCACell *)malloc(sizeof(GPUCACell) * m_params.maxNumberOfDoublets_));
if (0 == nhits)
diff --git a/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h b/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
index e913b77fe09..f3b33bf47d6 100644
--- a/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
+++ b/RecoPixelVertexing/PixelTriplets/plugins/GPUCACell.h
@@ -58,7 +58,9 @@ public:
// link to default empty
theOuterNeighbors = &cellNeighbors[0];
+ assert(theOuterNeighbors);
theTracks = &cellTracks[0];
+ assert(theTracks);
assert(outerNeighbors().empty());
assert(tracks().empty());
}
@@ -76,6 +78,7 @@ public:
(ptrAsInt)(&cellNeighbors[i])); // if fails we cannot give "i" back...
#else
theOuterNeighbors = &cellNeighbors[i];
+ assert(theOuterNeighbors);
#endif
} else
return -1;
@@ -94,6 +97,7 @@ public:
atomicCAS((ptrAsInt*)(&theTracks), zero, (ptrAsInt)(&cellTracks[i])); // if fails we cannot give "i" back...
#else
theTracks = &cellTracks[i];
+ assert(theTracks);
#endif
} else
return -1;
@@ -102,10 +106,10 @@ public:
return tracks().push_back(t);
}
- __device__ __forceinline__ CellTracks& tracks() { return *theTracks; }
- __device__ __forceinline__ CellTracks const& tracks() const { return *theTracks; }
- __device__ __forceinline__ CellNeighbors& outerNeighbors() { return *theOuterNeighbors; }
- __device__ __forceinline__ CellNeighbors const& outerNeighbors() const { return *theOuterNeighbors; }
+ __device__ __forceinline__ CellTracks& tracks() { assert(theTracks); return *theTracks; }
+ __device__ __forceinline__ CellTracks const& tracks() const { assert(theTracks); return *theTracks; }
+ __device__ __forceinline__ CellNeighbors& outerNeighbors() { assert(theOuterNeighbors); return *theOuterNeighbors; }
+ __device__ __forceinline__ CellNeighbors const& outerNeighbors() const { assert(theOuterNeighbors); return *theOuterNeighbors; }
__device__ __forceinline__ float get_inner_x(Hits const& hh) const { return hh.xGlobal(theInnerHitId); }
__device__ __forceinline__ float get_outer_x(Hits const& hh) const { return hh.xGlobal(theOuterHitId); }
__device__ __forceinline__ float get_inner_y(Hits const& hh) const { return hh.yGlobal(theInnerHitId); }
OK, I admit I'm confused: do we ever use a gridDim different from {1, 1, 1} in compatibility mode on the CPU?
I think yes (the clustering?), when the blockIdx has a specific meaning (such as detector modules); and in principle it is reset afterwards. Of course it is possible to change the code and make sure the "loop" over detectors is inside the kernel. I think I realized that at some point, and was waiting for the integration to be over to change those kernels to be fully "sequential"-compatible as well (so as not to depend on the grid size).
Why it is crashing now: no clue. Maybe streams and threads are not one-to-one.
Here you see:
RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitSoAFromLegacy.cc: cms::cudacompat::resetGrid();
You are right: on CPU we DO NOT run the Patatrack clusterizer.... I am confused as well, also because resetGrid() is called in RecoLocalTracker/SiPixelRecHits/plugins/SiPixelRecHitSoAFromLegacy.cc.
Why it is crashing now: no clue. Maybe streams and threads are not one-to-one.
They are not guaranteed to be, no.
We also had rare cases where TBB after a while "retires" a worker thread and spawns a new one; if the thread_local variables are initialised only at the beginning of the job, they would end up being uninitialised in this case.
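A standalone illustration of that hazard (plain C++, not CMSSW code): a one-time static initializer runs on a single thread, so a thread spawned later sees only the zero-initialized thread_local value.

#include <cstdio>
#include <thread>

struct Dim3 { unsigned x, y, z; };
thread_local Dim3 gridDim;  // zero-initialized to {0, 0, 0} in every new thread

struct InitGrid {
  InitGrid() { gridDim = {1, 1, 1}; }  // runs once, on the thread doing static init
};
static InitGrid initGrid;

int main() {
  std::printf("main:   gridDim.x = %u\n", gridDim.x);  // prints 1
  std::thread worker([] {
    // a freshly spawned thread (like a new TBB worker) never ran InitGrid
    std::printf("worker: gridDim.x = %u\n", gridDim.x);  // prints 0, not 1
  });
  worker.join();
}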
OK, most probably we need to resetGrid() in each produce(). Adding it to the vertex producer as well.
We should be fully covered for the time being.
With these changes
diff --git a/HeterogeneousCore/CUDAUtilities/interface/cudaCompat.h b/HeterogeneousCore/CUDAUtilities/interface/cudaCompat.h
index f9b4b2f8a4c1..e8aa4cdc1b06 100644
--- a/HeterogeneousCore/CUDAUtilities/interface/cudaCompat.h
+++ b/HeterogeneousCore/CUDAUtilities/interface/cudaCompat.h
@@ -21,11 +21,11 @@ namespace cms {
uint32_t x, y, z;
};
#endif
+
const dim3 threadIdx = {0, 0, 0};
+ const dim3 blockIdx = {0, 0, 0};
const dim3 blockDim = {1, 1, 1};
-
- extern thread_local dim3 blockIdx;
- extern thread_local dim3 gridDim;
+ const dim3 gridDim = {1, 1, 1};
template <typename T1, typename T2>
T1 atomicCAS(T1* address, T1 compare, T2 val) {
@@ -78,10 +78,7 @@ namespace cms {
return *x;
}
- inline void resetGrid() {
- blockIdx = {0, 0, 0};
- gridDim = {1, 1, 1};
- }
+ inline void resetGrid() {}
} // namespace cudacompat
} // namespace cms
diff --git a/HeterogeneousCore/CUDAUtilities/src/cudaCompat.cc b/HeterogeneousCore/CUDAUtilities/src/cudaCompat.cc
index 7b8efda8e381..0b94c8f1d4b8 100644
--- a/HeterogeneousCore/CUDAUtilities/src/cudaCompat.cc
+++ b/HeterogeneousCore/CUDAUtilities/src/cudaCompat.cc
@@ -1,12 +1,5 @@
#include "HeterogeneousCore/CUDAUtilities/interface/cudaCompat.h"
-namespace cms {
- namespace cudacompat {
- thread_local dim3 blockIdx;
- thread_local dim3 gridDim;
- } // namespace cudacompat
-} // namespace cms
-
namespace {
struct InitGrid {
InitGrid() { cms::cudacompat::resetGrid(); }
diff --git a/RecoPixelVertexing/PixelVertexFinding/plugins/gpuVertexFinderImpl.h b/RecoPixelVertexing/PixelVertexFinding/plugins/gpuVertexFinderImpl.h
index 0da24cef219e..987b0af91dbd 100644
--- a/RecoPixelVertexing/PixelVertexFinding/plugins/gpuVertexFinderImpl.h
+++ b/RecoPixelVertexing/PixelVertexFinding/plugins/gpuVertexFinderImpl.h
@@ -157,8 +157,8 @@ namespace gpuVertexFinder {
// std::cout << "found " << (*ws_d).nvIntermediate << " vertices " << std::endl;
fitVertices(soa, ws_d.get(), 50.);
// one block per vertex!
- blockIdx.x = 0;
- gridDim.x = 1;
+ assert(blockIdx.x == 0);
+ assert(gridDim.x == 1);
splitVertices(soa, ws_d.get(), 9.f);
resetGrid();
fitVertices(soa, ws_d.get(), 5000.);
all src and plugins build fine.
There is one test that actually uses a grid of a different size, RecoLocalTracker/SiPixelClusterizer/test/gpuClustering_t.h, but it also says
gridDim.x = MaxNumModules; //not needed in the kernel for this specific case;
So... would it be OK to simply make the gridDim constant, equal to {1, 1, 1} (and adjust the test accordingly)?
Otherwise, since "most probably we need to resetGrid() in each produce()", would it make sense to wrap all CPU "kernel" calls in something like cms::cudacompat::launch(...) that would take care of setting the grids and blocks properly?
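For illustration, a minimal sketch of the CPU side of such a helper (the name cpuSerialLaunch is hypothetical, and it assumes the mutable thread_local blockIdx/gridDim of the current cudaCompat):

// Hypothetical CPU-side launch helper: set the emulated grid, run the "kernel"
// once per block, and restore the default {1, 1, 1} grid so no stale state
// leaks into the next kernel on this thread. Only the x dimension is emulated,
// matching the loops shown elsewhere in this thread.
template <typename Kernel, typename... Args>
void cpuSerialLaunch(Kernel kernel, dim3 grid, const Args&... args) {
  using namespace cms::cudacompat;
  gridDim = grid;
  for (blockIdx.x = 0; blockIdx.x < gridDim.x; ++blockIdx.x)
    kernel(args...);  // the kernel reads blockIdx/gridDim internally
  resetGrid();
}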
At one point I did wrap every kernel in a modified version of "your" launch and modified all the drivers (.cu and .cc), and many of them turned out to be identical at that point. We did not agree that this was the time and the way to do it.
As for the clusterizer, as I said, the kernel must be modified to be independent of the grid size. Once that is done there is no need to play with the blockIdx even there, and indeed we (well, the pixel code) can just run with the cudaCompat you propose.
At one point I did wrap every kernel in a modified version of "your" launch and modified all the drivers (.cu and .cc), and many of them turned out to be identical at that point.
Yes; I think that was https://github.com/cms-patatrack/cmssw/pull/428 ?
We did not agree that this was the time and the way to do it.
About the time: I'd still rather do it after the integration upstream, now that it's finally getting close to happening.
About the way: I'd prefer to keep cms::cuda::launch() CUDA-only, and implement a separate cms::cudautils::launch() that calls cms::cuda::launch() for CUDA, or the CPU variant for a CPU-only case. The reason being to make it easier to transition to something else later (be it Alpaka, Kokkos, SYCL, etc.).
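A hedged sketch of that split (LaunchConfig and cpuSerialLaunch are hypothetical, and the exact signature of cms::cuda::launch() is assumed): a thin wrapper that forwards to the CUDA-only launcher when compiling for the device, and to a serial CPU fallback otherwise.

// Illustrative only: the dispatching wrapper proposed above. LaunchConfig and
// cpuSerialLaunch are hypothetical; the cms::cuda::launch() call assumes a
// (kernel, config, args...) signature.
struct LaunchConfig { dim3 gridDim, blockDim; };

namespace cms::cudautils {
  template <typename Kernel, typename... Args>
  void launch(Kernel kernel, LaunchConfig config, const Args&... args) {
#ifdef __CUDACC__
    // GPU build: defer to the existing CUDA-only helper
    cms::cuda::launch(kernel, {config.gridDim, config.blockDim}, args...);
#else
    // CPU build: fall back to the serial emulation (see the loop sketched above)
    cpuSerialLaunch(kernel, config.gridDim, args...);
#endif
  }
}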
As for the clusterizer, as I said, the kernel must be modified to be independent of the grid size.
I take your word for it; I just can't find where the grid size is passed to the CPU kernel(s)?
this code
if (blockIdx.x >= moduleStart[0])
return;
auto firstPixel = moduleStart[1 + blockIdx.x];
depends on the grid size
Instead it should loop, as we do for the threadIdx. So on CPU (in the test) we are forced to loop in the driver:
gridDim.x = MaxNumModules; // not needed in the kernel for this specific case
assert(blockIdx.x == 0);
for (; blockIdx.x < gridDim.x; ++blockIdx.x)
clusterChargeCut(
h_id.get(), h_adc.get(), h_moduleStart.get(), h_clusInModule.get(), h_moduleId.get(), h_clus.get(), n);
resetGrid();
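For reference, a hedged sketch of the grid-independent form (the signature is illustrative): stride over the modules by gridDim.x inside the kernel, mirroring the usual thread-stride loops, so the fixed {1, 1, 1} CPU grid still visits every module.

// Illustrative sketch only: instead of assuming "one block == one module",
// stride over the modules by gridDim.x. With the fixed CPU grid ({1, 1, 1},
// block 0) the loop walks all modules sequentially; on the GPU each block
// still handles its own slice of modules.
__global__ void clusterChargeCut(uint32_t const* moduleStart /*, ... */) {
  auto numModules = moduleStart[0];
  for (auto module = blockIdx.x; module < numModules; module += gridDim.x) {
    auto firstPixel = moduleStart[1 + module];
    // ... per-module work, with the usual threadIdx/blockDim loops inside ...
  }
}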
OK, I see. So:
- RecoLocalTracker/SiPixelClusterizer/plugins/gpuClusterChargeCut.h requires setting the grid size depending on the number of modules, and cannot run with the simple {1,1,1} grid size;
- RecoLocalTracker/SiPixelClusterizer/test/gpuClustering_t.h then relies on being able to set the grid size and loop over the blocks when running on the CPU;
- and RecoLocalTracker/SiPixelClusterizer/plugins/SiPixelRawToClusterGPUKernel.cu ?
Do you think we can change the kernels to run with the {1, 1, 1} grid, or would that bring any downsides (performance, etc.)?
If the kernels can be changed to work with the {1, 1, 1} grid, my preferences would be:
1. cudacompat code to use a constant, fixed grid size of {1, 1, 1};
2. cudacompat with a user-friendly launch function that takes care of setting the grid size, etc.
Otherwise (if the kernels cannot be made to use a {1, 1, 1} grid size without loss of performance or other downsides), my preference would be:
2. cudacompat with a user-friendly launch function that takes care of setting the grid size, etc.
@VinInn, which of these two, or what other option, would you prefer ?
@makortel since this touches also on the more "core" part of the compatibility layer, do you have any opinions ?
Second question:
@tsusa @mmusich do you have any opinions about this last point ?
I think the best option is to adopt the modification you (@fwyzard) propose. I was planning to change the clusterizer kernels anyway to make them independent of the grid setting (as all the other kernels in the pixel code are).
One possibility is to integrate your changes and "inhibit" the test (/* code */) with a clear comment that it needs to be fixed. Then fix it after the integration.
The clusterizer was never integrated in the CPU workflows, as it requires the Raw2Digi to be ported first (and that was supposed to happen after the integration, to benefit from a coherent integration of the legacy and SoA code).
If the decision is to change the clusterizer kernel now, I can work on that this week (provided all other changes to the local reconstruction have already been integrated: I do not want to run into merging issues).
Looking at the open PRs (https://github.com/cms-patatrack/cmssw/pulls/) I don't think there should be conflicts, but I'm not 100% sure.
So, if I understood your comments correctly, I think we could:
1. make the cudacompat grid a {1, 1, 1} constant size and comment out the test;
2. fix the clusterizer kernels to be grid-size independent and restore the test.
The first should happen before the integration (I can do it tomorrow or Wednesday). The second can happen before or after, depending on the timeline.
@fwyzard +1
Thanks to a long DAQ meeting, the first part should be done by #586.
Clusterizer fixed in #588. The 4-line modification is completely swamped by the code-format re-indentation.
since this touches also on the more "core" part of the compatibility layer, do you have any opinions ?
Fixed by #586 and #588.
In recent Patatrack releases (both CMSSW_11_2_0_pre7 with the current CMSSW_11_2_X_Patatrack branch, and CMSSW_11_2_0_pre8 with the current master branch) I see frequent crashes in the 11634.501 workflow, that is, Patatrack pixel quadruplets running on the CPU. This may actually have been there for a while, and have been revealed by the update to the workflow (before, we were actually testing the legacy pixel tracks with the new fits).
I've observed this using the relvals and global tags from CMSSW_11_2_0_pre3 and CMSSW_11_2_0_pre7, so it's likely not dependent on the input data.
To reproduce it with pre7:
To reproduce it with pre8 (only the Patatrack branch and global tag are different):
The input file is available under /gpu_data/store/... on the online machines, under /data/store/... on vocms006, and otherwise over xrootd.