Closed mmusich closed 5 months ago
@cms-sw/tracking-pog-l2 FYI
A new Issue was created by @mmusich.
@rappoccio, @antoniovilela, @smuzaffar, @makortel, @Dr15Jones, @sextonkennedy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign hlt, heterogeneous, reconstruction
FYI @AdrianoDee
New categories assigned: hlt,heterogeneous,reconstruction
@Martin-Grunewald, @mmusich, @fwyzard, @jfernan2, @makortel, @mandrenguyen you have been requested to review this Pull request/Issue and eventually sign. Thanks
The failing assertion is a NaN check https://github.com/cms-sw/cmssw/blob/653fed5bbe82444345f0d2351d59884276983fef/RecoTracker/PixelSeeding/plugins/alpaka/BrokenLineFit.dev.cc#L163-L167
Curious.
Unrelated to the crash, but I wonder if

```cpp
ALPAKA_ASSERT_ACC(not isnan(fast_fit(0)));
ALPAKA_ASSERT_ACC(not isnan(fast_fit(1)));
ALPAKA_ASSERT_ACC(not isnan(fast_fit(2)));
ALPAKA_ASSERT_ACC(not isnan(fast_fit(3)));
```

wouldn't be easier to understand when reading the code?
Added the following printf in alpaka/BrokenLine.h

```cpp
printf("fastFit: 0=%f 1=%f 2=%f 3=%f tmp=%f a=(%f,%f) b=(%f,%f) c=(%f,%f)\n", result(0), result(1), result(2), result(3), tmp, a(0), a(1), b(0), b(1), c(0), c(1));
```
Output on GPU:

```
fastFit: 0=inf 1=inf 2=inf 3=-nan tmp=inf a=(0.368802,-3.760513) b=(0.934136,-9.524984) c=(-1.302937,13.285498)
```

Output on CPU:

```
fastFit: 0=-303360091.997263 1=-29751183.422276 2=304815482.465272 3=0.534923 tmp=-631410.897150 a=(0.368802,-3.760513) b=(0.934136,-9.524984) c=(-1.302937,13.285498)
```
On GPU, both `riemannFit::cross2D(acc, b, a)` and `riemannFit::cross2D(acc, c, a)` are zero in this case.
The CUDA implementation (tested by just using a different HLT menu [1]) also produces the nan
discussed above. The CUDA implementation also has similar assert
calls (link), but (if I understand correctly) these assert
calls do nothing when running the CUDA implementation (because NDEBUG
is defined).
How to proceed? Remove the ALPAKA_ASSERT_ACC calls in the Alpaka version? Or add some kind of safeguard against these zero/NaN/inf values? Or something else?
[1]
```bash
#!/bin/bash
jobLabel=test_cmssw44786_cuda

if [ ! -f "${jobLabel}"_cfg.py ]; then
  https_proxy=http://cmsproxy.cms:3128/ \
  hltGetConfiguration run:379660 \
    --globaltag 140X_dataRun3_HLT_v3 \
    --data \
    --no-prescale \
    --no-output \
    --paths AlCa_PFJet40_v* \
    --max-events 1 \
    --input root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/debug/240417_run379617/run379617_ls0329_index000242_fu-c2b02-12-01_pid3327112.root \
    > "${jobLabel}"_cfg.py

  cat <<@EOF >> "${jobLabel}"_cfg.py
del process.hltL1sZeroBias
if hasattr(process, 'HLTAnalyzerEndpath'):
    del process.HLTAnalyzerEndpath
try:
    del process.MessageLogger
    process.load('FWCore.MessageLogger.MessageLogger_cfi')
    process.MessageLogger.cerr.enableStatistics = False
except:
    pass
process.options.numberOfThreads = 1
process.options.numberOfStreams = 0
process.source.skipEvents = cms.untracked.uint32( 86 )
process.options.wantSummary = True
@EOF
fi

CUDA_LAUNCH_BLOCKING=1 \
cmsRun "${jobLabel}"_cfg.py &> "${jobLabel}".log
```
type tracking
How to proceed? Remove the ALPAKA_ASSERT_ACC calls in the Alpaka version? Or add some kind of safeguard against these zero/NaN/inf values? Or something else?
Naively I don't think that just removing the assert is a good approach (dust, carpet, you get the idea).
On GPU, both `riemannFit::cross2D(acc, b, a)` and `riemannFit::cross2D(acc, c, a)` are zero in this case.
So the three points are basically aligned. @AdrianoDee @rovere @VinInn @slava77 would it make sense to check for this and handle in some special way ?
Naively I don't think that just removing the assert is a good approach (dust, carpet, you get the idea).
While agreeing with the approach, we need to make an assessment soon, because every day we spend discussing this is one less day we take data with the menu V1.1 (and all its physics trigger updates). So we either:
- go online with CMSSW_14_0_5_patch2, with the potential pitfall of this kind of crash (hopefully it will be low frequency: a single occurrence in run 379617, though we still need to look at run 379613);
- prepare a version "V1.05" menu with the other updates while reverting the pixel local reconstruction to CUDA (which AFAIU has the same problem though). This will be very labor intensive on the STORM side;
- temporarily remove the assert in the Alpaka code while something better is prepared by the experts.
I am loath to adopt "temporary" solutions, because they have the undesirable tendency to stick around much longer than intended.
1) We never assert on NaN (please remove the assert; I do not know who introduced those, and in any case none of those are safe in case of fast-math (on host)).
2) Why are we asserting in Alpaka: is this not making the GPU version dramatically slower?
3) We need to emit an error:
   a) we do not have (yet) a mechanism to propagate errors from device to host;
   b) most probably we can just let the NaN percolate and catch it later on the host;
   c) we should just invalidate that track.
- We never assert on NaN (please remove the assert, I do not know who introduced those and in any case none of those are safe in case of fast-math (on host))
Technically, I think you did :-p The oldest commit I could find with that code already has the comment and assert, and they made their way through the patatrack integration in CMSSW and the porting to Alpaka.
- Why are we asserting in Alpaka: is this not making the gpu version dramatically slower?
Probably yes. But we also have a lot of extra code being put online for the first time, so I decided it was a good idea to keep the asserts enabled while we commission the new code. We can re-measure the impact of keeping asserts in GPU code, and decide what to do once all crashes have been solved.
- We need to emit an error:
- a) we do not have (yet) a mechanism to propagate errors from device to host
Technically a device-side assert does just that: it makes the kernel fail, which is caught by an asynchronous exception on the host. But I agree it's not a very nice approach.
- b) most probably we can just let the NaN percolate and catch it later on the host
OK - but is there already some code on the host that would catch it and ...
- c) we should just invalidate that track
... do this ?
OK - but is there already some code on the host that would catch it and ... ... do this ?
Yes, the track would be discarded later in the chain https://github.com/cms-sw/cmssw/blob/3fb8b0edb88bcba34579ca4b9867427862eda51d/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGeneratorKernelsImpl.h#L524-L532
I checked that it is the case for the event there (when disabling the NaN checks). Note that this is not only a single event, it's actually a single triplet in the whole run.
Curious - std::isnan is not constexpr (until C++23), so in principle it should not work in a kernel.
c) we should just invalidate that track
by the way - why ?
3 (almost) aligned hits would point to a very high pT track. Why should we invalidate it ?
a=(0.368802,-3.760513) b=(0.934136,-9.524984) c=(-1.302937,13.285498)
Could this be printed with full precision using "%a" instead of "%f"? I would like to test whether it is FMA that makes the cross product zero, or something else.
I added the printf and ran on CPU: got no output! Which code is running on a machine without a GPU?
I had edited the file in interface/, not in interface/alpaka/... I hope we get rid of all these duplications soon.
Found

```
fastFit: 0=inf 1=inf 2=inf 3=-nan tmp=inf a=(0x1.79a72p-2,-0x1.e1588p+1) b=(0x1.de4706p-1,-0x1.30ccacp+3) c=(-0x1.4d8d4bp+0,0x1.a922ccp+3)
```

This is on CPU, I suppose:

```
fastFit: 0=-0x1.214e85bff4ca4p+28 1=-0x1.c5f78f6c1a45ap+24 2=0x1.22b1d7a771c18p+28 3=0x1.11e163587decap-1 tmp=-0x1.344e5cb57444ap+19 a=(0x1.79a724p-2,-0x1.e1588p+1) b=(0x1.de4704p-1,-0x1.30ccacp+3) c=(-0x1.4d8d4bp+0,0x1.a922ccp+3)
```
It is not a subtle precision issue: the cross product of those three double-precision numbers (any combination) is zero (on CPU as well), no matter how one computes it. On CPU the inputs differ by a few bits.
A track with zero curvature will sooner or later produce a NaN, and somebody will reject it. We are not even sure that a very small value would survive further manipulations downstream: so, unless somebody goes around and makes sure C=0 is properly propagated, the best is just to keep the NaN. For sure the asserts shall be removed, even if an exact ZERO in double precision is always a bit suspicious (it can still happen, there are only 53 bits after all...) and it can happen on CPU as well. In the printout there are "plenty" of values that could be "inf" in float.
we still need to look at run 379613 though
For the record, we checked all the available error stream files for that run with the following script [1] in CMSSW_14_0_5_patch2, and found no other crashes.
This particular issue should be dealt with by https://github.com/cms-sw/cmssw/pull/44808 (if accepted).
[1]
+hlt
The first stable-beams run with CMSSW_14_0_6_MULTIARCHS (intermediate patches were not used online), together with the HLT menu GRun 2024 V1.1 (the "v1.1.X" menu, see CMSHLT-3164), was run-380306 (Run2024D). No further issues were spotted.
+heterogeneous
While reviewing the whole list of error streamer files from run 379617 (related issue https://github.com/cms-sw/cmssw/issues/44769) stored on
/eos/cms/store/group/tsg/FOG/debug/240417_run379617/
to ascertain if CMSSW_14_0_5_patch2 fixed all of them, using the following script [1], I've found a single instance which still crashes, taking in input the file
/eos/cms/store/group/tsg/FOG/debug/240417_run379617/run379617_ls0329_index000242_fu-c2b02-12-01_pid3327112.root.

To reproduce:
and then running:
On lxplus-gpu the following assertion is hit:
while on lxplus (so on CPU) no crash is observed.

@cms-sw/hlt-l2 FYI
@cms-sw/heterogeneous-l2 FYI
[1]
```bash
#!/bin/bash -ex

# CMSSW_14_0_5_patch2

hltGetConfiguration run:379617 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input file:converted.root > hlt.py

cat <<@EOF >> hlt.py
process.options.numberOfThreads = 32
process.options.numberOfStreams = 32
@EOF

# Define a function to execute each iteration of the loop
process_file() {
  inputfile="$1"
  outputfile="${inputfile%.root}"
  cp hlt.py hlt_${outputfile}.py
  sed -i "s/file:converted\.root/\/store\/group\/tsg\/FOG\/debug\/240417_run379617\/${inputfile}/g" hlt_${outputfile}.py
  cmsRun hlt_${outputfile}.py &> "${outputfile}.log"
}

# Export the function so it can be used by parallel
export -f process_file

# Find the root files and run the function in parallel using GNU Parallel
eos ls /eos/cms/store/group/tsg/FOG/debug/240417_run379617/ | grep '\.root$' | parallel -j 8 process_file
```