cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

HLT farm crash in run 379617 (part-2) #44786

Closed. mmusich closed this issue 5 months ago.

mmusich commented 5 months ago

While reviewing the whole list of error streamer files from run 379617 (related issue https://github.com/cms-sw/cmssw/issues/44769) stored at /eos/cms/store/group/tsg/FOG/debug/240417_run379617/, to check whether CMSSW_14_0_5_patch2 fixed all of them, using the script at [1], I found a single instance that still crashes when taking as input the file /eos/cms/store/group/tsg/FOG/debug/240417_run379617/run379617_ls0329_index000242_fu-c2b02-12-01_pid3327112.root.

To reproduce:

cmsrel CMSSW_14_0_5_patch2
cd CMSSW_14_0_5_patch2/src
cmsenv

and then running:

#!/bin/bash -ex

# CMSSW_14_0_5_patch2

hltGetConfiguration run:379617 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input /store/group/tsg/FOG/debug/240417_run379617/run379617_ls0329_index000242_fu-c2b02-12-01_pid3327112.root  > hlt.py

cmsRun hlt.py &> hlt.log

On lxplus-gpu the following assertion is hit:

terminate called after throwing an instance of 'std::runtime_error'
  what():  
src/HeterogeneousCore/CUDAUtilities/src/CachingDeviceAllocator.h, line 617:
cudaCheck(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream));
cudaErrorAssert: device-side assert triggered

A fatal system signal has occurred: abort signal
The following is the call stack containing the origin of the signal.

src/RecoTracker/PixelSeeding/plugins/alpaka/BrokenLineFit.dev.cc:167: void alpaka_cuda_async::Kernel_BLFastFit<N, TrackerTraits>::operator()(const TAcc &, const reco::TrackSoA<TrackerTraits>::HitContainer *, const cms::alpakatools::OneToManyAssocRandomAccess<TrackerTraits::tindex_type, <expression>, TrackerTraits::maxNumberOfTuples> *, TrackingRecHitSoA<TrackerTraits>::Layout::ConstView, const pixelCPEforDevice::ParamsOnDeviceT<TrackerTraits> *, TrackerTraits::tindex_type *, double *, float *, double *, unsigned int, unsigned int, signed int) const [with TAcc = alpaka::AccGpuUniformCudaHipRt<alpaka::ApiCudaRt, std::integral_constant<unsigned long, 1UL>, unsigned int>; <template-parameter-2-2> = void; int N = 3; TrackerTraits = pixelTopology::Phase1]: block:[69,0,0], thread: [2,0,0] Assertion `fast_fit(3) == fast_fit(3)` failed.

while on lxplus (so on CPU) no crash is observed.

@cms-sw/hlt-l2 FYI @cms-sw/heterogeneous-l2 FYI

[1]

```bash
#!/bin/bash -ex

# CMSSW_14_0_5_patch2

hltGetConfiguration run:379617 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input file:converted.root > hlt.py

cat <<@EOF >> hlt.py
process.options.numberOfThreads = 32
process.options.numberOfStreams = 32
@EOF

# Define a function to execute each iteration of the loop
process_file() {
  inputfile="$1"
  outputfile="${inputfile%.root}"
  cp hlt.py hlt_${outputfile}.py
  sed -i "s/file:converted\.root/\/store\/group\/tsg\/FOG\/debug\/240417_run379617\/${inputfile}/g" hlt_${outputfile}.py
  cmsRun hlt_${outputfile}.py &> "${outputfile}.log"
}

# Export the function so it can be used by parallel
export -f process_file

# Find the root files and run the function in parallel using GNU Parallel
eos ls /eos/cms/store/group/tsg/FOG/debug/240417_run379617/ | grep '\.root$' | parallel -j 8 process_file
```

mmusich commented 5 months ago

@cms-sw/tracking-pog-l2 FYI

cmsbuild commented 5 months ago

cms-bot internal usage

cmsbuild commented 5 months ago

A new Issue was created by @mmusich.

@rappoccio, @antoniovilela, @smuzaffar, @makortel, @Dr15Jones, @sextonkennedy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 5 months ago

assign hlt, heterogeneous, reconstruction

makortel commented 5 months ago

FYI @AdrianoDee

cmsbuild commented 5 months ago

New categories assigned: hlt,heterogeneous,reconstruction

@Martin-Grunewald,@mmusich,@fwyzard,@jfernan2,@makortel,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 5 months ago

The failing assertion is a NaN check https://github.com/cms-sw/cmssw/blob/653fed5bbe82444345f0d2351d59884276983fef/RecoTracker/PixelSeeding/plugins/alpaka/BrokenLineFit.dev.cc#L163-L167
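
For context, the check relies on the fact that a NaN never compares equal to itself, so `fast_fit(3) == fast_fit(3)` is false exactly when the value is NaN. A minimal standalone sketch of what the check amounts to (not the CMSSW code itself):

```cpp
#include <cassert>
#include <cmath>

int main() {
  double fast_fit_3 = std::nan("");  // stand-in for fast_fit(3)
  // A NaN is the only value that does not compare equal to itself (IEEE 754),
  // so this self-comparison is effectively a NaN check.
  assert(fast_fit_3 == fast_fit_3);  // fires (aborts) when fast_fit_3 is NaN
}
```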

fwyzard commented 5 months ago

Curious.

Unrelated to the crash, but I wonder if

ALPAKA_ASSERT_ACC(not isnan(fast_fit(0)));
ALPAKA_ASSERT_ACC(not isnan(fast_fit(1)));
ALPAKA_ASSERT_ACC(not isnan(fast_fit(2)));
ALPAKA_ASSERT_ACC(not isnan(fast_fit(3)));

wouldn't be easier to understand when reading the code?

missirol commented 5 months ago

Added the following printf in alpaka/BrokenLine.h

printf("fastFit: 0=%f 1=%f 2=%f 3=%f tmp=%f a=(%f,%f) b=(%f,%f) c=(%f,%f)\n", result(0), result(1), result(2), result(3), tmp, a(0), a(1), b(0), b(1), c(0), c(1));

Output on GPU.

fastFit: 0=inf 1=inf 2=inf 3=-nan tmp=inf a=(0.368802,-3.760513) b=(0.934136,-9.524984) c=(-1.302937,13.285498)

Output on CPU.

fastFit: 0=-303360091.997263 1=-29751183.422276 2=304815482.465272 3=0.534923 tmp=-631410.897150 a=(0.368802,-3.760513) b=(0.934136,-9.524984) c=(-1.302937,13.285498)

On GPU, both riemannFit::cross2D(acc, b, a) and riemannFit::cross2D(acc, c, a) are zero in this case.
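
To illustrate how a vanishing cross product can cascade into the inf/NaN values printed above, here is a minimal sketch; it assumes cross2D is the usual antisymmetric 2D product and is not the actual fast-fit algebra:

```cpp
#include <cstdio>

// Assumption: cross2D is the usual antisymmetric 2D product, as in riemannFit.
double cross2D(const double a[2], const double b[2]) {
  return a[0] * b[1] - a[1] * b[0];
}

int main() {
  // Three exactly collinear toy points: both cross products vanish.
  const double a[2] = {1.0, 1.0}, b[2] = {2.0, 2.0}, c[2] = {3.0, 3.0};
  const double cb = cross2D(b, a);  // 0.0
  const double cc = cross2D(c, a);  // 0.0
  // A circle fast fit divides by such terms, so a zero cross product
  // turns into inf, and combining two infinities then yields NaN.
  const double r1 = 1.0 / cb;       // inf
  const double r2 = r1 - 1.0 / cc;  // inf - inf -> NaN
  std::printf("%f %f\n", r1, r2);   // prints "inf" and "nan" (or "-nan")
}
```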

missirol commented 5 months ago

The CUDA implementation (tested by just using a different HLT menu [1]) also produces the nan discussed above. The CUDA implementation also has similar assert calls (link), but (if I understand correctly) these assert calls do nothing when running the CUDA implementation (because NDEBUG is defined).
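
For reference, this NDEBUG behaviour is just that of the standard C assert macro, which compiles to nothing when NDEBUG is defined; a minimal host-side sketch:

```cpp
#define NDEBUG  // defined in release builds, as in the CUDA case mentioned above
#include <cassert>
#include <cmath>

int main() {
  double x = std::nan("");
  assert(x == x);  // with NDEBUG, assert(...) expands to ((void)0): never evaluated
  return 0;        // program exits normally even though x is NaN
}
```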

How to proceed? Remove the ALPAKA_ASSERT_ACC calls in the Alpaka version? Or add some kind of safeguard against these zero/NaN/inf values? Or something else?


[1]

#!/bin/bash

jobLabel=test_cmssw44786_cuda

if [ ! -f "${jobLabel}"_cfg.py ]; then

  https_proxy=http://cmsproxy.cms:3128/ \
  hltGetConfiguration run:379660 \
    --globaltag 140X_dataRun3_HLT_v3 \
    --data \
    --no-prescale \
    --no-output \
    --paths AlCa_PFJet40_v* \
    --max-events 1 \
    --input root://eoscms.cern.ch//eos/cms/store/group/tsg/FOG/debug/240417_run379617/run379617_ls0329_index000242_fu-c2b02-12-01_pid3327112.root \
    > "${jobLabel}"_cfg.py

  cat <<@EOF >> "${jobLabel}"_cfg.py

del process.hltL1sZeroBias

if hasattr(process, 'HLTAnalyzerEndpath'):
    del process.HLTAnalyzerEndpath

try:
    del process.MessageLogger
    process.load('FWCore.MessageLogger.MessageLogger_cfi')
    process.MessageLogger.cerr.enableStatistics = False
except:
    pass

process.options.numberOfThreads = 1
process.options.numberOfStreams = 0

process.source.skipEvents = cms.untracked.uint32( 86 )

process.options.wantSummary = True
@EOF
fi

CUDA_LAUNCH_BLOCKING=1 \
cmsRun "${jobLabel}"_cfg.py &> "${jobLabel}".log
slava77 commented 5 months ago

type tracking

fwyzard commented 5 months ago

How to proceed? Remove the ALPAKA_ASSERT_ACC calls in the Alpaka version? Or add some kind of safeguard against these zero/NaN/inf values? Or something else?

Naively I don't think that just removing the assert is a good approach (dust, carpet, you get the idea).

On GPU, both riemannFit::cross2D(acc, b, a) and riemannFit::cross2D(acc, c, a) are zero in this case.

So the three points are basically aligned. @AdrianoDee @rovere @VinInn @slava77 would it make sense to check for this and handle it in some special way?
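
One possible shape for such a safeguard, purely as a sketch of the idea under discussion (the function name, tolerance value, and zero-curvature fallback are hypothetical, not an implemented fix):

```cpp
#include <cmath>

// Hypothetical guard: if the cross products of the hit coordinates vanish,
// the three hits are (numerically) collinear; fall back to zero curvature
// instead of dividing by ~0 and producing inf/NaN.
bool guardedCurvature(const double a[2], const double b[2], const double c[2],
                      double& curvature) {
  const double crossBA = b[0] * a[1] - b[1] * a[0];
  const double crossCA = c[0] * a[1] - c[1] * a[0];
  constexpr double eps = 1e-12;  // tolerance, would need to be tuned
  if (std::abs(crossBA) < eps || std::abs(crossCA) < eps) {
    curvature = 0.0;  // treat as a straight, very high-pT candidate
    return false;     // tell the caller the circle fit is degenerate
  }
  curvature = 1.0 / crossBA;  // placeholder for the real fast-fit expression
  return true;
}

int main() {
  // Toy, exactly collinear hits: the guard triggers and reports a degenerate fit.
  const double a[2] = {1., 1.}, b[2] = {2., 2.}, c[2] = {3., 3.};
  double curv = 0.;
  return guardedCurvature(a, b, c, curv) ? 1 : 0;  // returns 0 here
}
```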

mmusich commented 5 months ago

Naively I don't think that just removing the assert is a good approach (dust, carpet, you get the idea).

While agreeing with the approach, we need to make an assessment soon, because every day we spend discussing this is one less day of taking data with the menu V1.1 (and all its physics trigger updates). So we either:

fwyzard commented 5 months ago

I do not see any reason for

  • prepare a version "V1.05" menu with the other updates while reverting the pixel local reconstruction to CUDA (which AFAIU has the same problem though). This will be very labor intensive on the STORM side.
fwyzard commented 5 months ago

remove temporarily the assert in the alpaka code while something better is prepared by the experts.

I am loath to adopt "temporary" solutions, because they have the undesirable tendency to stick around much longer than intended.

VinInn commented 5 months ago

1) We never assert on NaN (please remove the assert, I do not know who introduced those and in any case none of those are safe in case of fast-math (on host))
2) Why are we asserting in Alpaka: is this not making the gpu version dramatically slower?
3) We need to emit an error:
   a) we do not have (yet) a mechanism to propagate errors from device to host
   b) most probably we can just leave the NaN percolate and catch it later on Host
   c) we should just invalidate that track

fwyzard commented 5 months ago
  1. We never assert on NaN (please remove the assert, I do not know who introduced those and in any case none of those are safe in case of fast-math (on host))

Technically, I think you did :-p The oldest commit I could find with that code already has the comment and assert, and they made their way through the patatrack integration in CMSSW and the porting to Alpaka.

  2. Why are we asserting in Alpaka: is this not making the gpu version dramatically slower?

Probably yes. But we also have a lot of extra code being put online for the first time, so I decided it was a good idea to keep the asserts enabled while we commission the new code. We can re-measure the impact of keeping asserts in GPU code, and decide what to do once all crashes have been solved.

  3. We need to emit an error:
    • a) we do not have (yet) a mechanism to propagate errors from device to host

Technically a device-side assert does just that: it makes the kernel fail, which is caught by an asynchronous exception on the host. But I agree it's not a very nice approach.

  • b) most probably we can just leave the NaN percolate and catch it later on Host

OK - but is there already some code on the host that would catch it and ...

  • c) we should just invalidate that track

... do this ?

AdrianoDee commented 5 months ago

OK - but is there already some code on the host that would catch it and ... do this?

Yes, the track would be discarded later in the chain https://github.com/cms-sw/cmssw/blob/3fb8b0edb88bcba34579ca4b9867427862eda51d/RecoTracker/PixelSeeding/plugins/alpaka/CAHitNtupletGeneratorKernelsImpl.h#L524-L532

I checked that this is the case for the event in question (when disabling the NaN checks). Note that this is not only a single event: it is actually a single triplet in the whole run.
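
The reason this works is that any comparison involving NaN evaluates to false, so a track whose fit produced NaNs fails whatever quality threshold is applied downstream. A hedged illustration (not the linked CMSSW code):

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical downstream selection: a track whose fit produced NaN pt/chi2
// fails every threshold comparison, because NaN compares false to everything.
struct Track {
  float pt;
  float chi2;
};

bool passesQuality(const Track& t) {
  return t.pt > 0.5f && t.chi2 < 30.f;  // both comparisons are false for NaN
}

int main() {
  const Track bad{std::nanf(""), std::nanf("")};
  std::printf("kept: %d\n", passesQuality(bad) ? 1 : 0);  // prints "kept: 0"
}
```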

fwyzard commented 5 months ago

Curious - std::isnan is not constexpr (until c++23), so in principle it should not work in a kernel.

fwyzard commented 5 months ago

c) we should just invalidate that track

by the way, why?

3 (almost) aligned hits would point to a very high-pT track. Why should we invalidate it?

VinInn commented 5 months ago

a=(0.368802,-3.760513) b=(0.934136,-9.524984) c=(-1.302937,13.285498)

Could these be printed with full precision, using "%a" instead of "%f"? I would like to test whether it is FMA that makes the cross product zero, or something else.
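
For reference, "%a" prints the exact hexadecimal floating-point representation (mantissa bits and binary exponent), so nothing is lost to decimal rounding:

```cpp
#include <cstdio>

int main() {
  const double x = 0.368802;  // one of the coordinates quoted above
  // "%f" rounds to six decimals; "%a" shows the exact bit pattern,
  // something like "0x1.79a7...p-2" for this value.
  std::printf("%f  %a\n", x, x);
}
```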

VinInn commented 5 months ago

I added the printf and ran on CPU: I got no output! Which code is running on a machine without a GPU?

VinInn commented 5 months ago

I had edited the file in interface/, not in interface/alpaka/... I hope we get rid of all these duplications soon.

VinInn commented 5 months ago

Found it:

fastFit: 0=inf 1=inf 2=inf 3=-nan tmp=inf a=(0x1.79a72p-2,-0x1.e1588p+1) b=(0x1.de4706p-1,-0x1.30ccacp+3) c=(-0x1.4d8d4bp+0,0x1.a922ccp+3)

This is on CPU, I suppose.

fastFit: 0=-0x1.214e85bff4ca4p+28 1=-0x1.c5f78f6c1a45ap+24 2=0x1.22b1d7a771c18p+28 3=0x1.11e163587decap-1 tmp=-0x1.344e5cb57444ap+19 a=(0x1.79a724p-2,-0x1.e1588p+1) b=(0x1.de4704p-1,-0x1.30ccacp+3) c=(-0x1.4d8d4bp+0,0x1.a922ccp+3)

It is not a subtle precision issue: the cross product of those three double-precision numbers (any combination) is zero (on CPU as well), no matter how one computes it. On CPU the inputs differ by a few bits.

VinInn commented 5 months ago

A track with zero curvature will sooner or later produce a NaN and somebody will reject it. We are not even sure that a very small value would survive further manipulations downstream: so, unless somebody goes around and makes sure C=0 is properly propagated, the best is just to keep the NaN. For sure the asserts shall be removed, even if an exact ZERO in double precision is always a bit suspicious (it can still happen, there are only 53 bits after all...) and it can happen on CPU as well. In the printout there are "plenty" of values that could be "inf" in float.

mmusich commented 5 months ago

we still need to look at run 379613 though

for the record, we checked all the available error stream files for that run with the following script [1] in CMSSW_14_0_5_patch2 and found no other crashes.

This particular issue should be dealt with by https://github.com/cms-sw/cmssw/pull/44808 (if accepted).

[1]

```bash
#!/bin/bash -ex

# CMSSW_14_0_5_patch2

hltGetConfiguration run:379613 \
  --globaltag 140X_dataRun3_HLT_v3 \
  --data \
  --no-prescale \
  --no-output \
  --max-events -1 \
  --input file:converted.root > hlt.py

cat <<@EOF >> hlt.py
process.options.numberOfThreads = 32
process.options.numberOfStreams = 32
@EOF

# Define a function to execute each iteration of the loop
process_file() {
  inputfile="$1"
  outputfile="${inputfile%.root}"
  cp hlt.py hlt_${outputfile}.py
  sed -i "s/file:converted\.root/\/store\/group\/tsg\/FOG\/debug\/240417_run379613\/${inputfile}/g" hlt_${outputfile}.py
  cmsRun hlt_${outputfile}.py &> "${outputfile}.log"
}

# Export the function so it can be used by parallel
export -f process_file

# Find the root files and run the function in parallel using GNU Parallel
eos ls /eos/cms/store/group/tsg/FOG/debug/240417_run379613/ | grep '\.root$' | parallel -j 8 process_file
```

mmusich commented 4 months ago

+hlt

makortel commented 4 months ago

+heterogeneous