cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

CUDA-related HLT crashes between run-359694 and run-359764 #39680

Closed: missirol closed this issue 1 year ago

missirol commented 1 year ago

Note: this issue is hopefully just for documentation purposes, as a possible solution (#39619) has already been put in place.

Between Oct-1 and Oct-4 (2022), several runs [1] were affected by bursts of HLT crashes (a reproducer can be found in [3]).

Stack traces [4] and offline checks [2] pointed to issues with reconstruction running on GPU.

@fwyzard identified an issue in the ECAL-GPU unpacker, and fixed it in #39617 (12_6_X), #39618 (12_5_X), #39619 (12_4_X). With the latter update, crashes were not observed anymore when re-running the HLT offline on data from some of the affected runs.

In parallel to this, @Sam-Harper and the ECAL experts realised that the crashes coincided with runs where ECAL suffered from data-integrity issues (see, for example, this DQM plot). On Oct-4, ECAL masked its channels affected by data-integrity errors (TT11 EB-6), and since then no further online crashes of this kind have been observed (even though HLT is still running CMSSW_12_4_9).

There is a separate open issue (#39568) likely related to the ECAL-GPU unpacker, but it was checked that #39619 does not solve it, so the issue in #39568 is likely different from the issue discussed here.


[1] Affected runs:

359694
359699
359750
359751
359762
359763
359764

[2] Offline checks:

The crashes could be reproduced offline using error-stream files from some of the affected runs, but not deterministically. They were more likely to occur when using more than 1 EDM stream, and in offline tests they occurred only if both the ECAL unpacker and the pixel reconstruction were offloaded to GPUs. A summary of the checks done offline was compiled by @Sam-Harper in this document.
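
For concreteness, a minimal sketch of the knobs varied in those tests, following the reproducer in [3]: process.options.numberOfThreads and numberOfStreams are the standard FWCore options controlling the number of threads and EDM streams, and the del statements move the ECAL unpacking back to the CPU; the analogous statements for the pixel GPU modules are not listed here.

# sketch only: append to the hlt.py dump obtained as in [3]
cat >> hlt.py <<@EOF
# run with more than one EDM stream (the crashes were more likely in this case);
# numberOfStreams defaults to the number of threads when left at 0
process.options.numberOfThreads = 4
process.options.numberOfStreams = 4
# move the ECAL unpacking back to the CPU (no crashes were observed in this case)
del process.hltEcalDigis.cuda
del process.hltEcalUncalibRecHit.cuda
@EOF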

[3] Reproducer (requires access to one of the online GPU machines; it might need to be run multiple times to see the crash at runtime):

#!/bin/bash

# release: CMSSW_12_4_9

https_proxy=http://cmsproxy.cms:3128 hltConfigFromDB --runNumber 359694 > hlt.py

cat >> hlt.py <<@EOF

# # disable ECAL unpacking on GPU (this makes the crash disappear)
# del process.hltEcalDigis.cuda
# del process.hltEcalUncalibRecHit.cuda

process.options.numberOfThreads = 4
process.source.fileListMode = True
process.source.fileNames = [
   '/store/error_stream/run359694/run359694_ls0112_index000079_fu-c2b04-35-01_pid2742152.raw',
   '/store/error_stream/run359694/run359694_ls0112_index000090_fu-c2b04-35-01_pid2742152.raw',
   '/store/error_stream/run359694/run359694_ls0166_index000141_fu-c2b02-35-01_pid2465574.raw',
   '/store/error_stream/run359694/run359694_ls0166_index000142_fu-c2b02-35-01_pid2465574.raw',
   '/store/error_stream/run359694/run359694_ls0166_index000175_fu-c2b02-16-01_pid2674062.raw',
   '/store/error_stream/run359694/run359694_ls0166_index000195_fu-c2b02-16-01_pid2674062.raw',
]
@EOF

cmsRun hlt.py &> hlt.log
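
(Since the crash is not fully reproducible, one option is to repeat the last step in a loop; a sketch:)

# optional: run the job several times and flag the attempts that crash
for i in $(seq 1 10); do
  cmsRun hlt.py &> hlt_${i}.log || echo "attempt ${i} failed, see hlt_${i}.log"
done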

[4] Example of a stack trace (from reproducer) in the attachment hlt.log.

cmsbuild commented 1 year ago

A new Issue was created by @missirol Marino Missiroli.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

missirol commented 1 year ago

assign ecal-dpg,reconstruction,heterogeneous

FYI: @cms-sw/hlt-l2

cmsbuild commented 1 year ago

New categories assigned: heterogeneous,reconstruction,ecal-dpg

@mandrenguyen,@fwyzard,@clacaputo,@simonepigazzini,@makortel,@jainshilpi,@thomreis you have been requested to review this Pull request/Issue and eventually sign? Thanks

Sam-Harper commented 1 year ago

So, just to keep track of things: it is still happening, at a much lower rate (14 crashes in run 360090; thanks @trocino for reporting).

The relevant DQM plot is: https://cmsweb.cern.ch/dqm/online/start?runnr=360090;dataset=/Global/Online/ALL;sampletype=online_data;filter=all;referencepos=overlay;referenceshow=customise;referencenorm=True;referenceobj1=refobj;referenceobj2=none;referenceobj3=none;referenceobj4=none;search=;striptype=object;stripruns=;stripaxis=run;stripomit=none;workspace=Ecal;size=L;root=EcalBarrel/EBIntegrityTask/TTId;focus=;zoom=no;

trocino commented 1 year ago

We also had 8 in 360075. The corresponding files are on Hilton:

/store/error_stream/run360075
/store/error_stream/run360090

thomreis commented 1 year ago

For both runs it is EB+08 that shows integrity errors.

thomreis commented 1 year ago

Are there plans for a patch release to deploy the fix from #39619 ?

fwyzard commented 1 year ago

Given that we have a likely fix since last week, why don't we have a patch release with it ?

perrotta commented 1 year ago

Given that we have a likely fix since last week, why don't we have a patch release with it ?

@fwyzard @thomreis we could build that release even now (it will be a full release, CMSSW_12_4_10). However, at yesterday's joint ops meeting it was said that HLT would prefer to wait until the end of the week, when the other expected fixes could also be ready and merged. We can discuss and agree on a possible change of plan, with respect to what was concluded yesterday, later this afternoon at the ORP meeting.

fwyzard commented 1 year ago

@perrotta can't we build a CMSSW_12_4_9_patch2 patch release based on CMSSW_12_4_9_patch1 adding only the changes necessary to fix the HLT crashes, namely #39580, #39619, and #39681 ?

thomreis commented 1 year ago

A CMSSW_12_4_9_patch2 now would allow ECAL to unmask EB-06 and prevent an eventual masking of EB+08, so I think that is the better option.

perrotta commented 1 year ago

We will discuss it this afternoon. To get ready for that discussion: why wouldn't a full release suit you? It would be far simpler (for us, of course), and a patch release on top of it, with the remaining HLT fixes, could then be quickly prepared later this week.

fwyzard commented 1 year ago

Because a patch release is something that we should be able to build and deploy in a matter of hours, not days. I.e. once a fix is available, we should be able to request a patch release at the morning meeting and be using it online by the afternoon.

IMHO this year we seem to have lost the mindset of a data-taking experiment, where fixes are both frequent and urgent. Instead, it looks to me like we are still operating in MC-and-reprocessing mode, where "urgent fixes" arrive on the timescale of days or even weeks, and even for the more critical ones, well, Tier-0 can always be paused, and the offline reconstruction is always re-RECO'ed eventually.

I do appreciate that PRs are merged pretty frequently also in the 12.4.X branch, although I'm starting to think that this may actually be counterproductive: as soon as a non-patch change is merged, it becomes more complicated to build a patch release. The upside is that the changes are tested in the IBs, though this doesn't always prevent mistakes from going into production.

So, how do we get back the capability of building patch releases quickly and as needed ?

fwyzard commented 1 year ago

Case in point: if we build a CMSSW_12_4_10 release, the soonest it can be deployed online is likely Thursday morning (build overnight, Tier-0 replay and HLT tests on Wednesday, and, if all goes well, deployment on Thursday).

If we had been able to request a CMSSW_12_4_9_patch2 release this morning, we should have had a mechanism to have it ready by this afternoon :-/

perrotta commented 1 year ago

Andrea, as of NOW a full release could be much faster: if you ask, we can even start building now, and you'll have it ready by this evening, tomorrow at most. (Had you or HLT asked yesterday, instead of saying that it could be delayed until the end of the week, it would have been even faster.)

Making a patch release with the three PRs you are asking for would require some amount of manual intervention:

  • branch off from 12_4_9_patch1
  • add the commits from those three PRs alone, hoping that no rebase is needed
  • further hope that all those operations are done without errors, because there are no IBs to check that the result of the above operations was correct

perrotta commented 1 year ago

While you think about what is best for you, I will start building 12_4_10: it can always be stopped, if you prefer something else.

fwyzard commented 1 year ago

(Would you or HLT have asked yesterday instead of saying that it could have been delayed till the end of the week it would have been even faster)

This is a separate discussion, which I am trying to have with TSG: why this was not asked. Were I not stuck in bed dealing with Covid, I would have asked myself.

Making a patch release with the three PRs you are asking for, would require some amount of manual intervention:

  • branch off from 12_4_9_patch1
  • add the commits from those three PRs alone, hoping that no rebase is needed
  • further hope that all those operations are done without errors, because there are no IBs to check that the result of the above operations was correct

I do understand that. And I don't think this is a viable procedure: we do need a way to make patch releases easier than that !

perrotta commented 1 year ago

While you think about what is best for you, I will start building 12_4_10: it can always be stopped, if you prefer something else.

See https://github.com/cms-sw/cmssw/issues/39694

davidlange6 commented 1 year ago

  • branch off from 12_4_9_patch1
  • add the commits from those three PRs alone, hoping that no rebase is needed
  • further hope that all those operations are done without errors, because there are no IBs to check that the result of the above operations was correct

I do understand that. And I don't think this is a viable procedure: we do need a way to make patch releases easier than that !

What is actually complex here? You can open a milestone and accept PRs just as with any other release cycle (at least that's what was done in Run 2). The "hope" can then be tested via the release build (if you are doing something that makes this procedure error-prone, then likely it's best not being a patch...)


fwyzard commented 1 year ago

I agree, and I guess the concern is the part where the PRs that are already merged in the 12.4.X branch need to be re-opened for the new target.

I think that, at least for simple PRs like the ones discussed here, it should be enough to do

# backport #39580 from CMSSW_12_4_X to CMSSW_12_4_9_patch1
git checkout CMSSW_12_4_9_patch1 -b backport_39580
git-cms-cherry-pick-pr 39580 CMSSW_12_4_X
git push my-cmssw -u backport_39580:backport_39580

# backport #39619 from CMSSW_12_4_X to CMSSW_12_4_9_patch1
git checkout CMSSW_12_4_9_patch1 -b backport_39619
git-cms-cherry-pick-pr 39619 CMSSW_12_4_X
git push my-cmssw -u backport_39619:backport_39619

# backport #39681 from CMSSW_12_4_X to CMSSW_12_4_9_patch1
git checkout CMSSW_12_4_9_patch1 -b backport_39681
git-cms-cherry-pick-pr 39681 CMSSW_12_4_X
git push my-cmssw -u backport_39681:backport_39681

then push the resulting branches and open the corresponding PRs.

Still, I acknowledge that it's not ideal, and more work than building a release from the current branch.

Maybe we could teach cmsbot to do the dirty work ?

davidlange6 commented 1 year ago

If the original branches were based on 12_4_9_patch1 or older, all this is a no-op and one can just make a second PR with the same developer branch as originally (90%+ of the time it's true, I guess).

This of course should be automated, if that automation is thought to be less work to do and maintain than doing those steps by hand. It never seemed that way when I dealt with it. (But YMMV.)

fwyzard commented 1 year ago

If the original branches were based on 12_4_9_patch1 or older, all this is a no-op and one can just make a second PR with the same developer branch as originally (90%+ of the time it's true, I guess).

Agreed, and in this case all three are based on CMSSW_12_4_9 or CMSSW_12_4_9_patch1... but at least for two of them, the original branch is gone :-p

Still, one could use the official-cmssw:pull/39580/head etc. branches instead of recreating new ones.
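
For example (a sketch, relying on the read-only refs/pull/<N>/head refs that GitHub keeps for every pull request; official-cmssw and my-cmssw are the same remotes as in the commands above):

# fetch the head of the already-merged PR into a local branch, without recreating it
git fetch official-cmssw pull/39580/head:backport_39580
# push it to one's fork and open the backport PR from there
git push my-cmssw -u backport_39580:backport_39580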

perrotta commented 1 year ago

@missirol can this be considered fixed, and the issue therefore closed?

missirol commented 1 year ago

I think so, but IMHO it would be better if the experts confirmed it, e.g. @cms-sw/heterogeneous-l2 @cms-sw/ecal-dpg-l2.

thomreis commented 1 year ago

+ecal-dpg-l2

We still see integrity errors from the HW in the ECAL from time to time, but the unpacker seems to handle them gracefully now.

thomreis commented 1 year ago

+ecal-dpg

fwyzard commented 1 year ago

+heterogeneous

missirol commented 1 year ago

please close