cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Exception using PATMuonProducer in PromptReco_Run382684_Tau #45398

Open mandrenguyen opened 3 months ago

mandrenguyen commented 3 months ago

https://cms-talk.web.cern.ch/t/exception-using-patmuonproducer-in-promptreco-run382684-tau/43269

There is a new issue affecting PromptReco_Run382684_Tau. This is the error message:

----- Begin Fatal Exception 08-Jul-2024 06:52:45 UTC-----------------------
An exception of category 'StdException' occurred while
   [0] Processing Event run: 382684 lumi: 233 event: 451943837 stream: 0
   [1] Running path 'write_MINIAOD_step'
   [2] Prefetching for module PoolOutputModule/'write_MINIAOD'
   [3] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [4] Prefetching for module PATMuonSelector/'selectedPatMuons'
   [5] Calling method for module PATMuonProducer/'patMuons'
Exception Message:
A std::exception was thrown.
Feature is not set: match2_pullX

So far, it affects only a single job from run 382684.

You can find logs and PSet for this job here:

/eos/user/c/cmst0/public/PausedJobs/Run2024F/PATMuonProducer/job_2215101/job/WMT

cmsbuild commented 3 months ago

A new Issue was created by @mandrenguyen.

@Dr15Jones, @antoniovilela, @makortel, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

mandrenguyen commented 3 months ago

assign reconstruction, xpog

cmsbuild commented 3 months ago

New categories assigned: reconstruction,xpog

@vlimant,@hqucms,@ftorrresd,@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks

mandrenguyen commented 3 months ago

FYI @24LopezR @rbhattacharya04 @cms-sw/muon-pog-l2

vlimant commented 3 months ago

would this be coming from https://github.com/cms-sw/cmssw/blob/master/PhysicsTools/XGBoost/src/XGBooster.cc#L92 ?

vlimant commented 3 months ago

referencing #45158 as this includes recent changes in that producer

vlimant commented 3 months ago

in all likelihood there's a NaN coming from https://github.com/cms-sw/cmssw/blob/master/PhysicsTools/PatAlgos/src/SoftMuonMvaRun3Estimator.cc#L30

float pullX(const MatchPair& match) {
  if (match.first and match.second->hasPhi())
    return dX(match) / sqrt(pow(match.first->xErr, 2) + pow(match.second->xErr, 2));
  else
    return 9999.;
}
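
As a minimal sketch of how a pull of this form can evaluate to NaN (one possible mechanism, not a diagnosis of this job): if the residual and both uncertainties are zero, the division is 0/0, and a NaN in any input propagates to the result. `ToyMatch` and `toyPull` are hypothetical stand-ins introduced for illustration; they are not the CMSSW types.

```cpp
#include <cmath>

// Hypothetical stand-in for the quantities entering the pull; not the CMSSW types.
struct ToyMatch {
  float dx;     // residual between the two positions
  float xErrA;  // position uncertainty of the first object
  float xErrB;  // position uncertainty of the second object
};

// Same arithmetic as the pull above: residual divided by the combined uncertainty.
// With IEEE 754 floating point, 0/0 gives NaN and nonzero/0 gives +/-infinity.
float toyPull(const ToyMatch& m) {
  return m.dx / std::sqrt(std::pow(m.xErrA, 2) + std::pow(m.xErrB, 2));
}
```

Under these assumptions, `toyPull({0.f, 0.f, 0.f})` is NaN, while a nonzero residual with zero uncertainties gives infinity rather than NaN.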

mmusich commented 3 months ago

> referencing #45158 as this includes recent changes in that producer

I am not sure I follow. That PR entered only CMSSW_14_0_10, see: https://github.com/cms-sw/cmssw/releases/tag/CMSSW_14_0_10. I thought Tier0 was still running CMSSW_14_0_9 (the replay for the new release is not yet done)?

JanFSchulte commented 3 months ago

Pinging @drkovalskyi since this seems to be coming from the producer from the new soft muon MVA.

drkovalskyi commented 3 months ago

I agree with Jean-Roch's assessment. If it's reproducible, I'll take a look.

vlimant commented 3 months ago

Running the configuration from /eos/user/c/cmst0/public/PausedJobs/Run2024F/PATMuonProducer/job_2215101/job/WMTaskSpace/ on lxplus-9 and cmssw-el8 and skipping to the 1212th record does not crash, so the issue is apparently not reproducible. But maybe this would crash on "an actual Tier0 node" ...

drkovalskyi commented 3 months ago

Thanks Jean-Roch for checking. I'll follow up.

gutsche commented 3 months ago

We think this is the same job, which was retried and threw a different but similar error:

382684 2215101 PromptReco_Run382684_Tau vocms0313 T1 DE KIT (c01-016-179) CMSSW:FatalException

An exception of category 'StdException' occurred while
   [0] Processing  Event run: 382684 lumi: 233 event: 451943837 stream: 5
   [1] Running path 'write_NANOAOD_step'
   [2] Prefetching for module PoolOutputModule/'write_NANOAOD'
   [3] Prefetching for module SimplePATTauFlatTableProducer/'boostedTauTable'
   [4] Prefetching for module PATObjectCrossLinker/'linkedObjects'
   [5] Prefetching for module PATJetRefSelector/'finalJetsPuppi'
   [6] Prefetching for module PATJetUserDataEmbedder/'updatedJetsPuppiWithUserData'
   [7] Prefetching for module PATJetUpdater/'updatedJetsPuppi'
   [8] Prefetching for module PATJetSelector/'slimmedJetsPuppi'
   [9] Prefetching for module PATJetUpdater/'updatedPatJetsTransientCorrectedSlimmedPuppiWithDeepTags'
   [10] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetFromMiniAODAK4PuppiCentralJetTagsSlimmedPuppiWithDeepTags'
   [11] Prefetching for module ParticleNetFeatureEvaluator/'pfParticleNetFromMiniAODAK4PuppiCentralTagInfosSlimmedPuppiWithDeepTags'
   [12] Prefetching for module PATMuonSlimmer/'slimmedMuons'
   [13] Prefetching for module PATMuonSelector/'selectedPatMuons'
   [14] Calling method for module PATMuonProducer/'patMuons'
Exception Message:
A std::exception was thrown.
Feature is not set: match2_pullX

This is the logarchive tarball:

root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/logs/prod/2024/7/7/PromptReco_Run382684_Tau/Reco/0000/5/ede54957-55ec-41b0-8b9f-5cc62f61cff8-160-5-logArchive.tar.gz

It's the same run and event number.

Thanks,

Oli

Dr15Jones commented 3 months ago

@gutsche the two exceptions are really the same. The context information about prefetching isn't really relevant to the underlying problem as multi-threading can lead to different prefetching paths leading up to running the same module.

drkovalskyi commented 3 months ago

The exception is a bit misleading: it is thrown not because the feature was not set, but because it was set to NaN. The failure is not reproducible locally. The isnan check plays a critical role in the code to ensure NaNs do not reach the XGBoost code, which cannot tolerate them. I'll replace the exception with a warning that has a clearer message and return a negative value for the XGBoost discriminator. I will make a PR tomorrow.

@JanFSchulte, I don't think looking for the source of the NaN is worth the effort. The muon is clearly not good if this happens, and we have many places where NaNs are tolerated. This doesn't happen often enough to have any impact on physics results. I'll give it a distinct value so we can study it offline and confirm this.
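
The change described above could look roughly like the following sketch: check every input with `std::isnan` before evaluation, print a warning that names the offending feature, and return a distinct negative value instead of throwing. Everything here is an assumption made for illustration: `Feature`, `kInvalidInputScore`, `guardedScore`, and the `predict` callback are hypothetical names, and the sentinel value is a placeholder rather than the one used in the actual PR.

```cpp
#include <cmath>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical feature representation: (name, value).
using Feature = std::pair<std::string, float>;

// Hypothetical sentinel for "inputs were invalid"; a placeholder, not the PR's value.
constexpr float kInvalidInputScore = -1.f;

// `predict` stands in for the real model evaluation (assumption).
// NaN inputs never reach it: the function warns and returns the sentinel instead.
template <typename Predict>
float guardedScore(const std::vector<Feature>& features, Predict predict) {
  for (const auto& f : features) {
    if (std::isnan(f.second)) {
      std::cerr << "Warning: feature '" << f.first
                << "' is NaN; returning sentinel score\n";
      return kInvalidInputScore;
    }
  }
  return predict(features);
}
```

A distinct sentinel keeps the affected muons identifiable offline, which matches the stated intent of studying how often this happens.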

vlimant commented 3 months ago

Not throwing an exception in case of NaN is a good thing indeed. The exception should still be thrown if the variable is not set, though; can you accommodate this? Chasing an irreproducible NaN in muon reconstruction is however worth the effort IMO (not my call though), as this might lead to a better understanding of reconstruction reproducibility and/or architecture dependency (which I think is at play here).
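
One way to keep the two cases apart, sketched under the assumption that a feature can be represented as `std::optional<float>` (how the XGBoost wrapper actually stores features is not shown in this thread): an empty optional means "never set" and still throws, while a NaN value is reported back to the caller without throwing. `FeatureStatus` and `checkFeature` are hypothetical names.

```cpp
#include <cmath>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical status returned for features that were set.
enum class FeatureStatus { Ok, NotANumber };

// Throws only when the feature was never set (a configuration or coding error).
// A value that was set to NaN is reported to the caller instead of throwing.
FeatureStatus checkFeature(const std::string& name, const std::optional<float>& value) {
  if (!value.has_value())
    throw std::logic_error("Feature is not set: " + name);
  return std::isnan(*value) ? FeatureStatus::NotANumber : FeatureStatus::Ok;
}
```

With this split, a missing feature remains a hard error, and the caller decides how to handle a NaN, for example by logging a warning and assigning a sentinel score.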

JanFSchulte commented 3 months ago

I agree with the proposed change and that it's probably not worth it to hunt this down in detail.

germanfgv commented 3 months ago

T0 ran the job 10 times on AMD machines, and all attempts failed. The job was successful when it ran on an Intel machine. I put the logs from this successful execution here:

/eos/user/c/cmst0/public/PausedJobs/Run2024F/PATMuonProducer/job_2215101/SuccessfulExecution

vlimant commented 3 months ago

We will need O&C expertise to track down this architecture-dependent behaviour; it is not something tau or jme/btv can do by themselves.

drkovalskyi commented 3 months ago

While I agree that we should understand recent issues related to architecture, there is no reason to throw an exception in this case. I'll make a PR as soon as I have time.

vlimant commented 3 months ago

Not disagreeing. We do have a conundrum, however: throw an exception to flag that there are architecture differences, or be blind to them ... do we have a team chasing these things down in O&C?

drkovalskyi commented 3 months ago

I tried to reproduce the problem on an AMD VM at CERN and had no issues processing it. So it's not Intel vs AMD; it's more involved. I'll make a physics-driven patch to avoid the exception. For the rest, I leave it to the core software developers to see how to address it. Unfortunately, in my experience NaNs are often treated as acceptable numbers.