Open mandrenguyen opened 3 months ago
A new Issue was created by @mandrenguyen.
@Dr15Jones, @antoniovilela, @makortel, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
assign reconstruction, xpog
New categories assigned: reconstruction,xpog
@vlimant,@hqucms,@ftorrresd,@jfernan2,@mandrenguyen you have been requested to review this Pull request/Issue and eventually sign? Thanks
FYI @24LopezR @rbhattacharya04 @cms-sw/muon-pog-l2
would this be coming from https://github.com/cms-sw/cmssw/blob/master/PhysicsTools/XGBoost/src/XGBooster.cc#L92 ?
referencing #45158 as this includes recent changes in that producer
In all likelihood there's a NaN coming from https://github.com/cms-sw/cmssw/blob/master/PhysicsTools/PatAlgos/src/SoftMuonMvaRun3Estimator.cc#L30
float pullX(const MatchPair& match) {
  if (match.first and match.second->hasPhi())
    return dX(match) / sqrt(pow(match.first->xErr, 2) + pow(match.second->xErr, 2));
  else
    return 9999.;
}
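To make the NaN hypothesis concrete, here is a minimal standalone sketch of how a pull computed this way can become NaN. The `Residual` struct and the `pull` and `pullIsNaN` helpers are hypothetical stand-ins, not CMSSW types, and nothing in this thread establishes which of these cases (if any) occurred in the failing job.

```cpp
#include <cmath>
#include <limits>

// Hypothetical stand-in for the quantities used by pullX(); not the CMSSW types.
struct Residual {
  float dx;         // position difference between track extrapolation and segment
  float trackXErr;  // uncertainty on the extrapolated track position
  float segXErr;    // uncertainty on the segment position
};

// Same arithmetic as the quoted pullX(), without the validity checks.
inline float pull(const Residual& r) {
  return r.dx / std::sqrt(std::pow(r.trackXErr, 2) + std::pow(r.segXErr, 2));
}

// True when the pull is NaN. Under IEEE 754 this happens for 0/0
// (zero residual with both errors zero) or when any input is already NaN.
// A non-zero residual with zero errors gives +/-infinity, not NaN.
inline bool pullIsNaN(const Residual& r) { return std::isnan(pull(r)); }
```

Calling `pullIsNaN` on a residual whose three fields are all zero, or on one whose error field is NaN, returns true, while a non-zero `dx` with zero errors yields infinity instead.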
referencing #45158 as this includes recent changes in that producer
I am not sure I follow. That PR only entered CMSSW_14_0_10, see:
https://github.com/cms-sw/cmssw/releases/tag/CMSSW_14_0_10
I thought Tier0 is still running CMSSW_14_0_9 (the replay for the new release has not been done yet)?
Pinging @drkovalskyi since this seems to be coming from the producer from the new soft muon MVA.
I agree with Jean-Roch's assessment. If it's reproducible, I'll take a look.
Running the configuration from /eos/user/c/cmst0/public/PausedJobs/Run2024F/PATMuonProducer/job_2215101/job/WMTaskSpace/
on lxplus-9 and cmssw-el8 and skipping to the 1212th record does not crash, so the issue is apparently not reproducible. But maybe this would crash on "an actual Tier0 node" ...
Thanks Jean-Roch for checking. I'll follow up.
We think this is the same job but was retried and threw a different but similar error:
382684 2215101 PromptReco_Run382684_Tau vocms0313 T1 DE KIT (c01-016-179) CMSSW:FatalException
An exception of category 'StdException' occurred while
[0] Processing Event run: 382684 lumi: 233 event: 451943837 stream: 5
[1] Running path 'write_NANOAOD_step'
[2] Prefetching for module PoolOutputModule/'write_NANOAOD'
[3] Prefetching for module SimplePATTauFlatTableProducer/'boostedTauTable'
[4] Prefetching for module PATObjectCrossLinker/'linkedObjects'
[5] Prefetching for module PATJetRefSelector/'finalJetsPuppi'
[6] Prefetching for module PATJetUserDataEmbedder/'updatedJetsPuppiWithUserData'
[7] Prefetching for module PATJetUpdater/'updatedJetsPuppi'
[8] Prefetching for module PATJetSelector/'slimmedJetsPuppi'
[9] Prefetching for module PATJetUpdater/'updatedPatJetsTransientCorrectedSlimmedPuppiWithDeepTags'
[10] Prefetching for module BoostedJetONNXJetTagsProducer/'pfParticleNetFromMiniAODAK4PuppiCentralJetTagsSlimmedPuppiWithDeepTags'
[11] Prefetching for module ParticleNetFeatureEvaluator/'pfParticleNetFromMiniAODAK4PuppiCentralTagInfosSlimmedPuppiWithDeepTags'
[12] Prefetching for module PATMuonSlimmer/'slimmedMuons'
[13] Prefetching for module PATMuonSelector/'selectedPatMuons'
[14] Calling method for module PATMuonProducer/'patMuons'
Exception Message:
A std::exception was thrown.
Feature is not set: match2_pullX
This is the logarchive tarball:
root://eoscms.cern.ch//eos/cms/tier0/store/unmerged/data/logs/prod/2024/7/7/PromptReco_Run382684_Tau/Reco/0000/5/ede54957-55ec-41b0-8b9f-5cc62f61cff8-160-5-logArchive.tar.gz
It's the same run and event number.
Thanks,
Oli
@gutsche the two exceptions are really the same. The context information about prefetching isn't really relevant to the underlying problem as multi-threading can lead to different prefetching paths leading up to running the same module.
The exception message is a bit misleading: the check fails not because the feature was not set, but because it was set to NaN. The failure is not reproducible locally. The isnan check plays a critical role in the code: it ensures that NaNs do not reach the XGBoost code, which cannot tolerate them. I'll replace the exception with a warning that has a clearer message and return a negative value for the XGBoost discriminator. I will make a PR tomorrow.
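A rough sketch of that kind of guard is shown below, purely as an illustration and not the actual patch. The names `kInvalidMva`, `Feature`, `firstNaNFeature`, and `guardedPredict` are hypothetical, the value -1 is an assumed sentinel, and the model is passed in as a generic callable rather than the real XGBoost wrapper.

```cpp
#include <cmath>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sentinel returned when an input feature is NaN (assumed value).
constexpr float kInvalidMva = -1.f;

using Feature = std::pair<std::string, float>;

// Returns the name of the first NaN feature, or an empty string if there is none.
inline std::string firstNaNFeature(const std::vector<Feature>& features) {
  for (const auto& f : features) {
    if (std::isnan(f.second))
      return f.first;
  }
  return {};
}

// Calls `model` only when no feature is NaN; otherwise prints a warning
// naming the offending feature and returns the sentinel instead of throwing.
template <typename Model>
float guardedPredict(const std::vector<Feature>& features, Model&& model) {
  const std::string bad = firstNaNFeature(features);
  if (!bad.empty()) {
    std::cerr << "Warning: feature '" << bad << "' is NaN; returning "
              << kInvalidMva << " instead of evaluating the model\n";
    return kInvalidMva;
  }
  return model(features);
}
```

Using a value outside the normal discriminator range keeps such muons identifiable offline, which matches the intent of studying them later.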
@JanFSchulte, I don't think looking for the source of the NaN is worth the effort. The muon is clearly not good if this happens, and we have many places where NaNs are tolerated. This doesn't happen often enough to have any impact on physics results. I'll give it a distinct value so we can study it offline and confirm this.
Not throwing an exception in case of NaN is indeed a good thing. The exception should still be thrown if the variable is not set, though; can you accommodate this? Chasing an irreproducible NaN in muon reconstruction is, however, worth the effort IMO (not my call though), as this might lead to a better understanding of reconstruction reproducibility and/or architecture dependency (which I think is at play here).
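One way to keep the two cases apart is to track "never set" separately from "set to NaN". The sketch below assumes C++17 and uses a hypothetical `FeatureSet` class and `FeatureStatus` enum; it is not how XGBooster.cc stores its features, only an illustration of the distinction being asked for.

```cpp
#include <cmath>
#include <initializer_list>
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

enum class FeatureStatus { Ok, NaN };

// Hypothetical container: an empty optional means the feature was never set.
class FeatureSet {
public:
  explicit FeatureSet(std::initializer_list<std::string> names) {
    for (const auto& n : names)
      values_[n] = std::nullopt;
  }

  // Throws std::out_of_range if the feature name was never declared.
  void set(const std::string& name, float value) { values_.at(name) = value; }

  // Throws if any feature was never set (a programming error);
  // reports NaN inputs through the return value so the caller can
  // warn and return a sentinel instead of aborting the job.
  FeatureStatus validate() const {
    FeatureStatus status = FeatureStatus::Ok;
    for (const auto& [name, value] : values_) {
      if (!value.has_value())
        throw std::runtime_error("Feature is not set: " + name);
      if (std::isnan(*value))
        status = FeatureStatus::NaN;
    }
    return status;
  }

private:
  std::map<std::string, std::optional<float>> values_;
};
```

With this split, a missing feature still fails loudly, while a NaN input becomes a recoverable condition that the caller handles explicitly.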
I agree with the proposed change and that it's probably not worth it to hunt this down in detail.
T0 tried the job 10 times running on AMD machines, all of them failures. The job was successful when it ran on an Intel machine. I put the logs from this successful execution here:
/eos/user/c/cmst0/public/PausedJobs/Run2024F/PATMuonProducer/job_2215101/SuccessfulExecution
We will need O&C expertise to track down this architecture-dependent behaviour; it is not something tau or jme/btv can do by themselves.
While I agree that we should understand recent issues related to architecture, there is no reason to throw an exception in this case. I'll make a PR as soon as I have time.
Not disagreeing. We do, however, have a conundrum: throw an exception to flag that there are architecture differences, or be blind to them ... Do we have a team chasing these things down in O&C?
I tried to reproduce the problem on an AMD VM at CERN and had no issues processing it, so it's not simply Intel vs AMD; it's more involved. I'll make a physics-driven patch to avoid the exception. For the rest, I leave it to core software development to decide how to address it. Unfortunately, in my experience NaNs are often treated as acceptable numbers.
https://cms-talk.web.cern.ch/t/exception-using-patmuonproducer-in-promptreco-run382684-tau/43269
There is a new issue affecting PromptReco_Run382684_Tau. This is the error message:
----- Begin Fatal Exception 08-Jul-2024 06:52:45 UTC-----------------------
An exception of category 'StdException' occurred while
[0] Processing Event run: 382684 lumi: 233 event: 451943837 stream: 0
[1] Running path 'write_MINIAOD_step'
[2] Prefetching for module PoolOutputModule/'write_MINIAOD'
[3] Prefetching for module PATMuonSlimmer/'slimmedMuons'
[4] Prefetching for module PATMuonSelector/'selectedPatMuons'
[5] Calling method for module PATMuonProducer/'patMuons'
Exception Message:
A std::exception was thrown.
Feature is not set: match2_pullX
So far, it only affects a single job from run 382684
You can find logs and PSet for this job here:
/eos/user/c/cmst0/public/PausedJobs/Run2024F/PATMuonProducer/job_2215101/job/WMT