cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

DQM `visualization-live` and `visualization-live-secondInstance` crash during HI cosmics #46553

Closed nothingface0 closed 6 hours ago

nothingface0 commented 1 day ago

We noticed that the DQM visualization clients, `visualization-live` and `visualization-live-secondInstance`, crash during some of the latest cosmic runs, namely: 387579, 387557, 387556, 387555, 387552, 387548, 387546, 387544, 387541, 387539, 387531, 387338, 387240, 387235, 387212, 387209, 387207.

The exception is:

----- Begin Fatal Exception 29-Oct-2024 15:06:19 CET-----------------------
An exception of category 'NoProductResolverException' occurred while
   [0] Processing  Event run: 387552 lumi: 9 event: 19781392 stream: 3
   [1] Running path 'FEVToutput_step'
   [2] Prefetching for module JsonWritingTimeoutPoolOutputModule/'FEVToutput'
   [3] Calling method for module DeDxHitInfoProducer/'dedxHitInfoCosmicTF'
Exception Message:
No data of type "ClusterShapeHitFilter" with label "ClusterShapeHitFilter" in record "CkfComponentsRecord"
 Please add an ESSource or ESProducer to your job which can deliver this data.
----- End Fatal Exception -------------------------------------------------
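
For triaging many runs at once, the fields in a block like the one above can be pulled out of a client log with a small script. This is only a sketch: it assumes the log keeps exactly the layout quoted here, and the function name `parse_fatal_exception` and the `SAMPLE` string are illustrative, not part of CMSSW or the DQM tooling.

```python
import re

# Sample copied from the exception quoted in this issue.
SAMPLE = """----- Begin Fatal Exception 29-Oct-2024 15:06:19 CET-----------------------
An exception of category 'NoProductResolverException' occurred while
   [0] Processing  Event run: 387552 lumi: 9 event: 19781392 stream: 3
   [1] Running path 'FEVToutput_step'
   [2] Prefetching for module JsonWritingTimeoutPoolOutputModule/'FEVToutput'
   [3] Calling method for module DeDxHitInfoProducer/'dedxHitInfoCosmicTF'
Exception Message:
No data of type "ClusterShapeHitFilter" with label "ClusterShapeHitFilter" in record "CkfComponentsRecord"
 Please add an ESSource or ESProducer to your job which can deliver this data.
----- End Fatal Exception -------------------------------------------------
"""


def parse_fatal_exception(text):
    """Return a dict describing the first fatal-exception block, or None."""
    block = re.search(
        r"-+ Begin Fatal Exception.*?-+ End Fatal Exception", text, re.S
    )
    if block is None:
        return None
    body = block.group(0)
    info = {}

    m = re.search(r"category '([^']+)'", body)
    if m:
        info["category"] = m.group(1)

    # Event coordinates from the "[0] Processing Event ..." line.
    m = re.search(r"run: (\d+) lumi: (\d+) event: (\d+) stream: (\d+)", body)
    if m:
        keys = ("run", "lumi", "event", "stream")
        info.update(zip(keys, map(int, m.groups())))

    # The module whose method was running when the exception was thrown.
    m = re.search(r"Calling method for module (\w+)/'([^']+)'", body)
    if m:
        info["module_type"], info["module_label"] = m.groups()

    # The EventSetup product that could not be found.
    m = re.search(
        r'No data of type "([^"]+)" with label "([^"]+)" in record "([^"]+)"',
        body,
    )
    if m:
        info["data_type"], info["data_label"], info["record"] = m.groups()

    return info


info = parse_fatal_exception(SAMPLE)
print(info)
```

Running the same function over the logs of each affected run would show whether every crash points at the same module and the same missing `CkfComponentsRecord` product.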

More logs here.

The first instance was during run 387207 (24/10/2024). Not all cosmic runs lead to this behavior, however: there was no such crash during 387559, for example.

We were using CMSSW_14_1_1 and CMSSW_14_1_4_patch1 at the time of the crashes, with Global Tag 141X_dataRun3_Express_v3. We have not yet tested whether this is reproducible with 14_0_X.

Any input is appreciated.

cmsbuild commented 1 day ago

A new Issue was created by @nothingface0.

@Dr15Jones, @antoniovilela, @makortel, @mandrenguyen, @rappoccio, @sextonkennedy, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 1 day ago

assign dqm

makortel commented 1 day ago

@cms-sw/trk-dpg-l2

cmsbuild commented 1 day ago

New categories assigned: dqm

@antoniovagnerini,@nothingface0,@rvenditti,@syuvivida,@tjavaid you have been requested to review this Pull request/Issue and eventually sign? Thanks

nothingface0 commented 1 day ago

This does not seem to be reproducible with 14_0_15_patch1 and the 140X_dataRun3_Express_v3 Global Tag.

We do get lots of the following however:

%MSG
Begin processing the 14th record. Run 387552, Event 48791634, LumiSection 20 on stream 7 at 30-Oct-2024 16:02:51.222 CET
%MSG-e TooManyClusters:  CosmicSeedGenerator:cosmicseedfinderP5  30-Oct-2024 16:02:51 CET Run: 387552 Event: 48791634
Found too many clusters (379), bailing out.

%MSG
%MSG-e TooManyClusters:  SimpleCosmicBONSeeder:simpleCosmicBONSeeds  30-Oct-2024 16:02:51 CET Run: 387552 Event: 48791634
Found too many clusters (379), bailing out.

%MSG
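
To get a feel for how often these messages appear, they can be tallied per module with a few lines of Python. This is a rough sketch that assumes the `%MSG-e TooManyClusters:` header and the `Found too many clusters (N)` line always look like the excerpt above; `summarize` and `SAMPLE` are names made up for this example.

```python
import re
from collections import Counter

# Sample copied from the messages quoted in this comment.
SAMPLE = """%MSG-e TooManyClusters:  CosmicSeedGenerator:cosmicseedfinderP5  30-Oct-2024 16:02:51 CET Run: 387552 Event: 48791634
Found too many clusters (379), bailing out.
%MSG
%MSG-e TooManyClusters:  SimpleCosmicBONSeeder:simpleCosmicBONSeeds  30-Oct-2024 16:02:51 CET Run: 387552 Event: 48791634
Found too many clusters (379), bailing out.
%MSG
"""

# Header line: module type, module label, then run and event numbers.
HEADER = re.compile(
    r"^%MSG-e TooManyClusters:\s+(\w+):(\w+)\s+.*Run: (\d+) Event: (\d+)"
)
# Body line carrying the cluster count.
COUNT = re.compile(r"Found too many clusters \((\d+)\)")


def summarize(text):
    """Count TooManyClusters messages per module and record clusters per event."""
    per_module = Counter()
    clusters_by_event = {}
    current = None  # (run, event) of the most recent header line
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            module_type, label, run, event = header.groups()
            per_module[f"{module_type}:{label}"] += 1
            current = (int(run), int(event))
            continue
        count = COUNT.search(line)
        if count and current is not None:
            clusters_by_event[current] = int(count.group(1))
            current = None
    return per_module, clusters_by_event


per_module, clusters_by_event = summarize(SAMPLE)
for module, n in per_module.most_common():
    print(module, n)
```

The per-event cluster counts make it easy to check whether the messages come from a handful of busy events or from most of the run.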
mmusich commented 17 hours ago

> The exception is:

It's very likely the problem is due to https://github.com/cms-sw/cmssw/pull/45016/ @stahlleiton FYI

> We do get lots of the following however:

alas that's normal, see https://github.com/cms-sw/cmssw/pull/46283 for details.

mmusich commented 17 hours ago

https://github.com/cms-sw/cmssw/pull/46563 offers a trivial fix.