cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.07k stars 4.29k forks source link

L1RCTProducer Behavior in beginLuminosityBlock #43697

Open wddgit opened 8 months ago

wddgit commented 8 months ago

L1RCTProducer behaves in an unusual way in beginLuminosityBlock. There are configuration options that allow it to call updateFedVector every nth lumi or on the nth lumi (defaults are every 100th lumi and the 10th). This is dependent on EventSetup data, probably there are better and more standard ways to do this.

Our motivation in asking about this behavior is related to a proposed performance improvement to the Framework that will allow a stream to skip a lumi if all the events were already processed by other streams. See draft PR #43522. This will help significantly if one stream encounters a very slow event. That stream will not be able to finish the lumi it is processing or any subsequent lumis until that event is done. It is possible for the system to reach the maximum concurrent lumi limit and all streams be stuck until the extremely slow event completes. With the improvement, other streams will be able to finish subsequent lumis and close them and keep working productively.

For most modules, this improvement does not seem to cause any problems. We have been surveying the code looking for modules that might have a problem when a stream skips a lumi (skips the stream begin lumi and stream end lumi transitions). L1RCTProducer would have a problem if some streams skipped a lumi. For example, say on lumi 400 updateFedVector was called in beginLuminosityBlock but some stream skipped lumi 400. Then that stream would get stale results out of updateFedVector until lumi 500.

One other comment. It is possible this unusual behavior is for performance, but it may be counterproductive. As written, updateFedVector gets called for every stream. I can imagine more standard approaches using the EventSetup where it would only be called once when the underlying EventSetup products change.

Here are some questions. Why is the L1RCTProducer written in this way? Can it be changed so it will still work properly if stream are allowed to skip a lumi if all events were already processed by other streams?

FYI @makortel

cmsbuild commented 8 months ago

cms-bot internal usage

cmsbuild commented 8 months ago

A new Issue was created by @wddgit W. David Dagenhart.

@smuzaffar, @Dr15Jones, @makortel, @rappoccio, @sextonkennedy, @antoniovilela can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

makortel commented 8 months ago

assign l1

cmsbuild commented 8 months ago

New categories assigned: l1

@epalencia,@aloeliger you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 8 months ago

@cms-sw/l1-l2 Any thoughts?

aloeliger commented 8 months ago

@pallabidas could you tag Sascha's github handle if you know it? This should be brought to the calo trigger people.

However @makortel I think that it is possible (emphasis on think though, just based on the history and my personal unfamiliarity with this module) that the code being used here is no longer in use for the current trigger system or for phase 2 RCT/GCT, and isn't a part of any current L1 emulation. This may be legacy code so there shouldn't really be an issue other than any desired backwards compatibility.

pallabidas commented 8 months ago

@asavincms

asavincms commented 8 months ago

I think this is old stage1 code that is not used anymore, I remember this discussion even for new - Layer1 code where it was fixed. Can I see the link to the code in question?

wddgit commented 8 months ago

https://cmssdt.cern.ch/dxr/CMSSW/source/L1Trigger/RegionalCaloTrigger/plugins/L1RCTProducer.cc#101

This is the troublesome part of the code, where it might call updateFedVector every 100th lumi. That will interact badly with say one stream skipping a lumi when the other streams already processed all the events.

wddgit commented 8 months ago

I'd be happy to delete the module entirely. It will still remain in the git history.

asavincms commented 8 months ago

Please go ahead

wddgit commented 8 months ago

Deleting the module is not as easy as I thought.

L1Trigger/RegionalCaloTrigger/python/rctDigis_cfi.py uses the type.

That get imported by these:

L1Trigger/Configuration/python/L1CaloEmulator_cff.py L1Trigger/Configuration/python/ValL1Emulator_cff.py L1Trigger/L1TCalorimeter/python/simDigis_cff.py

Those in turns get imported by these:

L1Trigger/Configuration/python/L1Emulator_cff.py L1Trigger/GlobalCaloTrigger/test/writeGctDigis_cfg.py

DQMOffline/L1Trigger/python/L1TriggerDqmOffline_cff.py L1Trigger/HardwareValidation/python/L1HardwareValidation_cff.py

DQM/L1TMonitor/python/L1TStage2Emulator_cff.py L1Trigger/Configuration/python/SimL1CaloEmulator_cff.py L1Trigger/Configuration/python/SimL1Emulator_cff.py

And quite a few things import those at high levels and appear to be actually in use. I didn't test, but it looks like this may be hard to actually delete and possible something really needs it.

wddgit commented 8 months ago

Here is an alternate suggestion.

There is a configuration parameter named getFedsFromOmds set here:

https://cmssdt.cern.ch/dxr/CMSSW/source/L1Trigger/RegionalCaloTrigger/plugins/L1RCTProducer.cc#42

It is never reset in the C++ code. In almost all places, it is always configured false (at least in the CMSSW repository). The one exception is

https://cmssdt.cern.ch/dxr/CMSSW/source/DQM/L1TMonitor/test/l1temulator_dqm_sourceclient-file_cfg.py#146

It that test configuration ever used? The code that is causing problems is only executed if that configuration parameter is set true.

https://cmssdt.cern.ch/dxr/CMSSW/source/L1Trigger/RegionalCaloTrigger/plugins/L1RCTProducer.cc#96

If the parameter is never true in practice, then I could delete only the code that is executed when it is true and leave the rest unchanged.

wddgit commented 8 months ago

One other thing I noticed. The test configuration is a single stream configuration. If there isn't more than one stream, then there isn't a problem. It is ugly to leave a hole like this for someone to fall into, but if this is really the only place this is used it is OK as is.

asavincms commented 8 months ago

ok for me

wddgit commented 8 months ago

I submitted #43779 to deal with this. It will cause an exception with a descriptive message to thrown if anyone tries to run the module with the configuration option getFedsFromOmds set true. Someday in the future, if no problems are reported someone can clean this up and delete all the unnecessary code.

makortel commented 8 months ago

https://github.com/cms-sw/cmssw/pull/43779 was merged, so I guess we could close this issue?