SiWECAL-TestBeam / SiWECAL-TB-monitoring

Event-based almost-online monitoring for the SiW ECAL
GNU General Public License v3.0
0 stars 3 forks source link

Ensure having a single monitoring loop only #14

Closed kunathj closed 2 years ago

kunathj commented 2 years ago

Running the monitoring multiple times simultaneously is an anti-pattern. The user should rather (stop and re-)start a single monitoring loop with more workers. If the new loop was started because there is a new ongoing run (and the processing of the last run has not finished yet): Just stop the processing of the last run. You can go back to it when there is no ongoing run anymore. The intermediate state of the event building is saved, so the (time) cost for restarting is small.

A valid objection to forcing a single monitoring loop is the following scenario: On the ongoing run, the monitoring loop is faster than there are incoming new .datXXXX files. Thus, some of the requested worker CPUs are idle. The next comment has a proposal for this scenario.

kunathj commented 2 years ago

Proposal for effective CPU usage when the monitoring loop is faster than the data influx of an ongoing run: If there are less than 2 tasks on the job_queue, start a child EcalMonitoring instance from within the EcalMonitoring of the ongoing run. This is a sort of fallback job queue for the workers to find employment. If there are more than 2 * max_workers jobs on the backlog, gracefully exit the fallback EcalMonitoring and tackle the current run with the full workforce again. When the ongoing run finished, we should also gracefully exit the fallback EcalMonitoring. Just finish the newly-finished run with full work force and be done with it. There is no more needed for sitting around idle.

For finished runs, the whole fallback idea seems less attractive. The price you pay in complicating things is too high to justify the speedup during bottlenecks (start up, run wrap up which are single core).

Note: For the overall max_workers to still be respect, we will need to pass it to the pool creator, on top of only launching max_workers executors. https://github.com/SiWECAL-TestBeam/SiWECAL-TB-monitoring/blob/2f8d6fa7073c45d7b2690184be6b9b566d1cf39b/start_monitoring_run.py#L309

kunathj commented 2 years ago

I still think that this proposal would improve the monitoring logic. But personally I have not really needed it so far. This issue will be closed for now, but could be reopened if there is demand for it.