**kunathj** closed this issue 2 years ago
Proposal for effective CPU usage when the monitoring loop is faster than the data influx of an ongoing run:

If there are fewer than 2 tasks on the `job_queue`, start a child `EcalMonitoring` instance from within the `EcalMonitoring` of the ongoing run. This acts as a fallback job queue where the workers can find employment. If there are more than `2 * max_workers` jobs on the backlog, gracefully exit the fallback `EcalMonitoring` and tackle the current run with the full workforce again.

When the ongoing run finishes, we should also gracefully exit the fallback `EcalMonitoring`: just finish the newly-finished run with the full workforce and be done with it. There is no more need for sitting around idle.
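The spawn/exit thresholds above could be sketched roughly as follows. This is only an illustration, not code from the repository: `maybe_toggle_fallback`, the `fallback` handle, its `graceful_exit()` method, and the `spawn_fallback` callback are all hypothetical names; only the thresholds (`2` and `2 * max_workers`) come from the proposal.

```python
def maybe_toggle_fallback(job_queue, fallback, max_workers, spawn_fallback):
    """Start or stop a fallback monitoring instance based on the backlog.

    All names here are stand-ins for the real monitoring objects:
    - job_queue: a queue.Queue of pending event-building jobs.
    - fallback: the current fallback instance, or None if none is running.
    - spawn_fallback: callable that starts a child EcalMonitoring (hypothetical).
    Returns the (possibly new, possibly cleared) fallback handle.
    """
    backlog = job_queue.qsize()
    if fallback is None and backlog < 2:
        # Workers are starving: open a fallback job queue on an older run.
        return spawn_fallback()
    if fallback is not None and backlog > 2 * max_workers:
        # The ongoing run produces enough work again: wind the fallback down.
        fallback.graceful_exit()
        return None
    return fallback
```

Calling this once per iteration of the monitoring loop would keep at most one fallback instance alive, and only while the ongoing run cannot feed all workers.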
For finished runs, the whole fallback idea seems less attractive. The price you pay in complicating things is too high to justify the speedup during bottlenecks (start-up and run wrap-up, which are single-core).
Note: For the overall `max_workers` to still be respected, we will need to pass it to the pool creator, on top of only launching `max_workers` executors.
https://github.com/SiWECAL-TestBeam/SiWECAL-TB-monitoring/blob/2f8d6fa7073c45d7b2690184be6b9b566d1cf39b/start_monitoring_run.py#L309
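A minimal sketch of what "passing it to the pool creator" could look like, assuming the workers run in a `concurrent.futures` pool; `make_monitoring_pool` and the `fallback_workers` split are hypothetical, not the repository's actual API:

```python
from concurrent.futures import ProcessPoolExecutor

def make_monitoring_pool(max_workers, fallback_workers=0):
    """Create a worker pool with an explicit cap (illustrative sketch).

    fallback_workers is a hypothetical count of workers reserved for a
    child EcalMonitoring instance, so that the parent pool and the
    fallback pool together never exceed the overall max_workers.
    """
    own_workers = max(1, max_workers - fallback_workers)
    # The cap is handed to the pool creator itself, not merely enforced
    # by how many jobs get submitted.
    return ProcessPoolExecutor(max_workers=own_workers)
```

The point is that the cap is enforced by the executor itself, so a second (fallback) instance cannot silently oversubscribe the machine.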
I still think that this proposal would improve the monitoring logic. But personally I have not really needed it so far. This issue will be closed for now, but could be reopened if there is demand for it.
Running the monitoring multiple times simultaneously is an anti-pattern. The user should rather (stop and re-)start a single monitoring loop with more workers. If the new loop was started because there is a new ongoing run (and the processing of the last run has not finished yet): Just stop the processing of the last run. You can go back to it when there is no ongoing run anymore. The intermediate state of the event building is saved, so the (time) cost for restarting is small.
A valid objection to forcing a single monitoring loop is the following scenario: on the ongoing run, the monitoring loop is faster than new `.datXXXX` files come in. Thus, some of the requested worker CPUs are idle. The next comment has a proposal for this scenario.