Monitoring system resources during training using MLFlow

Discussed in https://github.com/Project-MONAI/MONAI/discussions/7404

Originally posted by **kavmar** January 18, 2024

Hi, I found a cool feature in the recent MLFlow release that lets us monitor and log system resources (GPU, CPU, memory, network, HDD, ...) during training. I am using it in Engine-based training as follows:

`import mlflow as resource_monitor`
`resource_monitor.set_tracking_uri(mlflow_uri)`
`resource_monitor.set_experiment(experiment_name=exp_name)`
`resource_monitor.set_system_metrics_sampling_interval(interval=2)`
`resource_monitor.start_run(log_system_metrics=True)`
`run_name = resource_monitor.active_run().info.run_name`

and then, for validation and training alike:

`mlflow_handler = MLFlowHandler(tracking_uri=mlflow_uri, experiment_name=exp_name, run_name=run_name, ....)`

and, once training has finished:

`resource_monitor.end_run()`

This way both the resource metrics and the training logs go to the same experiment and run. In a way this suffices, but the resource_monitor part follows a linear approach rather than the Engine/Event paradigm. I would love to hear whether it makes sense to think about enhancing this approach.

Thanks

PS: It might make sense to include this in the MLFlow integration tutorials.
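
For readers who want to try the workflow described in the post, the setup and teardown calls can be grouped into a context manager so that the run is always closed, even if training raises. This is only a minimal sketch, not a MONAI or MLflow API: the helper name `system_metrics_run` and its `tracker` parameter are hypothetical, and `tracker` is assumed to be the `mlflow` module (or any object exposing the same functions used in the post, plus `end_run`).

```python
from contextlib import contextmanager


@contextmanager
def system_metrics_run(tracker, tracking_uri, experiment_name, sampling_interval=2):
    """Open a run with system-metrics logging and yield its run name.

    `tracker` is assumed to be the `mlflow` module; it is passed in as an
    argument (a hypothetical design choice) so the helper has no hard
    dependency on MLflow being importable.
    """
    # Same configuration calls as in the post above.
    tracker.set_tracking_uri(tracking_uri)
    tracker.set_experiment(experiment_name=experiment_name)
    tracker.set_system_metrics_sampling_interval(interval=sampling_interval)
    run = tracker.start_run(log_system_metrics=True)
    try:
        # The run name can be handed to MLFlowHandler so that training
        # logs and resource metrics land in the same run.
        yield run.info.run_name
    finally:
        # Always close the run, whether training succeeded or failed.
        tracker.end_run()
```

With MLflow installed, this might be used as `with system_metrics_run(mlflow, mlflow_uri, exp_name) as run_name:`, creating the `MLFlowHandler` instances and running the trainer inside the `with` body. Note that this is still the linear approach described in the post; it does not hook into the Engine/Event paradigm.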

Monitoring system resources during training using MLFlow #7405