Project-MONAI / MONAI

AI Toolkit for Healthcare Imaging
https://monai.io/
Apache License 2.0
5.73k stars 1.05k forks

Monitoring system resources during training using MLFlow #7405

Open KumoLiu opened 8 months ago

KumoLiu commented 8 months ago

Discussed in https://github.com/Project-MONAI/MONAI/discussions/7404

Originally posted by **kavmar** January 18, 2024

Hi, I found a cool feature in the recent MLFlow release that lets us monitor and log system resources (GPU, CPU, memory, network, disk, ...) during training. I am using it in Engine-based training as follows:

`import mlflow as resource_monitor`
`resource_monitor.set_tracking_uri(mlflow_uri)`
`resource_monitor.set_experiment(experiment_name=exp_name)`
`resource_monitor.set_system_metrics_sampling_interval(interval=2)`
`resource_monitor.start_run(log_system_metrics=True)`
`run_name = resource_monitor.active_run().info.run_name`

and then, for validation and training alike:

`mlflow_handler = MLFlowHandler(tracking_uri=mlflow_uri, experiment_name=exp_name, run_name=run_name, ....)`

and finally, once training is done:

`resource_monitor.end_run()`

This way both the resource metrics and the training logs go to the same experiment and run. This works, but the resource_monitor part in particular follows a linear, script-style approach rather than the Engine/Event paradigm. I would love to hear whether it makes sense to think about enhancing this approach.

Thanks

PS: It might make sense to include this in the MLFlow integration tutorials.
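For discussion, here is a rough sketch of how the same calls might be moved into the Engine/Event paradigm. The class `SystemMetricsMonitor`, its `backend` argument, and the explicit `run_name` argument are hypothetical names introduced only for illustration; they are not part of MONAI or MLFlow. The sketch assumes the Ignite events `Events.STARTED` and `Events.COMPLETED` are suitable hook points, and it has not been tested together with `MLFlowHandler`.

```python
class SystemMetricsMonitor:
    """Hypothetical handler: wraps an Engine run in an MLflow run that logs system metrics."""

    def __init__(self, tracking_uri, experiment_name, run_name=None,
                 sampling_interval=2, backend=None):
        self.tracking_uri = tracking_uri
        self.experiment_name = experiment_name
        self.run_name = run_name              # optional; MLflow generates one if None
        self.sampling_interval = sampling_interval
        self._backend = backend               # assumption: injectable stand-in for the mlflow module

    def _mlflow(self):
        # Import lazily so the class can be defined without mlflow installed.
        if self._backend is None:
            import mlflow
            self._backend = mlflow
        return self._backend

    def attach(self, engine):
        # Register on the Engine's start and completion events.
        from ignite.engine import Events
        engine.add_event_handler(Events.STARTED, self.start)
        engine.add_event_handler(Events.COMPLETED, self.stop)

    def start(self, engine):
        mlflow = self._mlflow()
        mlflow.set_tracking_uri(self.tracking_uri)
        mlflow.set_experiment(experiment_name=self.experiment_name)
        mlflow.set_system_metrics_sampling_interval(self.sampling_interval)
        run = mlflow.start_run(run_name=self.run_name, log_system_metrics=True)
        self.run_name = run.info.run_name     # remember the (possibly generated) name

    def stop(self, engine):
        self._mlflow().end_run()
```

With this shape, one would call `SystemMetricsMonitor(mlflow_uri, exp_name, run_name="my-run").attach(trainer)` and pass the same `run_name` to `MLFlowHandler`. Since handlers presumably fire in the order they were attached, the monitor would likely need to be attached before `MLFlowHandler` so that the run already exists when the handler looks it up; that ordering is an assumption worth verifying.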
KumoLiu commented 8 months ago

A tutorial on MLFlow system metrics can be found here: https://github.com/chenmoneygithub/mlflow/blob/ca5ce50a5aff042a5d8b365e55b4d97934204253/docs/source/system-metrics/index.rst