Originally posted by **kavmar** January 18, 2024
Hi,
I found a cool feature in a recent MLflow release that lets us monitor and log system resources (GPU, CPU, memory, network, disk, ...) during training. I am using it in Engine-based training as follows:
`import mlflow as resource_monitor`
`resource_monitor.set_tracking_uri(mlflow_uri)`
`resource_monitor.set_experiment(experiment_name=exp_name)`
`resource_monitor.set_system_metrics_sampling_interval(interval=2)`
`resource_monitor.start_run(log_system_metrics=True)`
`run_name = resource_monitor.active_run().info.run_name`
and then, for both training and validation, something like:
`mlflow_handler = MLFlowHandler(tracking_uri=mlflow_uri, experiment_name=exp_name, run_name=run_name, ....)`
`resource_monitor.end_run()`
This way, both the resource metrics and the training logs go to the same experiment and run. This works, but the resource monitoring part follows a linear, procedural approach rather than the Engine/Event paradigm.
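To make the idea concrete, here is a rough sketch of what an event-driven version might look like. `SystemMetricsHandler` is a hypothetical name, not an existing MONAI class, and the `tracker` argument is my own addition so the MLflow module can be swapped out. It only uses the MLflow calls shown above plus `end_run`, and assumes the usual Ignite `Events.STARTED` / `Events.COMPLETED` / `Events.EXCEPTION_RAISED` hooks.

```python
class SystemMetricsHandler:
    """Hypothetical handler: ties MLflow system-metrics logging to Engine events."""

    def __init__(self, tracking_uri, experiment_name, run_name=None, interval=2, tracker=None):
        self.tracking_uri = tracking_uri
        self.experiment_name = experiment_name
        self.run_name = run_name
        self.interval = interval  # sampling interval in seconds
        self.tracker = tracker    # assumption: any object exposing the mlflow module API

    def _get_tracker(self):
        # Import mlflow lazily so the class can be built without it installed.
        if self.tracker is None:
            import mlflow
            self.tracker = mlflow
        return self.tracker

    def attach(self, engine):
        # Register the callbacks on an Ignite Engine (or a MONAI trainer/evaluator).
        from ignite.engine import Events
        engine.add_event_handler(Events.STARTED, self.started)
        engine.add_event_handler(Events.COMPLETED, self.completed)
        engine.add_event_handler(Events.EXCEPTION_RAISED, self.exception_raised)

    def started(self, engine):
        # Same calls as the linear version, but triggered when the Engine starts.
        tracker = self._get_tracker()
        tracker.set_tracking_uri(self.tracking_uri)
        tracker.set_experiment(experiment_name=self.experiment_name)
        tracker.set_system_metrics_sampling_interval(self.interval)
        run = tracker.start_run(run_name=self.run_name, log_system_metrics=True)
        self.run_name = run.info.run_name

    def completed(self, engine):
        # Close the run when the Engine finishes normally.
        self._get_tracker().end_run()

    def exception_raised(self, engine, exception):
        # Mark the run as failed, then let the error propagate.
        self._get_tracker().end_run(status="FAILED")
        raise exception
```

Usage would be something like `SystemMetricsHandler(mlflow_uri, exp_name, run_name="my_run").attach(trainer)` before `trainer.run()`, with the same explicit `run_name` passed to `MLFlowHandler`. Whether `MLFlowHandler` then picks up the same run may depend on the order in which the handlers fire, so I have not verified that part.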
I would love to hear whether it makes sense to think about enhancing this approach.
Thanks
PS: It might make sense to include this in the MLflow integration tutorials.
Discussed in https://github.com/Project-MONAI/MONAI/discussions/7404