Nic-Ma opened 2 years ago
I'm starting to write bundles which choose a new output directory every time the training script is invoked, so that runs get placed in unique locations. I want to direct the loggers to a log file in that directory, but it would also be good to write out the current configuration the bundle is using, so that one can see what was changed from one run to the next. This won't include any auxiliary code the bundle uses, but it would get most of the way toward keeping track of what environment generated the data in that directory. This is also lighter weight than tools like MLflow and would suit environments where such tools can't be used.
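As a rough sketch of that idea, assuming a bundle whose config is read via `ConfigParser` (the directory layout and config path below are illustrative only):

```python
# Illustrative sketch: choose a unique run directory per invocation and
# write the resolved bundle config next to the run's logs.
import os
from datetime import datetime

from monai.bundle import ConfigParser

# unique output directory per invocation
run_dir = os.path.join("runs", datetime.now().strftime("%Y%m%d_%H%M%S"))
os.makedirs(run_dir, exist_ok=True)

parser = ConfigParser()
parser.read_config("configs/train.json")  # hypothetical bundle config path

# record what this run actually used, for comparison across runs
ConfigParser.export_config_file(parser.get(), os.path.join(run_dir, "config_used.json"))
```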
Hi @mingxin-zheng @dongyang0122,
Let's start thinking about this feature request for Auto3DSeg for the next release.
Thanks in advance.
Hi @Nic-Ma @binliunls @dongyang0122 @ericspod @wyli
Here are some thoughts of mine about MLFlow for Auto3DSeg from two perspectives, user experience and implementation, for the MONAI v1.1 release. Thanks!
User Experience:

- Two new arguments are introduced: `train_local` and `tracking_url`. If the user wants to run all trainings locally, `train_local` should be `True` and `tracking_url` should be set to `'localhost'`. The MLFlow server will then start locally, immediately after the BundleGen/AlgoGen step. If it is not meant to be local, a message will be printed for the user to start the service remotely; it is the user's job to run the server on a remote machine.
- Users call `algo.train()` to start trainings with experiment management ON or OFF, passing arguments such as `enable_mlflow`, `tracking_url`, `experiment_name`, `params`, `metrics` and so on. Optionally, they can use `algo._create_cmd()` to see the command to run. Below are some drafts of the MLFlow-related arguments for the training to take (a usage sketch follows the list):
  - `enable_mlflow`: use mlflow as the backend
  - `tracking_url`: use localhost or a remote IP address for the mlflow server
  - `experiment_name`: required by mlflow
  - `params`: a set of keys to log in training (before the iterations)
  - `metrics`: a set of keys to log in training (during the iterations)
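To make the intended user experience concrete, here is a minimal sketch. `BundleGen`, `generate`, and `get_history` exist in MONAI today; the MLFlow-related keyword arguments are the drafts proposed above, not an existing API:

```python
# Sketch of the proposed user experience; the MLFlow-related arguments
# (train_local, tracking_url, enable_mlflow, params, metrics) are drafts
# from this proposal and do not exist in MONAI yet.
from monai.apps.auto3dseg import BundleGen

bundle_generator = BundleGen(
    algo_path="./work_dir",
    data_stats_filename="./datastats.yaml",
    data_src_cfg_name="./task.yaml",
    # proposed: start a local MLFlow server right after generation
    # train_local=True,
    # tracking_url="localhost",
)
bundle_generator.generate(output_folder="./work_dir")

for task in bundle_generator.get_history():  # history layout may vary by MONAI version
    for _, algo in task.items():
        # proposed: per-training experiment-management switches
        algo.train(
            enable_mlflow=True,
            tracking_url="localhost",
            experiment_name="auto3dseg-task01",
            params=["max_epochs", "learning_rate"],   # logged once per run
            metrics=["val_mean_dice", "train_loss"],  # logged every epoch
        )
```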
Implementation

- Define a base class `ExperimentManager`, with `MLFlowExperimentManager` as the only subclass in MONAI 1.1.
- `MLFlowExperimentManager` can initiate the server locally and record where it keeps the database. It can print a helper message if the server is to be started remotely. (Should the local server use SQLite as the backend?)
- `MLFlowExperimentManager` manages `experiment_name` and `run_name`.
- `MLFlowExperimentManager` manages a list of `params` names to log. About `log_params` in mlflow:
  > log_param and log_params, for logging anything that is "one-time" for each experiment run, including model parameters and other hyperparameters. An error will be thrown if the same parameter name is logged more than once in the same run.
- `MLFlowExperimentManager` manages another list of `metrics` names to log. About `log_metrics` in mlflow:
  > log_metric and log_metrics, for logging numerical values during training. Epoch numbers need to be specified; otherwise MLFlow will report a conflict error.
- The names of the logged keys must match the variable names in the script. For example, to log `max_epochs` in the param buffer, the variable in `train.py` has to be `max_epochs`; it can't be `total_epochs` or `num_epochs`.
- `params` and `metrics` are logged during the running of `train.py`. If a key is the name of a variable value, it triggers the `mlflow.log_metrics` or `mlflow.log_params` call wrapped inside the `MLFlowExperimentManager`. A sketch of this class design follows the list.
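A minimal sketch of the proposed class design, assuming the responsibilities above; `ExperimentManager` and `MLFlowExperimentManager` are proposed names from this discussion, not existing MONAI classes, while the `mlflow.*` calls are the real MLFlow API:

```python
# Sketch only: the class and method names are drafts from this proposal.
from abc import ABC, abstractmethod

import mlflow


class ExperimentManager(ABC):
    """Proposed base class for experiment-management backends."""

    @abstractmethod
    def log_params(self, params: dict) -> None:
        ...

    @abstractmethod
    def log_metrics(self, metrics: dict, step: int) -> None:
        ...


class MLFlowExperimentManager(ExperimentManager):
    """Proposed MLFlow backend: wraps mlflow so train.py only passes dicts."""

    def __init__(self, tracking_url: str, experiment_name: str, run_name: str):
        mlflow.set_tracking_uri(tracking_url)  # e.g. "http://localhost:5000"
        mlflow.set_experiment(experiment_name)
        self._run = mlflow.start_run(run_name=run_name)

    def log_params(self, params: dict) -> None:
        # One-time values per run; mlflow raises an error if the same
        # key is logged again with a different value within a run.
        mlflow.log_params(params)

    def log_metrics(self, metrics: dict, step: int) -> None:
        # step (the epoch number) disambiguates repeated keys, avoiding
        # the conflict error quoted above.
        mlflow.log_metrics(metrics, step=step)

    def close(self) -> None:
        mlflow.end_run()
```

In `train.py` this would be driven by the user-supplied key lists, e.g. `manager.log_params({"max_epochs": max_epochs})` before the training loop and `manager.log_metrics({"val_mean_dice": dice}, step=epoch)` inside it, which is why the variable names must match the configured keys.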
Resources
Hi @mingxin-zheng,
Thanks for the proposal, I agree we should put picture/text logging as a P1 task.
And to unify the naming, we may need to predefine the keys in the `MLFlowExperimentManager`, for instance (the names below are hypothetical):
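```python
# Hypothetical predefined key sets to unify naming across algos; the
# exact names are an assumption, not settled by this discussion.
SUPPORTED_PARAMS = ("max_epochs", "learning_rate", "num_images_per_batch")
SUPPORTED_METRICS = ("train_loss", "val_mean_dice")
```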
Thanks.
Hi @Nic-Ma, another reason for making it a P1 task is this MLFlow issue. I am doubtful about the support for logging pictures and texts on a remote server.
Is your feature request related to a problem? Please describe.
To record and track the training experiments clearly, experiment management is a necessary module. We may consider integrating MLFlow in the Auto3DSeg application.