Project-MONAI / MONAI

AI Toolkit for Healthcare Imaging
https://monai.io/
Apache License 2.0
5.91k stars, 1.09k forks

Develop experiment management module #4903

Open Nic-Ma opened 2 years ago

Nic-Ma commented 2 years ago

Is your feature request related to a problem? Please describe.

To record and track training experiments clearly, an experiment management module is needed.

  1. Identify the typical user stories
  2. Identify the features we should support
  3. ~Design the module and APIs, which can easily support different backends, like MLFlow, AIM, etc.~
  4. Try to apply MLFlow in the Auto3DSeg application.

ericspod commented 2 years ago

I'm starting to write bundles which choose a new output directory every time the training script is invoked, so that runs get placed in unique locations. I want to write the logger output to a log file in that directory, but it would also be good to write out the current configuration the bundle is using, so that one can see what changed from one run to the next. This won't include any auxiliary code the bundle uses, but it would go most of the way towards keeping track of the environment that generated the data in that directory. This is also lighter weight than tools like MLFlow and would suit environments where those tools can't be used.
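To make the idea concrete, here is a minimal sketch of that kind of lightweight run tracking using only the Python standard library. It is not MONAI's bundle API: the function name `start_run`, the `run_<timestamp>` directory naming, and the file names `config.json` and `train.log` are all assumptions chosen for illustration.

```python
import json
import logging
from datetime import datetime
from pathlib import Path


def start_run(root, config):
    """Create a unique run directory, snapshot the config, and attach a file logger.

    Hypothetical helper for illustration; names and layout are assumptions.
    """
    # A microsecond-resolution timestamp gives each invocation its own directory.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S_%f")
    run_dir = Path(root) / f"run_{stamp}"
    run_dir.mkdir(parents=True, exist_ok=False)

    # Save the configuration so two runs can be diffed later.
    config_text = json.dumps(config, indent=2, sort_keys=True)
    (run_dir / "config.json").write_text(config_text)

    # Send this run's log records to a file inside the run directory.
    logger = logging.getLogger(f"run.{stamp}")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(run_dir / "train.log")
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return run_dir, logger
```

Calling `start_run("outputs", {"lr": 1e-3})` at the top of a training script would return the new directory and a logger. Comparing the `config.json` files of two run directories then shows what changed between runs, although, as noted above, any auxiliary code is not captured.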

Nic-Ma commented 2 years ago

Hi @mingxin-zheng @dongyang0122 ,

Let's start thinking about this feature request for Auto3DSeg for the next release.

Thanks in advance.

mingxin-zheng commented 2 years ago

Hi @Nic-Ma @binliunls @dongyang0122 @ericspod @wyli

Here are some of my thoughts on MLFlow for Auto3DSeg from two perspectives, user experience and implementation, for the MONAI v1.1 release. Thanks!

Nic-Ma commented 2 years ago

Hi @mingxin-zheng ,

Thanks for the proposal. I agree we should treat picture and text logging as P1 tasks. To unify the naming, we may need to predefine the names in the MLFlowExperimentManager.

Thanks.
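One way to read "predefine the names" is sketched below. This is only an illustration of the idea, not the actual `MLFlowExperimentManager` design: the classes `LogName`, `InMemoryBackend`, and `ExperimentManager`, and the specific artifact names, are all hypothetical, and the in-memory backend stands in for a real tracking backend.

```python
from enum import Enum


class LogName(str, Enum):
    """Hypothetical predefined artifact names shared by every run."""

    CONFIG_TEXT = "config.txt"
    SUMMARY_TEXT = "summary.txt"
    PREVIEW_IMAGE = "preview.png"


class InMemoryBackend:
    """Stand-in backend for illustration; it just stores text in a dict."""

    def __init__(self):
        self.texts = {}

    def log_text(self, text, artifact_file):
        self.texts[artifact_file] = text


class ExperimentManager:
    """Hypothetical manager that only accepts the predefined names."""

    def __init__(self, backend):
        self.backend = backend

    def log_text(self, name, text):
        # Rejecting free-form strings keeps artifact names consistent across runs.
        if not isinstance(name, LogName):
            raise ValueError(f"unknown log name: {name!r}")
        self.backend.log_text(text, name.value)
```

With this shape, `manager.log_text(LogName.SUMMARY_TEXT, "...")` always lands under the same artifact name, while an ad hoc string such as `"my_notes.txt"` is rejected.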

mingxin-zheng commented 2 years ago

Hi @Nic-Ma , another reason for making these P1 tasks is this MLFlow issue. I am doubtful that logging pictures and text to a remote server is supported.