Project-MONAI / MONAI

AI Toolkit for Healthcare Imaging
https://monai.io/
Apache License 2.0

training stats event-handler (for printing training losses) #7

Closed wyli closed 4 years ago

Nic-Ma commented 4 years ago

Hi @yanchengnv and @wyli ,

I found a similar example in @ericspod's notebook example:

    @trainer.on(Events.EPOCH_COMPLETED)
    def log_training_loss(engine):
        print("Epoch", engine.state.epoch, "Loss:", engine.state.output)

For this task, do you mean something like this? Thanks.

ericspod commented 4 years ago

We would want something a bit more generalized, with optional outputs for end of iteration, end of step, end of train, etc. We ought to be logging to the engine's log object instead of printing to stdout; we can of course add a printing handler to that log object to do both.
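A minimal sketch of the "log object plus printing handler" idea, using only Python's standard `logging` module. The helper name `add_printing_handler` and the logger name `trainer` are assumptions made for illustration; they are not part of MONAI or ignite.

```python
import logging
import sys


def add_printing_handler(logger, level=logging.INFO):
    """Attach a stdout handler to `logger` unless one is already attached."""
    for handler in logger.handlers:
        if isinstance(handler, logging.StreamHandler) and handler.stream is sys.stdout:
            return handler  # already printing to stdout, nothing to do

    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(logging.Formatter("%(name)s %(levelname)s: %(message)s"))
    logger.addHandler(handler)
    logger.setLevel(level)
    return handler


# "trainer" is a placeholder name (assumption); any logging.Logger works here.
logger = logging.getLogger("trainer")
add_printing_handler(logger)

# Event handlers would call the logger instead of print().
logger.info("Epoch %d Loss: %.4f", 1, 0.1234)
```

Since ignite's `Engine` exposes a `logger` attribute that is a standard `logging.Logger`, the same helper could presumably be applied to `engine.logger`; other handlers (for example a file handler) can be attached alongside it.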

Nic-Ma commented 4 years ago

Hi @ericspod ,

Thanks for your suggestion about a general StatsLogger. Instead of engine.log, can we put all the useful outputs into engine.state.metrics? We can also get "iteration", "epoch", "max_epochs", "epoch_length", etc. from engine.state. I think engine.state is designed to be a unified API for storing useful information for all kinds of event handlers. If we are aligned on this direction, I can try to make a PR based on engine.state. Thanks.
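A rough sketch of what a handler that reads everything from engine.state could look like. The class name `StatsLogger`, its `output_transform` parameter, and the message format are assumptions for illustration, not an agreed design. A stand-in object replaces the ignite engine so the sketch runs without ignite installed.

```python
import logging
from types import SimpleNamespace


class StatsLogger:
    """Hypothetical handler that logs training stats read from engine.state."""

    def __init__(self, logger=None, output_transform=lambda x: x):
        self.logger = logger if logger is not None else logging.getLogger("stats")
        # Maps engine.state.output to a scalar loss (identity by default).
        self.output_transform = output_transform

    def format_stats(self, engine):
        state = engine.state
        parts = [f"Epoch {state.epoch}/{state.max_epochs}", f"Iter {state.iteration}"]
        loss = self.output_transform(state.output)
        if loss is not None:
            parts.append(f"Loss: {loss:.4f}")
        # Metrics are kept separate from the loss and listed in sorted order.
        for name, value in sorted(getattr(state, "metrics", {}).items()):
            parts.append(f"{name}: {value:.4f}")
        return ", ".join(parts)

    def __call__(self, engine):
        self.logger.info(self.format_stats(engine))


# Stand-in for an ignite engine (assumption), exposing only the fields used above.
fake_engine = SimpleNamespace(
    state=SimpleNamespace(
        epoch=2, max_epochs=10, iteration=50, output=0.25, metrics={"dice": 0.8}
    )
)
print(StatsLogger().format_stats(fake_engine))
# prints: Epoch 2/10, Iter 50, Loss: 0.2500, dice: 0.8000
```

With ignite, an instance could be registered with `trainer.add_event_handler(Events.EPOCH_COMPLETED, StatsLogger())`, and again for other events if per-iteration output is wanted.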

vfdev-5 commented 4 years ago

Sorry for jumping into your conversation. I would just like to help and to clarify what ignite provides out of the box for that:

In my experience with experiment tracking systems like MLflow or Polyaxon, we can either log to the system via their API (using ignite's wrappers like MLflowLogger or PolyaxonLogger), write events to TensorBoard, or simply print values to stdout, which is automatically written to a log file. The first and second approaches are obviously more interesting if we would like to compare different runs etc.

HTH

Nic-Ma commented 4 years ago

Hi @vfdev-5 ,

Thanks very much for your detailed sharing! I will take a deep dive into your examples.

And @wyli @ericspod @yanchengnv ,

About the usage of Ignite, have we agreed to use only the official Ignite code, or both the official code and the 3rd-party contrib code? Thanks.

ericspod commented 4 years ago

@vfdev-5 @Nic-Ma I've used the log file just for logging messages and such. The SessionSaver class in ptproto creates a new directory in a given parent directory for every new run and sends the log to a file there, along with the checkpoints and saved networks. My subclasses of Engine add extra fields to the state, and we could add more things to it, but I would think that metrics should hold only the output from metric handlers and nothing else. Returning to the idea of session handling: if we're saving the whole engine state (or everything except large tensors), then these other things we add will get saved as well.
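The per-run directory idea can be sketched with the standard library alone. This is not the SessionSaver code from ptproto; the function name `start_session`, the directory naming scheme, the `train.log` file name, and the `monai_sessions` parent directory are all assumptions for illustration.

```python
import logging
import tempfile
from datetime import datetime
from pathlib import Path


def start_session(parent_dir, prefix="run"):
    """Create a fresh directory for this run and a logger that writes into it."""
    parent = Path(parent_dir)
    parent.mkdir(parents=True, exist_ok=True)

    # Timestamp for readability; mkdtemp appends a random suffix so that
    # two runs started in the same second still get distinct directories.
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    session_dir = Path(tempfile.mkdtemp(prefix=f"{prefix}_{stamp}_", dir=str(parent)))

    logger = logging.getLogger(f"session.{session_dir.name}")
    logger.setLevel(logging.INFO)
    handler = logging.FileHandler(str(session_dir / "train.log"))
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s: %(message)s"))
    logger.addHandler(handler)
    return session_dir, logger


# Placeholder parent directory (assumption): a folder under the system temp dir.
session_dir, logger = start_session(Path(tempfile.gettempdir()) / "monai_sessions")
logger.info("Checkpoints and saved networks for this run would also go in %s", session_dir)
```

Keeping the log, checkpoints, and saved state in one directory per run means a run can be inspected or compared later without relying on stdout.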

ericspod commented 4 years ago

@vfdev-5 One thing to mention is that tqdm doesn't play well with JupyterLab for some reason; I believe it's a known bug. I had written a super primitive text progress bar that works, but I don't know if we collectively want to investigate anything else. I really like doing things through Jupyter a lot, so stuff that doesn't rely on tensorboard/visdom is what I would prefer.

fepegar commented 4 years ago

There is a tqdm_notebook that works quite well: https://pypi.org/project/tqdm/#ipython-jupyter-integration

ericspod commented 4 years ago

@fepegar I think that does have issues with JupyterLab; vanilla Jupyter Notebook, I think, is fine. I don't know why, but they're different.

fepegar commented 4 years ago

Yes, I've had trouble before on JupyterLab. But I think installing the widgets extension solves it: https://ipywidgets.readthedocs.io/en/latest/user_install.html#installing-the-jupyterlab-extension

vfdev-5 commented 4 years ago

@ericspod I'm also using jupyterlab for development, it provides a cool environment for research/prototyping/testing etc.

"so stuff that doesn't rely on tensorboard/visdom is what I would prefer."

However, how do you plan to run and then organize and compare various training runs for the same task?

ericspod commented 4 years ago

@fepegar I thought I had tried that and it didn't fix the issue; maybe it didn't load correctly for me? I'll try again.

@vfdev-5 That is something that I wasn't doing in a great way so definitely we should be targeting ways of supporting lab and tensorboard/visdom.

pdogra89 commented 4 years ago

Yan - Set up time to discuss the design choice here.