WillBrennan / SemanticSegmentation

A framework for training segmentation models in PyTorch on labelme annotations, with pretrained examples of skin, cat, and pizza-topping segmentation.
MIT License

Understanding the terminal output during training #28

Closed saskra closed 2 years ago

saskra commented 2 years ago

Can anyone help me understand the terminal output during training? I followed these steps: https://github.com/WillBrennan/SemanticSegmentation#training-new-projects I only adjusted minor things.

My output looks like this (with some truncation, line breaks and indentation on my part for clarity):

python train.py --train datasets/my_cat_images_train --val datasets/my_cat_images_val --model-tag segmentation_cat --model-type FCNResNet101
INFO:root:running training on cuda
INFO:root:creating dataset and data loaders
INFO:root:loaded 4114 annotations from datasets/my_cat_images_train
INFO:root:use augmentation: True
INFO:root:categories: ['cat']
INFO:root:loaded 184 annotations from datasets/my_cat_images_val
INFO:root:use augmentation: False
INFO:root:categories: ['cat']
INFO:root:creating dataloaders with 19 workers and a batch-size of 6
INFO:root:creating FCNResNet101 and optimizer with initial lr of 0.0001
INFO:root:creating model with categories: ['cat']
INFO:root:creating trainer and evaluator engines
INFO:root:creating summary writer with tag segmentation_cat
INFO:root:attaching lr scheduler
INFO:root:attaching event driven calls
INFO:root:training...
    INFO:ignite.engine.engine.Engine:Engine run starting with max_epochs=2.

INFO:root:epoch[1] - iteration[500/685] loss: 0.145
INFO:root:epoch: 1 - current lr: 0.0001
    INFO:ignite.engine.engine.Engine:Engine run starting with max_epochs=1.
    INFO:ignite.engine.engine.Engine:Epoch[1] Complete. Time taken: 00:00:09
    INFO:ignite.engine.engine.Engine:Engine run complete. Time taken: 00:00:09
INFO:root:loss: 0.147 precision: 0.833 recall: 0.880 IoU@0.3: 0.724 IoU@0.5: 0.748 Epoch: 1 
    INFO:ignite.engine.engine.Engine:Engine run starting with max_epochs=1.
    INFO:ignite.engine.engine.Engine:Epoch[1] Complete. Time taken: 00:00:09
    INFO:ignite.engine.engine.Engine:Engine run complete. Time taken: 00:00:09
INFO:root:loss: 0.123 precision: 0.873 recall: 0.911 IoU@0.3: 0.781 IoU@0.5: 0.804 Epoch: 1 
    INFO:ignite.engine.engine.Engine:Epoch[1] Complete. Time taken: 00:10:52

INFO:root:epoch[2] - iteration[500/685] loss: 0.416
INFO:root:epoch: 2 - current lr: 9e-05
    INFO:ignite.engine.engine.Engine:Engine run starting with max_epochs=1.
    INFO:ignite.engine.engine.Engine:Epoch[1] Complete. Time taken: 00:00:09
    INFO:ignite.engine.engine.Engine:Engine run complete. Time taken: 00:00:09
INFO:root:loss: 0.161 precision: 0.854 recall: 0.880 IoU@0.3: 0.745 IoU@0.5: 0.765 Epoch: 2 
    INFO:ignite.engine.engine.Engine:Engine run starting with max_epochs=1.
    INFO:ignite.engine.engine.Engine:Epoch[1] Complete. Time taken: 00:00:09
    INFO:ignite.engine.engine.Engine:Engine run complete. Time taken: 00:00:09
INFO:root:loss: 0.116 precision: 0.872 recall: 0.922 IoU@0.3: 0.787 IoU@0.5: 0.812 Epoch: 2 
    INFO:ignite.engine.engine.Engine:Epoch[2] Complete. Time taken: 00:10:52
    INFO:ignite.engine.engine.Engine:Engine run complete. Time taken: 00:21:45

Here's what I'd like to know about it:

- Why are two epochs run through as expected, but after the initially correct "max_epochs=2" message, "max_epochs=1" is printed several times?
- What does "Time taken: 00:00:09" refer to? The real runtime is significantly longer.
- The training outputs starting with "INFO:root:" are apparently defined in semantic_segmentation/engines.py, but where do the others ("INFO:ignite.engine.engine.Engine:") come from?

Part of my questions probably stem from the fact that I had only used TensorFlow and not PyTorch before. Thanks in advance if someone can help me understand at least some of these oddities.

WillBrennan commented 2 years ago

Hey;

Why are two epochs run through as expected, but after the initially correct "max_epochs=2" message, "max_epochs=1" is printed several times?

This happens because we run "one epoch" over each validation dataset at the end of every training epoch. The code also computes the validation metrics on the training set, to make it easier to see if it's overfitting.

What does "Time taken: 00:00:09" refer to? The real runtime is significantly longer.

That's also why you're seeing different completion times: the shorter times are how long an "epoch" over one of the validation sets takes.
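For reference, a minimal sketch of the pattern being described, assuming a typical ignite setup (the `train_step`/`eval_step` functions and the loaders below are placeholders, not the repository's actual code): one evaluator engine does a single pass over each dataset at the end of every training epoch, which is why ignite prints "max_epochs=1" and much shorter "Time taken" values for those runs.

```python
# Minimal sketch of the described pattern (illustrative placeholders, not the
# repository's exact code): one evaluator Engine does a single pass over each
# dataset at the end of every training epoch.
from ignite.engine import Engine, Events

def train_step(engine, batch):
    return {"loss": 0.0}  # placeholder; the real step runs forward/backward/optimizer

def eval_step(engine, batch):
    return {"loss": 0.0}  # placeholder; the real step runs a forward pass only

trainer = Engine(train_step)
evaluator = Engine(eval_step)

train_loader = [0, 1, 2]  # stand-ins for the real DataLoaders
val_loader = [0, 1]

@trainer.on(Events.EPOCH_COMPLETED)
def run_validation(engine):
    evaluator.run(train_loader, max_epochs=1)  # metrics on the (augmented) training set
    evaluator.run(val_loader, max_epochs=1)    # metrics on the held-out validation set

trainer.run(train_loader, max_epochs=2)
```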

The training outputs starting with "INFO:root:" are apparently defined in semantic_segmentation/engines.py, but where do the others ("INFO:ignite.engine.engine.Engine:") come from?

These other logging statements come from within ignite.

saskra commented 2 years ago

This happens because we run "one epoch" over each validation dataset at the end of every training epoch. The code also computes the validation metrics on the training set, to make it easier to see if it's overfitting. That's also why you're seeing different completion times: the shorter times are how long an "epoch" over one of the validation sets takes.

Thank you for the explanation! So in each epoch the first row of metrics refers to the training dataset and the second to the validation dataset? You could add that to the output to make it more understandable. Isn't it unusual that the result on the validation dataset is better than on the training dataset?

These other logging statements come from within ignite.

Can I suppress these outputs somehow? During longer trainings they just clutter up the terminal window.

WillBrennan commented 2 years ago

You could add that to the output to make it more understandable.

It definitely should be; it'll be a fairly simple change here if you want to make a PR: https://github.com/WillBrennan/SemanticSegmentation/blob/master/semantic_segmentation/engines.py#L60
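For illustration only, a hypothetical sketch of what such a change could look like (the `log_metrics` handler and its arguments are made up for this example, not the actual code at that line): pass a dataset tag into the metrics-logging handler so every printed row names the dataset it refers to.

```python
# Hypothetical sketch, not the repository's actual engines.py code: give the
# metrics-logging handler a dataset tag so each printed row names its dataset.
import logging

def log_metrics(engine, tag):
    # engine.state.metrics is assumed to hold numeric values such as loss and IoU
    metrics = " ".join(f"{name}: {value:.3f}" for name, value in engine.state.metrics.items())
    logging.info("%s - %s - Epoch: %s", tag, metrics, engine.state.epoch)

# attached once per dataset, for example:
#   evaluator.add_event_handler(Events.COMPLETED, log_metrics, "train")
#   evaluator.add_event_handler(Events.COMPLETED, log_metrics, "validation")
```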

Can I suppress these outputs somehow? During longer trainings they just clutter up the terminal window.

Yes, they can be suppressed. Currently this library uses the default root logger from the Python standard logging library; we should specify a separate, named logger for this library.
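Until then, a sketch of one way to quiet them from the training script, based on the logger names visible in the output above (this assumes ignite keeps using per-module loggers, which the "INFO:ignite.engine.engine.Engine:" prefix suggests):

```python
# Raise the log level of ignite's named loggers so only the "INFO:root:" lines
# from this project remain visible. The logger names come straight from the
# prefixes shown in the terminal output above.
import logging

logging.getLogger("ignite.engine.engine.Engine").setLevel(logging.WARNING)
# or, more broadly, silence everything under the "ignite" namespace:
logging.getLogger("ignite").setLevel(logging.WARNING)
```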

Isn't it unusual that the result on the validation dataset is better than on the training dataset?

Ah, to clarify: when the metrics are calculated on the training dataset, it is still performing augmentation. This is why the validation dataset performs better.

saskra commented 2 years ago

It definitely should be; it'll be a fairly simple change here if you want to make a PR: https://github.com/WillBrennan/SemanticSegmentation/blob/master/semantic_segmentation/engines.py#L60

Here it is: https://github.com/WillBrennan/SemanticSegmentation/pull/29

Ah, to clarify: when the metrics are calculated on the training dataset, it is still performing augmentation. This is why the validation dataset performs better.

Ah, now it makes sense! :-)