Closed nicolascedilnik closed 1 year ago
The line in figure_out_what_to_submit.py is correct as it is. num_classes includes the background class. So if you have the labels liver and liver tumor, then num_classes will be 3 (background, liver, liver tumor).
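To make the counting convention concrete, here is a minimal sketch (not nnU-Net code; the labels dict is a hypothetical example in the style of a dataset.json):

```python
# Hypothetical task labels: background is part of the label map,
# so num_classes = number of foreground classes + 1.
labels = {0: "background", 1: "liver", 2: "liver tumor"}

num_classes = len(labels)  # counts background too
print(num_classes)         # 3
```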
If I modify num_batches_per_epoch accordingly to keep a similar number of cases per epoch, should I expect the same performance in terms of Dice in the final results, or am I missing something?
No, this is not how deep learning works. If you play with that you need to adapt the learning rate and pray that it still works.
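One common heuristic for "adapting the learning rate" is the linear scaling rule; this is an illustration of that general idea, not something nnU-Net does for you, and all concrete numbers below (base batch size, 250 batches per epoch) are examples:

```python
# Illustrative heuristic only: when growing the batch size, scale the
# learning rate by the same factor, and shrink num_batches_per_epoch to
# keep the number of training samples seen per epoch constant.
base_lr = 0.01               # example initial learning rate
base_batch_size = 2          # example value from a plans file
base_batches_per_epoch = 250

new_batch_size = 4
scale = new_batch_size / base_batch_size

new_lr = base_lr * scale                                      # 0.02
new_batches_per_epoch = int(base_batches_per_epoch / scale)   # 125

# Samples seen per epoch stays constant:
assert base_batch_size * base_batches_per_epoch == \
       new_batch_size * new_batches_per_epoch
```

Even with this rule there is no guarantee of equal final Dice; it is a starting point for re-tuning, which is exactly the "pray that it still works" part.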
Is there a reason why num_batches_per_epoch is hardcoded and not part of the plans file?
Yes. "Epoch" is kind of a stupid concept when you train with patches. Imagine LiTS. An image is 500x500x500 and we train on 128x128x128 patches. How do you define an epoch in this context? The simplest way that came to my mind was what I did :-) We tried adapting the number of batches per epoch, but in the end it didn't matter much, so we left it like this.
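The resulting definition of an epoch can be sketched as a fixed number of randomly sampled patch batches; this is a simplified sketch of the idea, with dummy stand-ins for the data loader and training step:

```python
# Sketch: with patch-based training there is no natural "pass over the
# dataset", so an epoch is simply a fixed number of sampled batches
# (250 in network_trainer.py).
num_batches_per_epoch = 250

def run_epoch(next_batch, train_step):
    """One 'epoch' = num_batches_per_epoch randomly sampled batches."""
    losses = []
    for _ in range(num_batches_per_epoch):
        batch = next_batch()           # e.g. random 128x128x128 crops
        losses.append(train_step(batch))
    return sum(losses) / len(losses)

# Dummy stand-ins just to show the control flow:
mean_loss = run_epoch(lambda: "patches", lambda b: 1.0)
```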
According to your expertise, do you think it's worth implementing distributed workers for data augmentation with something like Ray? Or is there just no way the network transfer won't become the bottleneck?
I do not have any expertise in this regard. We try to make sure to configure our GPU nodes with enough CPU power for data augmentation. You can also try finding a data augmentation setting that is less CPU intensive.
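A rough sizing sketch for "enough CPU power": leave a few cores for the main process and I/O and give the rest to augmentation workers. The function and its defaults below are assumptions for illustration, not nnU-Net policy (check your nnU-Net version's docs for how its worker count is actually configured):

```python
import os

def suggested_da_workers(reserved_cores: int = 2, cap: int = 12) -> int:
    """Hypothetical helper: cores left over for data augmentation,
    capped to avoid oversubscribing shared nodes."""
    cpus = os.cpu_count() or 1
    return max(1, min(cap, cpus - reserved_cores))
```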
Best,
Fabian
Thanks a lot for your answers (and again for sharing the code).
The line in figure_out_what_to_submit.py is correct as it is. num_classes includes the background class.
This is weird. I have a task with 8 classes + background and the last one is missing in this CSV file, but you are right, on another task with 4 classes it's OK. I may have broken something by fiddling too much with this. Anyway, I don't mind since I am using the summary JSONs, which are much more detailed. Meh!
I was a bit worried you would give this answer about the number of cases per batch. I guess I'll use your approach first as a baseline before trying to optimize this for my specific tasks.
Best,
-- Nicolas
Hi Nicolas, indeed. It is always best to use nnU-Net as it is first and only then start fiddling with it. That way you will know if you broke something ;-) Best, Fabian
I believe that this line
https://github.com/MIC-DKFZ/nnUNet/blob/058b695d61d34dda7f79cd36ab950a5d3e031653/nnunet/evaluation/model_selection/figure_out_what_to_submit.py#L222
should be changed to include all classes in the summary.csv file.

Every time I launch one of your scripts, I see that I should not hesitate to ask a few questions here, so here I go.
Batch size and epoch duration
If I increase the batch size by manipulating the plans files, the epoch duration scales linearly because of self.num_batches_per_epoch = 250 in network_trainer.py. If I modify num_batches_per_epoch accordingly to keep a similar number of cases per epoch, should I expect the same performance in terms of Dice in the final results, or am I missing something? Is there a reason why num_batches_per_epoch is hardcoded and not part of the plans file?

Distributing the data augmentation processes on a cluster
The CPU/GPU ratio of the Tesla V100s and A100s is quite low in major cloud providers (AWS and GCP). According to your expertise, do you think it's worth implementing distributed workers for data augmentation with something like Ray? Or is there just no way the network transfer won't become the bottleneck?
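Whether network transfer becomes the bottleneck can be estimated with a back-of-envelope calculation; every number below (patch size, batch size, link speed, step time) is an assumption chosen for illustration:

```python
# Would shipping augmented batches over the network keep a GPU fed?
# A float32 patch of 128**3 voxels is 8 MB; with batch size 2 that is
# ~16.8 MB per batch.
patch_bytes = 128 ** 3 * 4        # float32 voxels -> 8_388_608 bytes
batch_bytes = 2 * patch_bytes     # batch size 2

link_bytes_per_s = 10e9 / 8                   # assumed 10 Gbit/s link
transfer_s = batch_bytes / link_bytes_per_s   # ~0.013 s per batch
```

If one training step takes, say, ~0.3 s, a 10 Gbit/s link would not be the limiting factor for this configuration; slower links, larger patches, or bigger batches change the picture, so the answer depends on your exact setup.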