Closed nicolascedilnik closed 1 year ago
The line in figure_out_what_to_submit.py is correct as it is. num_classes includes the background class. So if you have the labels liver and liver tumor, then num_classes will be 3 (background, liver, liver tumor).
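To make the counting convention concrete, here is a minimal sketch (not nnU-Net code; the labels dict is a hypothetical example in the style of a dataset.json):

```python
# Hypothetical task labels: background is part of the label map,
# so num_classes = number of foreground classes + 1.
labels = {0: "background", 1: "liver", 2: "liver tumor"}

num_classes = len(labels)  # counts background too
print(num_classes)         # 3
```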
If I modify num_batches_per_epoch accordingly to keep a similar number of cases per epoch, should I expect the same performance in terms of Dice in the final results, or am I missing something?
No, this is not how deep learning works. If you play with that you need to adapt the learning rate and pray that it still works.
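One common heuristic for "adapting the learning rate" is the linear scaling rule; this is an illustration of that general idea, not something nnU-Net does for you, and all concrete numbers below (base batch size, 250 batches per epoch) are examples:

```python
# Illustrative heuristic only: when growing the batch size, scale the
# learning rate by the same factor, and shrink num_batches_per_epoch to
# keep the number of training samples seen per epoch constant.
base_lr = 0.01               # example initial learning rate
base_batch_size = 2          # example value from a plans file
base_batches_per_epoch = 250

new_batch_size = 4
scale = new_batch_size / base_batch_size

new_lr = base_lr * scale                                      # 0.02
new_batches_per_epoch = int(base_batches_per_epoch / scale)   # 125

# Samples seen per epoch stays constant:
assert base_batch_size * base_batches_per_epoch == \
       new_batch_size * new_batches_per_epoch
```

Even with this rule there is no guarantee of equal final Dice; it is a starting point for re-tuning, which is exactly the "pray that it still works" part.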
Is there a reason why num_batches_per_epoch is hardcoded and not part of the plans file?
Yes. "Epoch" is kind of a stupid concept when you train with patches. Imagine LiTS. An image is 500x500x500 and we train on 128x128x128 patches. How do you define an epoch in this context? The simplest way that came to my mind was what I did :-) We tried adapting the number of batches per epoch, but in the end it didn't matter much, so we left it like this.
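The resulting definition of an epoch can be sketched as a fixed number of randomly sampled patch batches; this is a simplified sketch of the idea, with dummy stand-ins for the data loader and training step:

```python
# Sketch: with patch-based training there is no natural "pass over the
# dataset", so an epoch is simply a fixed number of sampled batches
# (250 in network_trainer.py).
num_batches_per_epoch = 250

def run_epoch(next_batch, train_step):
    """One 'epoch' = num_batches_per_epoch randomly sampled batches."""
    losses = []
    for _ in range(num_batches_per_epoch):
        batch = next_batch()           # e.g. random 128x128x128 crops
        losses.append(train_step(batch))
    return sum(losses) / len(losses)

# Dummy stand-ins just to show the control flow:
mean_loss = run_epoch(lambda: "patches", lambda b: 1.0)
```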
According to your expertise, do you think it's worth implementing distributed workers for data augmentation with something like Ray? Or is there just no way the network transfer won't become the bottleneck?
I do not have any expertise in this regard. We try to make sure to configure our GPU nodes with enough CPU power for data augmentation. You can also try finding a data augmentation setting that is less CPU intensive.
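A rough sizing sketch for "enough CPU power": leave a few cores for the main process and I/O and give the rest to augmentation workers. The function and its defaults below are assumptions for illustration, not nnU-Net policy (check your nnU-Net version's docs for how its worker count is actually configured):

```python
import os

def suggested_da_workers(reserved_cores: int = 2, cap: int = 12) -> int:
    """Hypothetical helper: cores left over for data augmentation,
    capped to avoid oversubscribing shared nodes."""
    cpus = os.cpu_count() or 1
    return max(1, min(cap, cpus - reserved_cores))
```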
Best,
Fabian
Thanks a lot for your answers (and again for sharing the code).
The line in figure_out_what_to_submit.py is correct as it is. num_classes includes the background class.
This is weird. I have a task with 8 classes + background and the last one is missing in this CSV file, but you are right, on another task with 4 classes it's OK. I may have broken something by fiddling too much with this. Anyway, I don't mind since I am using the summary JSONs, which are much more detailed. Meh!
I was a bit worried you would give this answer about the number of cases per batch. I guess I'll use your approach first as a baseline before trying to optimize this for my specific tasks.
Best,
-- Nicolas
Hi Nicolas, indeed. It is always best to use nnU-Net as it is first and only then start fiddling with it. That way you will know if you broke something ;-) Best, Fabian
I believe that this line
https://github.com/MIC-DKFZ/nnUNet/blob/058b695d61d34dda7f79cd36ab950a5d3e031653/nnunet/evaluation/model_selection/figure_out_what_to_submit.py#L222
should be changed to include all classes in the summary.csv file.

Every time I launch one of your scripts, I see that I should not hesitate to ask a few questions here, so here I go.
Batch size and epoch duration
If I increase the batch size by manipulating the plans files, the epoch duration scales linearly because of self.num_batches_per_epoch = 250 in network_trainer.py. If I modify num_batches_per_epoch accordingly to keep a similar number of cases per epoch, should I expect the same performance in terms of Dice in the final results, or am I missing something? Is there a reason why num_batches_per_epoch is hardcoded and not part of the plans file?

Distributing the data augmentation processes on a cluster
The CPU/GPU ratio of the Tesla V100s and A100s is quite low in major cloud providers (AWS and GCP). According to your expertise, do you think it's worth implementing distributed workers for data augmentation with something like Ray? Or is there just no way the network transfer won't become the bottleneck?
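Whether network transfer becomes the bottleneck can be estimated with a back-of-envelope calculation; every number below (patch size, batch size, link speed, step time) is an assumption chosen for illustration:

```python
# Would shipping augmented batches over the network keep a GPU fed?
# A float32 patch of 128**3 voxels is 8 MB; with batch size 2 that is
# ~16.8 MB per batch.
patch_bytes = 128 ** 3 * 4        # float32 voxels -> 8_388_608 bytes
batch_bytes = 2 * patch_bytes     # batch size 2

link_bytes_per_s = 10e9 / 8                   # assumed 10 Gbit/s link
transfer_s = batch_bytes / link_bytes_per_s   # ~0.013 s per batch
```

If one training step takes, say, ~0.3 s, a 10 Gbit/s link would not be the limiting factor for this configuration; slower links, larger patches, or bigger batches change the picture, so the answer depends on your exact setup.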