MIC-DKFZ / nnDetection

nnDetection is a self-configuring framework for 3D (volumetric) medical object detection which can be applied to new data sets without manual intervention. It includes guides for 12 data sets that were used to develop and evaluate the performance of the proposed method.
Apache License 2.0

Questions about MLFlow, LR, and TTA #264

Open dzeego opened 1 month ago

dzeego commented 1 month ago

❓ Question

Hello @mibaumgartner,

If you may, I have 3 different questions:

  1. Tracking system metrics in MLflow for nnDetection: How can I best set up MLflow's system metric monitoring within nnDetection to track CPU, GPU, and memory usage?

  2. Learning rate behavior when resuming training: When I resume the training of a model (mode: "resume"), does the learning rate stay at the value it was when training stopped, or does it automatically reset to the initial learning rate?

  3. Creating a torch-independent predictor with test time augmentation: I'm building a predictor that doesn't rely on the torch library. How would you recommend I write test time augmentation transforms that are independent of the torch.nn.Module class?

Thank you very much in advance.

mibaumgartner commented 1 month ago

Hi @dzeego ,

1) Neither MLFlow nor nnDetection tracks CPU, GPU, or memory usage. If you are interested in these statistics, you would need to implement that on your own, sorry (e.g. by implementing a different logger such as W&B, which should be quite easy since nnDetection is based on Lightning).
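If a full logger integration is more than you need, a lightweight sampler can be called from a Lightning hook (e.g. `on_train_epoch_end`). A minimal sketch using only the standard library; the class and `log_fn` names are illustrative, not part of nnDetection, Lightning, or MLflow, and GPU utilisation would additionally require something like `pynvml` or parsing `nvidia-smi`:

```python
import resource
import time

class SystemMetricsSampler:
    """Collects coarse CPU-time and peak-memory stats for the current process."""

    def __init__(self, log_fn=print):
        self.log_fn = log_fn  # e.g. mlflow.log_metrics in a real setup
        self._wall_start = time.monotonic()

    def sample(self):
        usage = resource.getrusage(resource.RUSAGE_SELF)
        metrics = {
            "cpu_user_s": usage.ru_utime,    # user-mode CPU time
            "cpu_system_s": usage.ru_stime,  # kernel-mode CPU time
            "peak_rss_kb": usage.ru_maxrss,  # peak resident memory (KiB on Linux)
            "wall_s": time.monotonic() - self._wall_start,
        }
        self.log_fn(metrics)
        return metrics
```

Calling `sample()` once per epoch and pointing `log_fn` at your tracking backend is usually enough for coarse resource monitoring.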

2) The learning rate resumes where it stopped.
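One way to see why resuming preserves the learning rate: schedules of the nnU-Net family are, as far as I understand, pure functions of the current epoch, so restoring the epoch counter restores the LR. A hedged pure-Python illustration (the poly schedule and the exponent 0.9 are assumptions borrowed from nnU-Net conventions, not verified against the nnDetection code):

```python
# Illustration only: a poly learning-rate schedule computed from the epoch.
# Because the LR is a pure function of the epoch, resuming at epoch N
# reproduces exactly the LR that was in effect when training stopped.
def poly_lr(epoch: int, max_epochs: int = 1000, initial_lr: float = 1e-2,
            exponent: float = 0.9) -> float:
    return initial_lr * (1 - epoch / max_epochs) ** exponent

lr_at_stop = poly_lr(500)          # LR when training was interrupted
lr_on_resume = poly_lr(500)        # recomputed after restoring epoch=500
assert lr_at_stop == lr_on_resume  # resume continues; no reset to initial_lr
```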

3) Adapting the test time augmentation transforms to numpy (or any other backend) should be straightforward. The code for the TTA transforms is located here: https://github.com/MIC-DKFZ/nnDetection/blob/main/nndet/io/transforms/spatial.py (off the top of my head I'm not sure whether scatter_ has a direct equivalent in numpy, but the rest should simply be a matter of replacing the torch operations).
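A hypothetical sketch of what a torch-free mirror (flip) TTA step could look like. The function names, the `(C, X, Y, Z)` array layout, and the `(x1, y1, z1, x2, y2, z2)` box layout are all assumptions for illustration and do not mirror the actual classes in `nndet/io/transforms/spatial.py`:

```python
import numpy as np

def mirror_forward(volume: np.ndarray, axes=(0, 1, 2)) -> np.ndarray:
    """Flip the spatial axes of a (C, X, Y, Z) array before prediction."""
    return np.flip(volume, axis=[a + 1 for a in axes])

def mirror_backward_boxes(boxes: np.ndarray, shape, axes=(0, 1, 2)) -> np.ndarray:
    """Map boxes (x1, y1, z1, x2, y2, z2) on the flipped volume back
    to the original coordinate frame (shape = spatial extent)."""
    boxes = boxes.copy()
    for a in axes:
        size = shape[a]
        lo, hi = boxes[:, a].copy(), boxes[:, a + 3].copy()
        boxes[:, a] = size - hi      # flipping swaps and reflects the interval
        boxes[:, a + 3] = size - lo
    return boxes
```

The inverse transform for boxes is the interesting part: flipping an axis of size `s` maps an interval `[lo, hi]` to `[s - hi, s - lo]`, so applying the backward mapping after predicting on the flipped volume recovers original-frame coordinates.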

Best, Michael

dzeego commented 1 month ago

Hello @mibaumgartner,

Thank you for the quick response!

If you may, I have a few more questions:

  1. During the preprocessing, is it possible to completely avoid the cropping step (in my case cropping is a redundant step as none of the data are cropped) and directly start the normalization and resampling process? Basically going from raw_splitted to preprocessed without having to go through raw_cropped?
  2. Why is inference performed using model_last.ckpt rather than model_best.ckpt?
  3. In the automatic configuration of nnDetection, is finding the largest patch size privileged over the batch size? How does nnDetection automatically determine a suitable patch size based on the GPU memory? In your view, is a larger patch size better than a larger batch size?
  4. In the config file v001.yaml, in the augment.cfg what are the num_threads and num_cached_per_thread? Does num_threads refer to the number of physical cores in this case?
  5. In the same config file v001.yaml, what is the difference between augment_cfg.oversample_foreground_percent and model_cfg.head_sampler_kwargs.positive_fraction?
  6. Do you have a timeline for the next release of nnDetection?

Many thanks in advance, Best!

mibaumgartner commented 1 month ago

Hi @dzeego ,

1) No, it is not possible to skip the cropping step for now.

2) It is quite difficult to determine the correct stopping point for training in the medical domain: the online validation is a rough proxy of the final performance, but it exhibits a different ratio of foreground to background patches compared to running inference on the entire patient. If the computational resources allow for it, the training length (i.e. number of epochs) is one of the first parameters to tune for improved performance ;)

3) The batch size is always constant in nnDetection, in contrast to nnU-Net. The patch size is maximised for the GPU type. For the targeted GPU memory, the proposed configuration always works pretty well. For larger GPUs (32GB and above) it is currently difficult to assess the best hyperparameter to increase, i.e. adding more channels to the network, bumping the patch size and/or increasing the batch size. The best choice will likely vary depending on object size in the dataset, dataset size, etc.

4) num_threads is the number of threads used for augmentation and should be set to the number of virtual cores (i.e. threads) that should be used here. The exact number depends on how many users are on the server and how much RAM is available. num_cached_per_thread is the number of batches saved in the queue before it stops preparing more batches.

5) augment_cfg.oversample_foreground_percent is responsible for balancing the patches in the batch, while model_cfg.head_sampler_kwargs.positive_fraction influences how anchors are sampled for the loss computation.

6) No final timeline yet, currently writing the paper ;)
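The distinction between the two sampling knobs in point 5 can be sketched as two independent counts, one at the batch level and one at the anchor level. A rough illustration under assumed mechanics (the function names and the exact rounding are hypothetical, not taken from the nnDetection code):

```python
# Illustration: the two config values act at different stages of training.
def n_forced_foreground(batch_size: int, oversample_foreground_percent: float) -> int:
    """Patches per batch that are forced to contain a foreground object
    (augment_cfg.oversample_foreground_percent)."""
    return round(batch_size * oversample_foreground_percent)

def n_positive_anchors(num_sampled_anchors: int, positive_fraction: float) -> int:
    """Upper bound on positive anchors used per image in the head loss
    (model_cfg.head_sampler_kwargs.positive_fraction)."""
    return int(num_sampled_anchors * positive_fraction)
```

So e.g. with a batch size of 4 and `oversample_foreground_percent: 0.5`, two patches per batch would be guaranteed to contain an object, while `positive_fraction` only shapes the positive/negative mix of anchors fed to the loss within each patch.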

Best, Michael

dzeego commented 1 month ago

Hello @mibaumgartner,

Thank you very much for the quick replies!

Apologies, but I have one last question please: Having trained an nnDetection model with a patch size of, say, [96, 96, 64], is it possible to change the patch size and bump it up to, say, [192, 192, 128] during inference? In other words, is it possible to have a dynamic patch size for inference when using an nnDetection model trained with a specific patch size? If so, where can that be changed and configured?

Thanks again! Best regards :)

mibaumgartner commented 1 month ago

Theoretically it is possible to bump the patch size for inference by simply changing it in the plan_inference.pkl file in the model folder :) I haven't tested it myself, but there might be some caveats, e.g. the statistics within batches change, which might lead to strange downstream performance due to the normalisation layers.
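In case it helps, an equally untested sketch of editing the plan file in place. The assumption that the patch size lives under a top-level `"patch_size"` key is mine; inspect your plan_inference.pkl first, and keep a backup of the original file:

```python
import pickle
from pathlib import Path

# Untested sketch: bump the inference patch size stored in plan_inference.pkl.
# The location of the patch size inside the plan dict ("patch_size" at the
# top level here) is an assumption; check the actual structure before editing.
def bump_patch_size(plan_path: Path, new_patch_size) -> dict:
    plan = pickle.loads(plan_path.read_bytes())
    plan["patch_size"] = list(new_patch_size)
    plan_path.write_bytes(pickle.dumps(plan))  # back up the original first!
    return plan
```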

Best, Michael

github-actions[bot] commented 1 week ago

This issue is stale because it has been open for 30 days with no activity.