Training instability with Dice Loss/Tversky Loss

martaranzini commented 4 years ago

I am training a 2D UNet to segment fetal MR images using MONAI and I have been observing some instability in the training when using MONAI Dice loss formulation. After some iteration, the loss jumps up and the network stops learning, as the gradients drop to zero. Here is an example (orange is loss on training set computed over 2D slices, blue is loss on validation computed over 3D volume).

After investigating several aspects (using the same deterministic seed), I've narrowed down the issue to the presence of the smooth term in both the numerator and denominator of the Dice Loss: f = 1.0 - (2.0 * intersection + smooth) / (denominator + smooth)

When using the formulation: f = 1.0 - (2.0 * intersection) / (denominator + smooth) without the smooth term in the numerator, the training was stable and no longer showed unexpected behaviour: [Note: this experiment was trained for much longer to make sure the jump would not appear later in the training]

The same pattern was observed also for the Tversky Loss, so it could be worth investigating the stability of the losses to identify the best default option.

Software version MONAI version: 0.1.0+84.ga683c4e.dirty Python version: 3.7.4 (default, Jul 9 2019, 03:52:42) [GCC 5.4.0 20160609] Numpy version: 1.18.2 Pytorch version: 1.4.0 Ignite version: 0.3.0

Training information Using MONAI PersistentCache 2D UNet (as default in MONAI) Adam optimiser, LR = 1e-3, no LR decay Batch size: 10

Other tests The following aspects were investigated but did not solve the instability issue:

Gradient clipping
Different optimisers (SGD, SGD + Momentum)
Transforming the binary segmentations to a two-channel approach ([background segmentation, foreground segmentation])
Choosing smooth = 1.0 as default here (https://github.com/MIC-DKFZ/nnUNet/blob/master/nnunet/training/loss_functions/dice_loss.py). However, this made the behaviour even more severe and the jump would happen sooner in the training.

The following losses were also investigated

Binary Cross Entropy --> stable
Dice Loss + Binary Cross Entropy --> unstable
Dice Loss (no smooth at numerator) + Binary Cross Entropy --> stable
Tversky Loss --> Unstable
Tversky Loss (no smooth at numerator) --> stable

wyli commented 4 years ago

thanks for this nice report, will look into this (cc @FabianIsensee @ericspod @holgerroth)

FabianIsensee commented 4 years ago

Interesting observation. I must admit that I have never investigated this in depth in nnU-Net. I should also note that while the loss function defaults to smooth=1, nnU-Net actually uses smooth=1e-5 (see here https://github.com/MIC-DKFZ/nnUNet/blob/9524dc33425627d1e4b21336a5202b1bcc8157e5/nnunet/training/network_training/nnUNetTrainer.py#L112).

Have you tried training the 2D configuration of nnU-Net with this dataset? It would be quite interesting in knowing whether the instability happens there as well

martaranzini commented 4 years ago

No, I actually haven't tried and trained nnUnet with my data, but I can look into it in the next days and feedback whether I observe the same behaviour :) (@LucasFidon may find this conversation useful as well)

FabianIsensee commented 4 years ago

When you do, please beware that there is a memory leak on Ubuntu 18 and above with Turing cards. Unfortunately there is nothing I can do about it because this is related to mixed precision training with apex/amp (and also occurs with pytorch autocast). I already opened an issue about this and I hope someone will look into this. If you observe this on your system as well, please let me know. You can get rid of it by training in fp32 with the -fp32 option (you will need more GPU memory).

Let me know if you need any assistance getting 2D data to run. This is not straightforward right now. You basically need to create dummy 3D niftis that have shape 1 in the first axis.

Best, Fabian

wyli commented 4 years ago

I'm trying to replicate this issue with some synthetic data, looks like using sigmoid=True instead of softmax=True is much more stable. @martaranzini are you using the softmax=True option, and could you please try sigmoid=True if possible?

i guess softmax is causing some over/underflow I'll look further into this

martaranzini commented 4 years ago

Hi @wyli I was actually using sigmoid=True already in the experiments I reported in the issue. When using a single-channel output, I put sigmoid=True and softmax=False. Conversely, in the two-channel test I did, I used the softmax instead of the sigmoid. In both cases I would observe the reported behaviour.

We are also working on training the nnU-net on the same data to see if it shows the same behaviour. We are going to report back about it as soon as possible.

wyli commented 4 years ago

ok, thanks for the information, I haven't tested a single-channel case yet, I'll do that as well. for now two-channel + sigmoid seems to be more stable compared with two-channel + softmax. looks like it's independent of the choice of network afaik

martaranzini commented 4 years ago

Ok, I see, thanks! I can try and test two-channel + sigmoid instead of softmax on my data as well and see if it gets more stable. I will let you know.

martaranzini commented 4 years ago

Hi all,

First, apologies for the delay in getting back to you about this issue – we have been running a few experiments to put together a MONAI and nnU-Net comparison and this required quite some time. We hope you will find our results interesting and informative.

@LucasFidon kindly ran a few experiments with nnU-Net and this allowed us to identify some implementation differences. Both MONAI and nnU-Net use the Dice formulation: CodeCogsEqn However, as a default MONAI computes the Dice per element in the batch and then the loss is averaged across the batch (we will refer to this simply as “Dice”). We noticed that in nnU-Net the Dice is instead computed directly as a single value across the whole batch (i.e. not per image). The average is computed only across the channels, not across the batch elements. We refer to this approach as “Batch Dice”. We experimented with both formulations in both frameworks, and also tested for Dice loss and Dice + Cross entropy loss at training, as well as for the use of a single- or 2-channel approach.

The performance of the different trained models on the validation set is reported below (note: they are separated in groups A-E for our own clinical interpretation, but the group separation is not particularly relevant for this issue). The sampling strategy at training is also reported. monai_comparison1

A few specifications:

MONAI – Dice no smooth at numerator used the formulation:
nnU-Net – Batch Dice + Xent, 2-channel, ensemble indicates ensemble performance from 5-fold cross validation at training
NeuroImage indicates a published two-step approach on our dataset, and it is reported just for reference.

We gather two main observations from these experiments:

Training instability: On our dataset, nnU-Net did not present the training instability observed with the default MONAI implementation of Dice. This is also confirmed when known implementation differences were ruled out (nnU-Net – Dice, 2-channel, uniform sampling). Also, nnU-Net generally provides better performance. @FabianIsensee, are there any other implementation differences that could justify our results?
Dice vs Batch Dice: In both frameworks, the Batch Dice implementation clearly outperforms the “normal” Dice computation. This could be an interesting feature to be added in MONAI – happy to open another issue/PR about this.

@wyli: I did retrain also the two-channel with sigmoid instead of softmax. With respect to the single-channel, the gradients do not drop to zero (with either sigmoid and softmax), but I still observe some instability in the loss which I cannot fully explain: monai_comparison2

Looking forward to hearing your comments, and happy to run more experiments to investigate this further!

FabianIsensee commented 4 years ago

Hi @martaranzini , thank you so much for the detailed report! It's a very interesting read and I am happy to see that nnU-Net did not disappoint.

Just so that I think about the results in the correct context (@LucasFidon ) : Did you provide separate 2D slices as training examples? Or did you just run the 2D nnU-net configuration on the 3D images? My guess is the latter, but it would be important to be sure.

The batch dice (this is what I call it as well) is implemented on purpose in nnU-Net, but it is not always active. nnU-Net sets that automatically:

if the patch size does not cover the entire image nnU-Net uses batch dice. Since small patches do not represent the true class distribution anyways we might as well benefit from the increased stability of batch dice. Batch dice is used for all 2D U-Nets or 3d_fullres U-Nets if no 3d_lowres is present
nnU-Net uses 'sample dice' (that is what I call the regular Dice) for all cases where the patch size covers almost entire training cases, so that would be the 3d_lowres U-Net and the 3d_fullres U-Net if no lowres is present.

Note that there are cases where sample Dice is better than batch dice, so replacing it categorically is not a good idea.

I don't know about the MONAI implementation, but using the sample Dice in 2D segmentation tasks in nnU-Net is a bad idea. That is because nnU-Net never optimizes the background class with the Dice (just the CE). What this translates to is a loss function that will not yield any useful gradients on slices where no foreground voxels are present - essentially ignoring these slices in optimization. If you want to run sample Dice with nnU-Net, make sure to set do_bg=True in the SoftDiceLoss :-) (this problem does not exist when pairing the Dice with CE because the CE term saves it ;-) )

In my experiments, using uniform sampling gives pretty much the same results as oversampling foreground. Right now we see uniform sampling in nnU-Net only for a scenario which is also unstable in non-uniform sampling. It would be very interesting to see the nnU-Net performance for uniform sampling with the default loss function as well to ensure that oversampling it not skewing the results.

I am not quite sure I understand what you mean by 'ensemble indicates ensemble performance from 5-fold cross validation at training'. Is that simply the Dice score averaged over the five folds? Or do you have a heldout validation set where you applied the ensemble to?

Since your segmentation problem seems to be 3D - what was your motivation to go with a 2D U-Net? I believe you could get better results by running the 3D U-Net. If you have compute resources to spare you could just run the 3d_fullres config. If your dataset is public I can also do that for you :-)

Let me know what you think!

Best,

Fabian

LucasFidon commented 4 years ago

Hi @FabianIsensee, thank you for the quick reply!

Regarding 3D vs 2D, in fetal MRI the 3D MRI are motion-corrupted stacks of 2D slices. In addition the in-plane resolution is typically several times higher than the resolution across planes (please see [1] for more details). In [1] it has already been shown empirically that 2D CNNs were superior to 3D CNNs for the task of fetal brain extraction. As a result, I have trained nnU-Net only in '2d' mode. For training, the 2D slices were used as input (with an extra first dimension of size 1 as you suggested above).

Regarding the validation dataset used in the figure of @martaranzini, it is a heldout validation set. The nnU-Net ensemble is the ensemble of the five 2D U-Nets trained for the five folds created by nnUNet.

The dataset is not publicly available... But I will train nnU-Net for the default configuration (batch Dice + CE) except I will deactivate the oversampling :)

Do you think that it is worth trying do_bg=True also with batch Dice + CE or does do_bg=False always work better in your experience?

Thanks, Lucas

[1] Ebner, Michael, et al. "An automated framework for localization, segmentation and super-resolution reconstruction of fetal brain MRI." NeuroImage 206 (2020): 116324.

martaranzini commented 4 years ago

Hi @FabianIsensee ,

Thanks for your comments, very interesting, especially in the analysis of when batch Dice is more suitable than sample Dice. I totally agree that it should not replace it categorically, but it would be useful to add an option in MONAI for the user to choose one over the other.

I will leave to @LucasFidon the answers to the nnU-Net experiments as he actually did all the hard part on that :)

I can however reply to the details about our segmentation problem. Our task is brain segmentation of fetal MR images for subsequent super-resolution-reconstruction. Our data is 3D but because of heavy (inter-slice) motion artifacts (and heavy anisotropic resolution), we need to approach it as a 2D problem, where we segment each slice independently. Only at inference we perform a 3D assessment, by computing the Dice for the whole 3D image - and this 3D assessment is reported for the validation set in the figure above. We did run some tests with 3D training but it was clearly underperforming the 2D approach.

In addition, the FOV of our images are quite large and we do have a lot of background-only slices - and we want to be able to correctly predict them as empty. So the comment you raised about the background is spot-on and we will look into greater detail about it. I know for sure we did include the background in MONAI (in the two-channel approaches), but we will check the single-channel and nnU-Net better for this. Re: the single-channel, I think that removing the smooth term at the numerator in the Dice will have the same effect of ignoring the background slices. On the other hand, we have this instability in the Dice loss and in the Tversky loss in the MONAI framework when keeping the smooth term at the numerator that we cannot explain.

Many thanks! Marta

martaranzini commented 4 years ago

Oups, I think we posted about at the same time, thanks for answering the nnU-Net questions, @LucasFidon!

FabianIsensee commented 4 years ago

Hi Lucas,

thanks for your response. In that case it would certainly have been easier to simply plug the 3D volumes into nnU-net and let it handle the slicing itself (it's gonna choose the in plane axis automatically and this way you will also get whole 3D volumes as output). I presume you have generated a manual split file to ensure that the data is split properly during cross-validation? This is very important!

I don't think that evaluating do_bg=True with batch dice + CE is going to give you a substantially different result. I think it is more important to use it in combination with sample Dice (without CE). CE covers up a lot of the potential shortcomings of the Dice loss, which is why we use it in nnU-Net (nnU-Net is supposed to be as robust as possible).

nnU-Net handles anisotropic 3D data very well. Motion artifacts can certainly cause a problem, but I would really encourage you to also try the 3D data and see what happens. You might be surprised. Generally speaking you are of course correct, though. On very anisotropic datasets (for example ACDC cine MRI) the benefits of 3D over 2D are less than they could be.

@martaranzini Your response appeared right while I was writing, so let me address your points in this post as well: As long as you are using batch dice OR an additional CE term (essentially in all the cases where nnU-Net did not fail) you do not need to worry about empty slices (as long as they are also presented during training). That becomes only relevant when sample Dice is used without an additional CE term and without do_bg=True.

Regarding the smooth term in the numerator: Yes I think of it the same way. If there is no smooth term then the numerator and thus the entire loss is just 0 for empty slices. Note that this is a little bit more complicated though:

if you are using softmax and you are also optimizing the background class, this should not be a problem because by optimizing the background you are still affecting the output neurons of foreground (coupled via softmax)
if you are using softmax and not optimizing the background class then this will be a problem
if you are using sigmoid, this will be a problem (no coupling of outputs, so this is always a problem irrespective of whether you optimize the bg or not)

Best, Fabian

wyli commented 4 years ago

Thanks for the analysis @martaranzini from these results for now we could:

[x] add a BatchDice as a separate loss function impl.
[x] add a numerator smooth option to all the relevant losses and defaults to 0.0 ((2.0 * intersection + eps_1) / (denominator + eps_2); eps_1 defaults to 0.0)

Also, not sure whether matching the overall performance is a goal of this ticket, but the gap might come from the fact that the MONAI vanilla UNet doesn't use any type of multi-scale losses or deep supervision -- both the NeuroImage paper and (I think) nnUnet uses a multi-scale loss.

would you help evaluate the recent dynunet (which includes a multi-scale head) @martaranzini? https://github.com/Project-MONAI/tutorials/blob/master/modules/dynunet_tutorial.ipynb

cc @mmarcinkiewicz @pribalta @Nic-Ma

LucasFidon commented 4 years ago

Thanks @FabianIsensee. For the preprocessing of the data, following nnU-Net filenames convention, I have separated the 2D slices of Marta's training set and put them in /path-to-nnU-Net-raw-fetal-data/imagesTr, and I have copied the 3D images of Marta's validation set directly to /path-to-nnU-Net-raw-fetal-data/imagesVal (so nnU-Net gives me 3D output when I run inference for those images). I have used the automatic split of nnU-Net. If I understand well, this implies that Marta's training set = nnU-Net training set + nnU-Net validation set. But all the images from Marta's training set are still used for training by nnU-Net in the ensemble.

I will give a try to nnU-Net in 3d_fullres mode.

FabianIsensee commented 4 years ago

Hi Lucas, if you have split your 3d images into a series of 2d images and you have not accounted for this in a manually designed split file your entire set of cross-validation experiments is not valid. This is because nnU-Net per default expects to be able to treat the training cases independently. If you provide pseudo-2D slices, nnU-Net will just split those for its cross-validation - how would it know to do it differently? They are separate training example for all nnU-Net knows.

nnU-Net will always use all the cases located in the imagesTr folder for its cross validation. nnU-Net will not touch images located in other folder (for example imagesVal).

So unless I misread something, all the cross-validation results you reported above are overly optimistic. Unless you are not reporting cross-validation splits at all (you could also have trained nnU-Net with fold all and then used a single model to predict the validation cases)?

My suggestion would be to give nnU-net the 3D volumes volumes instead and let it do the 2D slicing itself. Then it will also be able to create proper splits. You can still use the 2D configuration of course. This will make everything so much easier. And this will allow you to also use the 3d unet on the data without having to have a separate copy. I suggested using dummy 2D slices only because I was under the impression that you have actual 2D data (where each 2D image is a completely different training case).

@wyli deep supervision only has a very small effect and I would be surprised if you could even see that on this dataset

Best, Fabian

martaranzini commented 4 years ago

Thanks @wyli, the two modifications you mentioned are exactly the changes I added to my "custom" Dice class in MONAI. Also I think the batch dice could be easily integrated in the DiceLoss class. In my case, for the batch dice I simply added a flag in the init() of the DiceLoss and modify the forward() method with:

if self.batch_version:
            # reducing only spatial dimensions and batch (not channels)
            reduce_axis = [0] + list(range(2, len(input.shape)))
else:
            # reducing only spatial dimensions (not batch nor channels)
            reduce_axis = list(range(2, len(input.shape)))

instead of https://github.com/Project-MONAI/MONAI/blob/0aee00bd6f29908f60722392e76015b731746443/monai/losses/dice.py#L131

And sure, I am happy to test the dynunet on our dataset and report back.

wyli commented 4 years ago

Thanks @wyli, the two modifications you mentioned are exactly the changes I added to my "custom" Dice class in MONAI. Also I think the batch dice could be easily integrated in the DiceLoss class. In my case, for the batch dice I simply added a flag in the init() of the DiceLoss and modify the forward() method with:
if self.batch_version:
            # reducing only spatial dimensions and batch (not channels)
            reduce_axis = [0] + list(range(2, len(input.shape)))
else:
            # reducing only spatial dimensions (not batch nor channels)
            reduce_axis = list(range(2, len(input.shape)))
instead of

https://github.com/Project-MONAI/MONAI/blob/0aee00bd6f29908f60722392e76015b731746443/monai/losses/dice.py#L131

And sure, I am happy to test the dynunet on our dataset and report back.

thank you!

LucasFidon commented 4 years ago

Hi Fabian, No, the Dice score results reported by Marta are for the cases in imagesVal (that nnU-Net does not touch during training), so none of the nnU-Net models has seen any of those cases during training! I understand that Marta's training cases are split across the different splits, but it is not a problem here because the results reported are not the cross-validation results :) Sorry if it was not clear (there was a typo in my previous message)

FabianIsensee commented 4 years ago

Hi Lucas,

In that case I owe you my apologies :-) I must have misread that. It would still have been cleaner to split the files properly - like this the ensemble it not really worth much, but I don't think that this matters.

If you are willing to run them, the following experiments would be interesting (this essentially repeats what we have discussed before):

run default nnU-Net without oversampling to confirm that oversampling has no effect on the result (i expect the outcome to be the same)
out of curiosity: run the 3d_fullres configuration and see what happens

@wyli since the splits are OK, there must be something else going on. And yeah - that's going to be very hard to figure out. In my experience it really is the little things that create big differences. Something obvious like the network architecture has much less impact on the model performance than one would think ^^ Also I don't think that the exact loss formulation is at fault. If you want to start searching: using the same learning rate schedule as well as the same number of iterations as nnU-Net would be a good start. After that standardized losses: use CE only for both frameworks (nnunet simply uses pytorch's CE). Data augmentation. Inference strategy. The list is long

Best,

Fabian

wyli commented 4 years ago

@wyli since the splits are OK, there must be something else going on. And yeah - that's going to be very hard to figure out. In my experience it really is the little things that create big differences. Something obvious like the network architecture has much less impact on the model performance than one would think ^^ Also I don't think that the exact loss formulation is at fault. If you want to start searching: using the same learning rate schedule as well as the same number of iterations as nnU-Net would be a good start. After that standardized losses: use CE only for both frameworks (nnunet simply uses pytorch's CE). Data augmentation. Inference strategy. The list is long

Sure, it's probably hard if I go it alone :D as MONAI is a consortium effort, I'm sure as a team we can identify the issue and close the gaps quickly. Thanks for your inputs here!

FabianIsensee commented 4 years ago

True that! If you need my input on something I am always happy to help out. Best, Fabian

tvercaut commented 4 years ago

For the record, this might be interesting to look at: Nordström M., Bao H., Löfman F., Hult H., Maki A., Sugiyama M. (2020) Calibrated Surrogate Maximization of Dice. In: Martel A.L. et al. (eds) Medical Image Computing and Computer Assisted Intervention – MICCAI 2020. MICCAI 2020. Lecture Notes in Computer Science, vol 12264. Springer, Cham. https://doi.org/10.1007/978-3-030-59719-1_27

FabianIsensee commented 3 years ago

Is there somewhere an implementation for this loss? I would like to give it a try but the paper is a bit too mathy for me (They even manage to make the soft dice loss look complicated ;-) ) and I don't have the time to go through all of that right now.

martaranzini commented 3 years ago

Hi all,

@wyli: I finally have the updates on the DynUnet. First, I would like to point out that together with @LucasFidon we identified a huge source of discrepancy in patch size and batch size between the manually selected values in MONAI and the automatically determined ones in nnU-Net. In MONAI I was using a way smaller patch size (roughly a factor of 5), which explains the very large gap of performance.

I did rerun the experiments with the “standard” MONAI UNet, but using the same patch and batch size as determined by nnU-Net. These results are reported in red (first boxplot of each group) in the figure below. For the DynUnet, wrt the MONAI tutorial I only modified the spacing transform to apply it only in the x-y plane, but no change of spacing along z (as our data is heavily affected by out-of-plane motion artefacts). All the training has been performed in 2D. Orange and light orange boxplots are the results with Dice + Xent and Batch Dice + Xent as losses respectively.

Here are the results on our validation sets (not seen at training):

Overall, we managed to reduce the gap substantially compared to our previous results with very minor modifications of existing tools in MONAI. However, using the optimal hyperparameters as determined by nnU-Net played a big role in this.

Note: for dynUnet, both Dice and Batch Dice I kept the original MONAI formulation of Dice: with smooth=1e-5. In this case, it did not show the previously observed instability at training. However, with the "standard" UNet, this formulation would still show the instability, despite the optimised patch and batch size. For that experiment, the smooth term at numerator was set to 0.

Hope this helps, and please let me know if I can help further :)

wyli commented 3 years ago

thanks @martaranzini, for the loss function, I'll create a PR for flexible options of smooth and the "batch version" perhaps adds a new Dice+Xent loss as well. would be great to have your review for the PR :)

FYI @mmarcinkiewicz @pribalta @Nic-Ma @yiheng-wang-nv

FabianIsensee commented 3 years ago

Thank you for the update @martaranzini . Looks like the differences have been resolved :-)

@wyli I would also volunteer for reviewing the PR. I've got a lot of experience with batch Dice and Dice+Ce from developing nnU-Net

Edit: Ah I am too late :-D

saruarlive commented 3 years ago

@FabianIsensee , @martaranzini @wyli , Wonderful work. Could you kindly suggest what set of heuristic rules/steps of nnU-Net should I follow when using monai's dynUnet?

martaranzini commented 3 years ago

Hi @saruarlive, Apologies for the delay in getting back to you. For an optimal outcome with the dynUnet, I used the patch size estimated by the heuristic rules in nnU-Net. If I recall correctly, the spacing is also another parameter that is automatically optimised in nnU-Net but not in dynUnet, so you may need to decide for that too.

FabianIsensee commented 3 years ago

Since dynUNet does not offer all the functionality nnU-Net does I would highly recommend also running nnU-Net to se whether there is a difference in segmentation performance

saruarlive commented 3 years ago

Dear @FabianIsensee, @martaranzini,

Thanks for your reply. Could you mention specifically what are those functionalities differ with the dyUnet? Is the network design the same for both DyUnet and nnUnet (the network design looks the same to me after reading your BRATS 20 paper)?

If so, then I can use the same set of augmentations, a learning rate scheduler from nnU-Net,

Is the pixel spacing fixed for a specific dataset during training, for example for brast 2020 data, it is (1.0, 1.0, 1.0), right.

The reason I am asking because I will adapt from both.

I create a class (compatible with monai framework followed from fabian's nnU-Net), not sure how fast it is,

@martaranzini provide some inputs

from batchgenerators.augmentations.utils import (
    create_zero_centered_coordinate_mesh, 
    elastic_deform_coordinates,
    interpolate_img,
)

from batchgenerators.augmentations.crop_and_pad_augmentations import (
    random_crop as random_crop_aug,
    center_crop as center_crop_aug,
)

class RandElasticDeformd(Randomizable, MapTransform):

    """
    Dictionary-based version :py:class: Saruar created transforms.RandAdjustBrightnessd`.

    See `numpy.random.normal` for additional details.
    https://numpy.org/doc/stable/reference/random/generated/numpy.random.normal.html

    Args:
        keys: Keys to pick data for transformation.
        prob: Probability of Elastic deformation.
        alpha and sigma: parameters.

    """

    def __init__(
        self,
        keys: KeysCollection,
        label_key: str = 'label',
        patch_size: Union[Sequence[int], int] = None,
        alpha:Union[Tuple[float, float], float] = (0., 900.),
        sigma:Union[Tuple[float, float], float] = (9., 13.),
        prob: float = 0.1,
        patch_center_dist_from_border:  Union[Sequence[int], int] = None, 
        random_crop:bool = False,
        border_mode : Union[Sequence[str], str] = ("nearest","constant"),
        border_cval : Union[Sequence[int], int] = (0, 0),
        order_interpolation : Union[Sequence[int], int] = (3, 0),

        #dtype: np.dtype = np.float32,
    ) -> None:
        super().__init__(keys)
        self.label_key = label_key
        self.alpha = alpha
        self.sigma = sigma
        self.prob = prob
        self.patch_size = ensure_tuple_rep(patch_size,3)
        self.patch_center_dist_from_border = ensure_tuple_rep(patch_center_dist_from_border,3)
        self.border_mode = ensure_tuple_rep(border_mode, len(self.keys))
        self.border_cval = ensure_tuple_rep(border_cval, len(self.keys))
        self.order_interpolation = ensure_tuple_rep(order_interpolation, len(self.keys))
        self.random_crop = random_crop

    def randomize(self, data: Optional[Any] = None) -> None:
        self._do_transform = self.R.random_sample() < self.prob

    def __call__(self, data: Mapping[Hashable, np.ndarray]) -> Dict[Hashable, np.ndarray]:
        self.randomize()
        d = dict(data)

        if not self._do_transform:
            return d
        for key in self.keys:
            keyorder = 0
            self.patch_size = (d[key].shape[1], d[key].shape[2], d[key].shape[3]) if self.patch_size == (None, None, None) else self.patch_size
            self.patch_center_dist_from_border = (d[key].shape[1]//2, d[key].shape[2]//2, d[key].shape[3]//2)  \
            if self.patch_center_dist_from_border == (None, None, None) else self.patch_center_dist_from_border

            coords = create_zero_centered_coordinate_mesh(self.patch_size)
            aa = np.random.uniform(self.alpha[0], self.alpha[1])
            ss = np.random.uniform(self.sigma[0], self.sigma[1])
            coords = elastic_deform_coordinates(coords, aa, ss)

            for pi in range(len(self.patch_size)):
                if self.random_crop:
                    ctr = np.random.uniform(self.patch_center_dist_from_border[pi],
                                            d[key].shape[pi + 1] - self.patch_center_dist_from_border[pi])
                else:

                    ctr = int(np.round(d[key].shape[pi + 1] / 2.))

                coords[pi] += ctr

            for c in range(d[key].shape[0]):

                if key==self.label_key:

                    d[key][c] = interpolate_img(d[key][c], coords, self.order_interpolation[keyorder],
                                                self.border_mode[keyorder], cval=self.border_cval[keyorder], is_seg=True)
                else:
                    d[key][c] = interpolate_img(d[key][c], coords, self.order_interpolation[keyorder],
                                                self.border_mode[keyorder], cval=self.border_cval[keyorder])

        return d

FabianIsensee commented 3 years ago

Hi, there are many implementation details that are different. The architecture is somewhat not important. So if you do anything - please also run nnU-Net and verify that you are getting the same segmentation performance in dynUNet. I am not familiar with the implementation of dynUNet. You will need to find the differences yourself ;-) That is only necessary though if you find that the performance between the two differs Best, Fabian

martaranzini commented 3 years ago

Hi, Perhaps @wyli could help point out the implementation differences between dynUnet and nnU-Net?

Regarding the pixel spacing, in MONAI it all depends on what type of transforms you use/need. In my case, I did use Spacingd - which brings all input images to the same pixel spacing - and I chose the pixel spacing according some needs in my application. If you don't apply any transform to change the spacing, then the data will be processed without considering the mm-space information, only the voxel-space. I think that nnU-Net instead determines the optimal spacing from the training set and all images are then resampled to that spacing (@FabianIsensee please correct me if I am wrong).

Also, the batch size as well is another parameter that gets optimised internally by nnU-Net but needs to be manually set in dynUnet. So in our experiments we took the batch size to be the same as the one used by nnU-Net.

I hope this helps.

Alex7Li commented 2 years ago

Hi, I know this is a dead issue but it was helpful for me when I was googling stuff related to the travesky loss so it was helpful for me. After doing some investigation I think I have a interesting observation, maybe it will be interesting to someone else who is googling stuff.

I think the problem could be that when the model is very accurate and predicts an image which does not contain class 1, the term (2.0 intersection + smooth) / (denominator + smooth) becomes smooth / (denominator + smooth) If you take the derivative w.r.t the denominator term, and the denominator is close to 0, you get a gradient of about 1/smooth which is very huge. If this logic is correct, then you can avoid the training instability by using (2.0 intersection + smooth) / (denominator + sqrt(smooth)) instead.

LucasFidon commented 2 years ago

Hi @Alex7Li In case you have not seen it, there is this MICCAI 2022 paper that studies similar problems https://arxiv.org/abs/2207.09521

tvercaut commented 2 years ago

Interesting read, thanks. I only had a quick read for now but in

Tilborghs, S., Bertels, J., Robben, D., Vandermeulen, D., & Maes, F. (2022). The Dice loss in the context of missing or empty labels: Introducing Φ and ε. arXiv preprint arXiv:2207.09521.

they end up setting ε_numer = ε_denom ≈ 1e5 (no typo here, it's not 1e-5) if the sample Dice is used instead of the batch Dice. That's a bit counter-ituitive to me even if there is a decent justification for it in the paper.

As such, I'd be interested to see if the proposed ε_denom = sqrt(ε_numer) rule of @Alex7Li with a small ε_denom achieves similar stabilisation results for the sample Dice case. This is somewhat hinted in the discussion above for the ε_numer = 0 results but the ε_denom = sqrt(ε_numer) rule could also adress some of the issues mentioned in https://github.com/Project-MONAI/MONAI/issues/807#issuecomment-696137618 for the ε_numer = 0 case.

wyli commented 2 years ago

Nice investigations -- when the supervision signal is empty/missing at either sample or batch level, there's nothing to learn, perhaps the relevant gradients should be set to zero (or based on some unsupervised methods such as contrastive loss?).

If that's the case then we should enhance the Dice loss with a customized gradient, because tuning the epsilons is prone to errors...

Project-MONAI / MONAI

Training instability with Dice Loss/Tversky Loss #807