YipengHu / label-reg

(This repo is no longer up-to-date. Any updates will be at https://github.com/DeepRegNet/DeepReg/) A demo of the re-factored label-driven registration code, based on "Weakly-supervised convolutional neural networks for multimodal image registration"

Global/Affine registration dramatic overfitting #20

Closed bashkanov closed 4 years ago

bashkanov commented 5 years ago

Dear Yipeng,

in my project, I'm trying to follow your approach of stacking the global and local model parts sequentially. Currently, I'm facing a problem with the global registration, while the local registration is fully functional. When learning the affine transformation, my model either overfits dramatically or cannot learn anything. In principle, the model looks like this: conv(4) > conv(8) > conv(16) > conv(32) > dense(12) > affine_ddf, with a stride of 2. I want to start simple and develop the architecture further from there. I believe this configuration tends to overfit because of the dense layer with too many parameters. Another option would be to replace the dense layer with global pooling to reduce the number of parameters at this point, but in that case the model learns nothing.

Overfitted case (with a lot of parameters in the dense layer): Screenshot_1

Model isn't learning (i.e. global pooling): Screenshot_2

I tried various combinations, like making the model deeper or wider (adding more conv filters), and I always get stuck in one of these two results without any hint of how to improve the model.
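To make the setup concrete, here is a minimal Keras sketch of what I mean (my own simplification for illustration, not code from this repo; the input shape is arbitrary):

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_global_net(input_shape=(64, 64, 64, 2)):
    # Fixed and moving images stacked on the channel axis.
    inputs = tf.keras.Input(shape=input_shape)
    x = inputs
    for n_filters in (4, 8, 16, 32):
        x = layers.Conv3D(n_filters, kernel_size=3, strides=2,
                          padding="same", activation="relu")(x)
    x = layers.Flatten()(x)
    # 12 affine parameters (the affine_ddf step would turn these into a
    # dense displacement field afterwards).
    theta = layers.Dense(12)(x)
    return tf.keras.Model(inputs, theta)
```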

Could you please share your experience training the affine transformation? Have you tried to train it separately? If it works together with the local model, it should also work separately, right? Am I missing something?

I hope I stated my problem clearly. I'd appreciate all the help I can get.

YipengHu commented 5 years ago

The first figure certainly looks like massive overfitting, assuming the red line is training and the blue is testing? While I can't comment on the global pooling you implemented/added, I do not believe "many parameters" is the reason for overfitting. It could also be a case of divergence - try a smaller learning rate. What images are you working on?

Another note: if your goal is deformable registration, my current position is that I do not see why the affine is needed at all - see the MedIA paper for details. https://reader.elsevier.com/reader/sd/pii/S1361841518301051?token=664A1188693402D5C9CD07942B235FD8EC1A676138365501140BAF3029F736FEDEA6F0349E0338B33A9B1CF47806B7FE

bashkanov commented 5 years ago

Yep. The curve above represents the training data.

I simply used global average pooling after a conv(12) layer to keep the model fully-convolutional. The idea behind this approach is presented here: https://stats.stackexchange.com/a/308218. But unfortunately it didn't work out either.
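Roughly, the head I mean looks like this (a sketch of my reading of the linked answer, not code from this repo):

```python
from tensorflow.keras import layers

def affine_gap_head(features):
    # Replace the dense layer with a 1x1x1 conv to 12 channels followed by
    # global average pooling, so the 12 affine parameters no longer depend
    # on a flattened feature map.
    x = layers.Conv3D(12, kernel_size=1)(features)
    return layers.GlobalAveragePooling3D()(x)  # -> (batch, 12)
```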

I will try to reduce the LR. Maybe it will help somehow.

I'm working on TRUS-MRI data (multimodal registration) too. The model I'm using for local registration obviously lacks global alignment. That is why I decided to build in the affine transformation step.

YipengHu commented 5 years ago

How did you tell it lacks global alignment if it is functional?

bashkanov commented 5 years ago

The local model is trained on centroid-centred data, which is only a temporary fix. It is not able to learn the global displacement.

YipengHu commented 5 years ago

What happened when you trained on the original data without the centroid initialisation?

bashkanov commented 5 years ago

The model cannot reach the same performance as with centroid initialisation.

bashkanov commented 5 years ago

LR tuning didn't impact the test accuracy; it is still pretty low. This problem (affine registration) seemed quite trivial, with only 12 params to learn. I have no idea left as to what could go wrong. Have you tried to train only the affine transformation?

YipengHu commented 5 years ago

Surprised you got results so quickly! It took me around 2-3 days to get anything to converge properly...


YipengHu commented 5 years ago

Yes, I have tried with only affine - it works, but it was more sensitive to the learning rate; when it does not diverge, it works well.

YipengHu commented 5 years ago

Hi @bashkanov - I'm closing this, but feel free to update us on your progress.

bashkanov commented 4 years ago

The problem with affine registration is still not resolved. But summation upscaling works well for the global registration so far.

YipengHu commented 4 years ago

OK - let's recap:
1 - using local only works with initialisation?
2 - does not converge using local only without initialisation?
3 - over-fitting when using affine only?
Not sure global pooling is the problem here (if implemented correctly ;p). Which of the above are correct? @bashkanov

bashkanov commented 4 years ago

1 - yes, but this is already resolved; the local model became global.
2 - no, it did converge, but there was room for improvement.
3 - yes. But I don't remember whether I used the Dice loss with smoothed labels. After getting plausible results for global shifts with another approach, I didn't touch the affine network.
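By "Dice loss with smoothed labels" I mean something along these lines - a minimal sketch of a plain soft Dice with labels smoothed towards 0.5 (an illustration only, not the multi-scale Dice from the paper):

```python
import tensorflow as tf

def soft_dice_loss(y_true, y_pred, smooth_label=0.1, eps=1e-7):
    # Smooth the binary labels towards 0.5 before computing soft Dice.
    y_true = y_true * (1.0 - smooth_label) + 0.5 * smooth_label
    axes = list(range(1, len(y_pred.shape)))  # reduce over spatial/channel dims
    intersection = tf.reduce_sum(y_true * y_pred, axis=axes)
    union = tf.reduce_sum(y_true, axis=axes) + tf.reduce_sum(y_pred, axis=axes)
    return 1.0 - tf.reduce_mean((2.0 * intersection + eps) / (union + eps))
```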

YipengHu commented 4 years ago

2 - So it is also a case of over-fitting. But this is most puzzling to me, as there is no reason the "local" net needs a simple initialisation like you described - it is really just a bias in displacement. How much training data have you got? How many labels per image pair, approximately?
3 - the same as above, but it should be a bit more sensitive to initialisation and learning rate etc.

bashkanov commented 4 years ago

I have approximately 150 image pairs with only whole segmentation masks.

A bit off-topic: In your work (weakly-supervised learning) you wrote that for comparison you have registered images in a "traditional" way:

The B-splines free-form deformation regularised by bending energy (Rueckert et al., 1999), weighting being set to 0.5 for comparison, was optimised with respect to three intensity-based similarity measures, normalised mutual information (NMI), normalised crosscorrelation (NCC) and sum-of-square differences (SSD).

I wonder what similarity measure produced the best result for you, and how many histogram bins you set for MI? I'm trying to register my test cases using elastix, but now plausible results for the registration come out of there. I combine AdvancedMattesMutualInformation with TransformBendingEnergyPenalty, with the same weights as you, and I also use a B-spline transformation. Here, the registered case is depicted (I think it would be very unfair to compare them with deep-learning results): image
YipengHu commented 4 years ago

For the paper, we used the default settings in NiftyReg. It is unfair to compare with deep-learning methods, but I also would not call the results you showed "plausible" ;) For details of the "B-splines free-form deformation", see Rueckert's paper: https://ieeexplore.ieee.org/abstract/document/796284

bashkanov commented 4 years ago

Sorry, it was a typo: "now plausible" should have read "no plausible".

bashkanov commented 4 years ago

Hi Yipeng, sorry for bothering you again. I just want to ask about the implementation details of the random_transform_generator function, which produces a random affine transformation without flipping. In principle, I understand how it works: it computes the linear relation between "corner" points using a least-squares method, e.g. between S and T below.

S = [[[-1., -1., -1.,  1.],
      [-1., -1.,  1.,  1.],
      [-1.,  1., -1.,  1.],
      [ 1., -1., -1.,  1.]]]

T = [[[-0.966, -0.959,  -0.91,  1.   ],
      [ -0.99, -0.973,  0.995,  1.   ],
      [-0.968,  0.938, -0.946,  1.   ],
      [ 0.918, -0.957, -0.992,  1.   ]]]

My question concerns the choice of the values in the matrix S. Why does S represent exactly these 4 points? For instance, could we use this set instead?

S = [[[1., 1.,  1.,  1.],
      [1., 1., -1.,  1.],
      [1., -1., 1.,  1.],
      [-1., 1., 1.,  1.]]]

In your work, you wrote that the produced transformation comes without flipping. Intuitively, it is clear why this is guaranteed, but I want to ask: is there any fundamental or theoretical background for this? Thanks!

YipengHu commented 4 years ago

No problem, and good questions! You can use whatever order of the points you desire, so long as the old and new sets are consistent ;) No flipping is ensured simply because the perturbation on the corners is small (the corner_scale parameter is small) and the sign of the coordinates is never changed - which is what you would do when you actually want to flip to augment the data.
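To make this concrete, here is a minimal NumPy sketch of the idea (an illustration, not the repo's actual random_transform_generator): fit the affine that maps the reference corners S to slightly perturbed targets T by least squares.

```python
import numpy as np

def random_affine_from_corners(corner_scale=0.1, seed=None):
    rng = np.random.default_rng(seed)
    # Four affinely independent corners in homogeneous coordinates, as quoted above.
    S = np.array([[-1., -1., -1., 1.],
                  [-1., -1.,  1., 1.],
                  [-1.,  1., -1., 1.],
                  [ 1., -1., -1., 1.]])
    # Small perturbations keep each corner close to its original position,
    # which is what prevents the fitted transform from flipping.
    T = S.copy()
    T[:, :3] += corner_scale * rng.uniform(-1., 1., size=(4, 3))
    # Solve S @ A = T for A; with four non-coplanar points this is exactly
    # determined, and the last column of A stays [0, 0, 0, 1].
    A, *_ = np.linalg.lstsq(S, T, rcond=None)
    return A

print(random_affine_from_corners(seed=0))  # 4x4 affine matrix
```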

@bashkanov I'm closing this issue now. Please open a new one in future so it is easier to track. Thanks! :)