feevos / resuneta

mxnet source code for the resuneta semantic segmentation models

Problems converging ResUnet-a d6 multitasking #7

Open thimabru1010 opened 4 years ago

thimabru1010 commented 4 years ago

Hello again @feevos,

I am using the Tanimoto dual loss with complement and it's working with the simple ResUnet-a. However, when training on multitasking using Tanimoto on all tasks, only the first epoch seems to be good and then the model early-stops at epoch 11. I am using an ISPRS dataset and a batch size of 1. Did you use any learning-rate scheduler or anything else that helped the training converge? Which batch size did you use?

Thanks for all the help.

feevos commented 4 years ago

Hi @thimabru1010 ,

batch size = ~256, definitely less than 300. No scheduler: you train until performance does not increase for ~100 epochs, then manually reduce the learning rate and restart from the best weights. I experimented with a lot of scheduling techniques (including distributed hyperparameter search for optimal scheduling parameters, with GPyOpt); the best I found is: babysitting. The problem you describe sounds like a bug. You should be seeing training plots like the ones in the manuscript (for validation Tanimoto/MCC - they are strongly correlated). Multitasking speeds up convergence and stabilizes training. Try switching some of the tasks on/off to isolate the problem.

A batch size of 1 sounds problematic for batch norm (how does it calculate running statistics with 1 datum?). If you cannot fit it in memory, you are better off with a window size of 128 and a batch size >6 (per GPU). If you do not have multiple GPUs at your disposal, try this trick to increase the batch size.
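
That trick isn't spelled out above, but one common way to emulate a larger batch on limited memory is gradient accumulation: accumulate gradients over several small batches and only then take an optimizer step. A minimal Gluon sketch, assuming that kind of approach (the network, data and step counts below are toy placeholders, not the ResUNet-a training code):

import numpy as np
import mxnet as mx
from mxnet import autograd, gluon

# Toy network and data, standing in for ResUnet-a and the ISPRS patches
net = gluon.nn.Dense(10)
net.initialize()
net.collect_params().setattr('grad_req', 'add')       # accumulate gradients instead of overwriting
trainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': 1e-3})
loss_fn = gluon.loss.SoftmaxCrossEntropyLoss()

physical_batch, accum_steps = 4, 8                     # effective batch size = 32
for i in range(32):
    data = mx.nd.random.uniform(shape=(physical_batch, 16))
    label = mx.nd.array(np.random.randint(0, 10, size=(physical_batch,)))
    with autograd.record():
        loss = loss_fn(net(data), label)
    loss.backward()                                    # gradients keep adding up across iterations
    if (i + 1) % accum_steps == 0:
        trainer.step(physical_batch * accum_steps)     # normalize by the effective batch size
        for p in net.collect_params().values():
            p.zero_grad()                              # reset the accumulated gradients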

Best of luck!

thimabru1010 commented 4 years ago

Thanks for the tips!

I indeed had a bug in the training, but I am still having some convergence problems. I will investigate my label generation. Yes, you're right, a batch size of 1 is a bit problematic. I could raise it to 4, but that's the maximum the GPU supports. Besides those problems, did you monitor just the loss function and Tanimoto/MCC, or did you also calculate accuracy for each task during training? Because the accuracy for the other tasks seems quite weird. I am calculating them automatically with Keras.

feevos commented 4 years ago

I usually monitor statistics on the target task (segmentation), with MCC, F1, precision, recall, accuracy and the Tanimoto loss (both overall and per class). I recall the performance indicator for the boundaries class, in particular, was lower (in score), but I don't remember exact numbers. If you have convergence problems with multitasking (i.e. worse than single task), you are bound to have a bug. At least that's what I've been seeing in all of my experiments.

Best of luck. PS: Accuracy for the distance transform will make no sense, because it's a regression problem. Same for the color reconstruction. Only segmentation and boundaries can give you accuracy.

thimabru1010 commented 3 years ago

Hey @feevos .

I managed to plot my predictions.

pred0_classes

pred0_color

PS: For the color transform I am using exactly the same code as yours to get the difference:

diff = np.mean(hsv_patch - hsv_label, axis=-1)
diff2 = 2*(diff-diff.min())/(diff.max()-diff.min()) - np.ones_like(diff)

But I didn't understand why you take the mean of the difference, and especially the diff2 calculation. At first I was just subtracting the two images.

This is the ISPRS Vaihingen dataset. I trained with a patch size of 256 but still using a batch size of 4... I could finally see that the multitasking is converging, since all the losses are decreasing and the MCC of the segmentation target task is growing (using the Tanimoto dual loss here). But the Overall Accuracy of the multitasking result is still worse than the single segmentation task alone. So I am really thinking that maybe the batch size is the problem, as we talked about before. Now I will try to train with a smaller patch size. In the PSP Pooling layer, do you change the max pooling kernel sizes depending on the size of the image that arrives at the middle PSP layer? I was using fixed kernel sizes of 1, 2, 4, 8, but this crashes training with patch sizes smaller than 256. I want to know because I didn't understand your recursive approach in the code very well, but it seems you adapt the pooling kernels.

Thanks for all the help.

thimabru1010 commented 3 years ago

Sorry to post again,

I managed to train with a patch size of 64 and raised the batch size to 8. However, after training for some epochs with all loss weights at 1.0 using the Tanimoto dual loss, the boundary and color losses turned NaN and the segmentation accuracy dropped hard. I searched and found that a lot of people have problems with NaNs when using dice-like losses; they seem numerically unstable even with a smoothing term in the division, as you use. I just want to know whether you had these numerical instabilities and, if so, what you did to overcome them.
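
For reference, a minimal numpy sketch of the Tanimoto-with-complement idea as described in the paper (the per-class reduction and the smooth value here are illustrative choices). The smooth term keeps the division defined, but it cannot rescue NaNs that already arrive from the network outputs:

import numpy as np

def tanimoto(p, l, smooth=1e-5):
    # p: predicted probabilities, l: ground truth, both of shape (H, W, classes)
    tp = np.sum(p * l, axis=(0, 1))
    denom = np.sum(p * p + l * l, axis=(0, 1)) - tp
    return (tp + smooth) / (denom + smooth)

def tanimoto_dual_loss(p, l, smooth=1e-5):
    # average the Tanimoto of the targets and of their complements, return 1 - score
    score = 0.5 * (tanimoto(p, l, smooth) + tanimoto(1.0 - p, 1.0 - l, smooth))
    return 1.0 - np.mean(score)

# Toy check: a perfect prediction gives a loss of ~0
l = np.eye(3)[np.random.randint(0, 3, size=(64, 64))]   # (64, 64, 3) one-hot mask
print(tanimoto_dual_loss(l.copy(), l))                   # ~0.0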

feevos commented 3 years ago

Hi @thimabru1010, great to see it's working much better for you now :)

With regard to the problem with images smaller than 256, you need to change the depth of the psp_pooling operator: make it depth=3. Note this is not the depth of the network - it relates to the number of subdivisions of the input filters. This should fix your problem.

The value of the accuracy is not always a good descriptor of the quality of the fit, therefore I cannot really comment on that. It is strongly correlated with the mcc performance, and the loss value, but note that this is a multi-objective problem, where you have three performance metrics, that sometimes may disagree on the quality of the prediction, while the overall performance improves (for a human observer).

With regard to the NaN values, some rare times I get them too (with 256 or 512 input patch size), but I don't know what the source of the problem is or whether it relates to the loss. In any case this should be a rare event, rather than the rule.

PS You may be interested in our latest work, which is on change detection but strongly correlates with semantic segmentation (https://github.com/feevos/ceecnet); in particular, that repository also contains semantic segmentation models.

feevos commented 3 years ago

PS @thimabru1010 looking again at your predictions, the boundaries produce a kind of faint grid that shouldn't be there. It indicates some kind of bug, perhaps in the way you are using the layers (missed/different activations at different parts of the network?).

PS2

diff = np.mean(hsv_patch - hsv_label, axis=-1)
diff2 = 2*(diff-diff.min())/(diff.max()-diff.min()) - np.ones_like(diff)

I want to produce a 1-dimensional output so I can compare the HSV ground truth and the prediction; this is what diff does, by taking the mean of the differences over all channels (the mean is along the channel dimension). diff2 is the same thing as diff, but rescaled to the [-1, 1] range for visualization purposes.
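
A quick check with toy patches confirms that diff2 is just diff min-max rescaled to [-1, 1]:

import numpy as np

hsv_patch = np.random.rand(8, 8, 3)                      # toy HSV prediction
hsv_label = np.random.rand(8, 8, 3)                      # toy HSV ground truth

diff = np.mean(hsv_patch - hsv_label, axis=-1)           # collapse the 3 channels into one band
diff2 = 2*(diff-diff.min())/(diff.max()-diff.min()) - np.ones_like(diff)
print(diff2.min(), diff2.max())                          # -1.0 1.0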

thimabru1010 commented 3 years ago

Yeah, but it's not just that accuracy dropped; MCC got NaN values as well.

nan_values_ps64_tnmt

I did another run with a patch size of 64 and a batch size of 8, using Tanimoto with complement, and this time I didn't get NaN values (it converged up to epoch 152). The interesting result here is that this time the train and validation curves were quite far apart (for the other losses as well, except for the color transform). Keep in mind that these tests are being made with no tile division, a stride of 32, and a random split. Also, I am oversampling with rotation and flip augmentations since I didn't have many images; therefore, this training is not difficult at all. I trained with U-Net and the simple ResUnet-a and the curves were quite close together.

ps64_tnmt_lr1e-4_good

The results still seem faint, but I got a small improvement in accuracy and F1 score on the test dataset after training.

pred0_classes

pred0_color

thimabru1010 commented 3 years ago

As you said, maybe I missed some activations in the network. Maybe I misused padding in Keras on the multitasking prediction layers, since I used the 'same' padding option of Keras expecting to reproduce the use of padding=(1,1) in my implementation:

Line 126

Line 136

Line 144

I ended up adding zero-padding layers, since there is no option to specify explicit zero padding on the convolution layers directly.

But I am still in doubt whether the problem is the smaller batch size or this zero padding. Do you think the zero padding could cause this?
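
For reference, Keras' 'same' padding is itself zero padding; for a 3x3 kernel with stride 1 it produces the same output shape as an explicit ZeroPadding2D((1, 1)) followed by a 'valid' convolution, which is what MXNet's padding=(1,1) does for that kernel size. A minimal shape check, with made-up tensor sizes:

import tensorflow as tf
from tensorflow.keras import layers

x = tf.random.normal((1, 64, 64, 32))
y_same = layers.Conv2D(16, 3, padding='same')(x)
y_pad = layers.Conv2D(16, 3, padding='valid')(layers.ZeroPadding2D((1, 1))(x))
print(y_same.shape, y_pad.shape)                         # both (1, 64, 64, 16)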

thimabru1010 commented 3 years ago

PS: About your latest work, I've already seen it and it's very interesting! Actually, the purpose of implementing this is to try to do change detection on another dataset with ResUnet-a semantic segmentation. If you want to know more we could talk on LinkedIn: https://www.linkedin.com/in/thiago-matheus-bruno-da-silva-9b645617a/ .

feevos commented 3 years ago

Hi @thimabru1010, great that you posted your code. I am pretty sure this is a bug issue; some comments (at first glance):

  1. You need a relu activation after the middle psp pooling. You also need a relu activation after the last psp_pooling. These activations make a difference.
  2. KL.UpSampling2D works differently from my implementation of upsampling: Keras just repeats the pixel values, while I use interpolation followed by a convolution layer.
  3. In your implementation of multitasking, you are not imposing causality: you output 3 layers that depend on the extracted features x_psp, but they do not relate to each other. In our implementation, the algorithm first predicts the distance transform, then re-uses that together with the extracted features (x_psp) to calculate the boundary, then re-uses both the distance and boundary predictions to calculate the segmentation mask. The color is independent, coming from x_psp (see the sketch after this list).
  4. The psp pooling operator you define differs from our implementation: a. You first apply the convolution to the reduced size and then you upsample; the correct order is pooling --> upsampling to the original size --> convolution. b. We are using normed convolutions (conv2d + batchnorm); you are using plain convolutions.
  5. The combine operation you are defining is different from ours.

These are all at a first glance and all of these affect the performance of the network. Hope this helps!
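
A rough Keras sketch of that conditioned head, as described above (the widths, kernel sizes and activations are illustrative guesses, not the reference code):

from tensorflow.keras import layers

def conditioned_multitask_head(x_psp, n_classes):
    # 1) distance transform from the shared features
    d = layers.Conv2D(32, 3, padding='same', activation='relu')(x_psp)
    distance = layers.Conv2D(n_classes, 1, activation='sigmoid', name='distance')(d)
    # 2) boundary re-uses the distance prediction together with x_psp
    b = layers.Concatenate()([x_psp, distance])
    b = layers.Conv2D(32, 3, padding='same', activation='relu')(b)
    boundary = layers.Conv2D(n_classes, 1, activation='sigmoid', name='boundary')(b)
    # 3) segmentation re-uses both the distance and the boundary predictions
    s = layers.Concatenate()([x_psp, distance, boundary])
    s = layers.Conv2D(32, 3, padding='same', activation='relu')(s)
    segmentation = layers.Conv2D(n_classes, 1, activation='softmax', name='segmentation')(s)
    # 4) color (HSV) reconstruction depends only on x_psp
    color = layers.Conv2D(3, 1, activation='sigmoid', name='color')(x_psp)
    return distance, boundary, segmentation, color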

thimabru1010 commented 3 years ago

Oh I really appreciate what you are doing. Thank you for your suggestions!

I took the backbone from another repository that was already in Keras... so I didn't inspect the implementation deeply. Some considerations about your comments:

  1. Well, Keras also has interpolation='nearest', and it's the default. Does nearest interpolation work differently in MXNet? The use of a convolution before the upsampling at the decoder step did indeed seem odd to me.

  2. I know you proposed the conditioned multitasking, but for simplification and testing purposes I am beginning with simple multitasking.

  3. a. When you mention pooling, do you mean MaxPooling --> ConvNormed, and then for all the other steps upsampling --> ConvNormed?

  4. I wasn't using a BatchNorm layer at the end of combine. Was this the difference? (Remember I am doing the upsample step before the combine layer.) Going back to the BatchNorm layer: does it make sense to use a BatchNorm at the end of combine, since its output will enter a ResBlock which begins with a BatchNorm layer? That way we'd have two consecutive BatchNorm layers.

Here is the corrected model, if you want to check it out.

feevos commented 3 years ago

Hi @thimabru1010

  1. Delete lines 53--59. This will give you the correct order: max pool --> upsample --> conv.
  2. The max pooling with kernel size = 1 is the identity operator; you don't need it.
  3. Also, the pooling kernel must be a fraction of the input. The numbers you have in your implementation work only for one particular input feature size to the pooling operator (i.e. 8x8). The kernel size must be (for splitting the original image into 4 sub-images) kernel = width//2, and so on. Check line 61. It should be straightforward to translate the latter code into TF (frankly, I believe it would be easier for you to train on mxnet directly :) ).
  4. With regard to using normed convolutions, activation positions, etc.: I've done hundreds of experiments, not all of which I remember. I ended up using the architecture in the manuscript. From experience, even a wrong activation makes a difference (somehow... like the relu in the middle after psppooling). So I cannot explain logically (or mathematically) why BatchNorm was working better in the configuration I ended up with (and I no longer have in memory all the possible configurations that I tested). I suggest you start from a faithful implementation of this repo, and then modify each thing you want (which may prove to be better!), one by one (so that you know what works and what doesn't).

In particular, for the combine layer, yes it does make sense, because combine is nothing more than a concatenation followed by a convolution. If you don't use normalization in all steps, you end up with two (or more) convolution operators applied back to back (the first fixes the channel numbers, then the second fixes the channel numbers again), and this may cause problems.
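
Something along these lines (a Keras sketch; the kernel size and the relu placement are assumptions, not the reference implementation):

from tensorflow.keras import layers

def combine(x_up, x_skip, n_filters):
    # concatenate the (already upsampled) decoder features with the skip connection,
    # then apply a normed convolution (Conv2D + BatchNorm) to fix the channel count
    x = layers.Activation('relu')(x_up)
    x = layers.Concatenate()([x, x_skip])
    x = layers.Conv2D(n_filters, 1, padding='same')(x)
    return layers.BatchNormalization()(x)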

Hope this helps

thimabru1010 commented 3 years ago

Hi @feevos,

Answering your comments:

  1. I thought about that when you wrote it, but isn't doing the downsample (MaxPooling) and, right after, an upsample at the same scale the same thing as an identity transformation? Actually, I think you would be losing information, since MaxPooling followed by upsampling with interpolation isn't a perfect reconstruction. Therefore, to take advantage of the MaxPooling, I thought there should be some convolution layers to make the MaxPooling worthwhile. This was my first idea.

  2. So this one is like only applying normed convolutions, right? Because both the MaxPooling and the Upsampling will be of size 1.

  3. Yeah, I know it works only for fixed input feature sizes. I am planning to change that later, but for my application and the computational resources available I can only train on images with a patch size of 256 or less, so a pooling of 8 is the maximum value I can use (which is for an input size of 256). For smaller sizes I have to remove pooling layers. But yes, if someone wants to use it for a different application, this should be changed.

Well, looking at your code I now see that you are not just changing the pooling kernel size = width / 2**i (8, 4, 2, 1) depending on the layer, but also changing the stride, right? This totally changes the pooling. However, looking at the paper again, inside the PSP Pooling layer schematic you write MaxPooling(f/4, 1/layer_i). Does this mean you use a fixed kernel (8/4 = 2 for an input size of 256) everywhere and only change the stride? I am kind of confused after comparing it with your code.

Yes, maybe it would be easier, but I have never used MXNet before and was apprehensive about starting, due to the deadline I had to deliver my project. Now I have about 3 months and I am not sure what would be faster for me. But all these complications and talking with you have made me learn a lot. I really appreciate that.

feevos commented 3 years ago

Hi @thimabru1010, this will help you understand the psp pooling:

%pylab inline

import mxnet as mx
mx.npx.set_np()

# Random 256x256 "image" (NCHW), pooled over 128x128 blocks, then upsampled back to 256x256
xx = mx.np.random.rand(1, 1, 256, 256)
yy = mx.npx.pooling(xx, kernel=(128, 128), stride=(128, 128))
yy = mx.nd.UpSampling(yy.as_nd_ndarray(), scale=128, sample_type='nearest')

xx = xx.asnumpy()
yy = yy.asnumpy()

# Plot the input next to the pooled-and-upsampled result
fig = figure(figsize=(12, 6))
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)

ax1.imshow(xx[0, 0])
ax2.imshow(yy[0, 0])

and this is what you see:

image

So once you do pooling and then upscaling, you end up with an "image" of the same size as the input, but split into square blocks (the example is with kernel size = length/2); this tells the algorithm the "background/context information" in each square block. This is why it is important to upscale the layer after pooling, and before applying the convolution. Applying a convolution to a 4x4 image with kernel size = 3 mixes the output of the pooling operation, and it does not help.

  1. Yes

  2. With regard to the paper vs the code, the code is right for sure because it is the one that I used. I may have missed something in the paper, but not in the code. The code I provide is the same model we used in the paper; the difference is in the training process, where I used Horovod and distributed computing (not provided in the repository).

Happy I could be of some help; hope this gives a better understanding of the psp pooling. You should definitely change the kernel sizes from 2, 4, 8 to length / 2**i (i = 0, 1, 2, 3).
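
A possible Keras translation of that rule (a sketch only: the channel split and normalization details are simplified, and a square NHWC input with a known static size is assumed):

from tensorflow.keras import layers

def psp_pooling(x, depth=4):
    length = x.shape[1]                        # spatial size of the incoming feature map
    nfilters = x.shape[-1]
    branches = []
    for i in range(depth):
        k = length // 2 ** i                   # e.g. 256, 128, 64, 32 for a 256x256 map
        b = layers.MaxPooling2D(pool_size=k, strides=k)(x)
        b = layers.UpSampling2D(size=k, interpolation='nearest')(b)   # back to the input size
        b = layers.Conv2D(nfilters // depth, 1, padding='same')(b)    # convolve only after upsampling
        b = layers.BatchNormalization()(b)
        branches.append(b)
    out = layers.Concatenate()(branches)
    out = layers.Conv2D(nfilters, 1, padding='same')(out)
    return layers.BatchNormalization()(out)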

feevos commented 3 years ago

PS Another pooling example 1/2, 1/4

xx = mx.np.random.rand(1, 1, 256, 256)

# Pool over blocks of half the input size (splits the map into 2x2 blocks)
length = xx.shape[-1] // 2**1
yy1 = mx.npx.pooling(xx, kernel=(length, length), stride=(length, length))
yy1 = mx.nd.UpSampling(yy1.as_nd_ndarray(), scale=length, sample_type='nearest')

# Pool over blocks of a quarter of the input size (splits the map into 4x4 blocks)
length = xx.shape[-1] // 2**2
yy2 = mx.npx.pooling(xx, kernel=(length, length), stride=(length, length))
yy2 = mx.nd.UpSampling(yy2.as_nd_ndarray(), scale=length, sample_type='nearest')

xx = xx.asnumpy()
yy1 = yy1.asnumpy()
yy2 = yy2.asnumpy()

image

thimabru1010 commented 3 years ago

Now the use of pooling --> upsample makes much more sense. I found the idea very interesting! I couldn't get this point of view from the manuscript. By doing this, it seems you can separate some context information so that it is easier for the convolutions to extract information. I'll change the stride, remove the convolutions, and see the improvement. Very clever idea. Where did you get the inspiration, or was it your own idea?

Congrats again on the paper; it is full of good ideas and approaches.

feevos commented 3 years ago

Thank you for your kind words. The original idea and inspiration for the PSP pooling comes from PSPNet, although their implementation is different from ours (as well as the network/usage); from their paper:

image

We found it beneficial mostly in the middle of the network, rather than the last layer - as it is presented in our manuscript.

thimabru1010 commented 3 years ago

Hey,

I was comparing your ResUnet-a block with mine, but it seems you keep summing (Line 44) the outputs after each dilation ResBlock. You use layer_input as the input of each ResBlock (that's ok), but after that you sum it with the result of the last dilation ResBlock (Line 43).

From what I read in your paper (page 4), it seems all dilation ResBlocks (d1, d2, ..., dn) are independent, and only at the end do you sum the outputs and the input (more like in PSP Pooling, but instead of concatenating you sum).

feevos commented 3 years ago

Hi @thimabru1010, I believe this is a misunderstanding; your implementation is just like ours. If you expand the summation, you will see it is

xfinal = input + resblock1(input) + resblock2(input) + resblock3(input) + resblock4(input)

what you see in the code is the same thing, with parentheses:

xfinal = ({[(input + resblock1(input)) + resblock2(input)] + resblock3(input)} + resblock4(input))
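
In pseudo-code, with resblocks a placeholder list for the parallel dilated branches, the two forms compute the same thing:

def block_expanded(x, resblocks):
    # the form written in the paper: input plus the sum of all parallel branches
    return x + sum(block(x) for block in resblocks)

def block_running_sum(x, resblocks):
    # the form written in the code: a running sum, but each branch still sees the original x
    out = x
    for block in resblocks:
        out = out + block(x)
    return out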

All the best, Foivos

thimabru1010 commented 3 years ago

Hi @feevos,

I followed your suggestions and compared your code with my implementation. There were quite a lot of things to change. I did all of it and trained on the ISPRS Vaihingen dataset with a patch size of 256 and a stride (sliding-window step) of 16, which gives about 15k images, and used no augmentations. Well, it seems the predictions aren't as faint as before. This is a good point. I have a small feeling that the distance transform got a bit better. However, the boundary transform is behaving strangely (I can barely see predictions) and the color transform seems to be redder (wrong predictions).

image

image

image

image

I already tried different amounts of images using different strides, but got no improvements. Besides, I am training only with RGB-Infrared, not using any other band. I know using other bands could improve things, but I was expecting a considerably better result. Also, I am using a batch size of 8 and 2 GPUs now (with a smaller patch size I can push the batch size up to 16).

Do you think it's the way I am training? Or do you think I still have a bug...?

Thank you very much, Thiago

feevos commented 3 years ago

Hi @thimabru1010, it is definite you have bugs. You should be seeing better performance. Can I please see the full training script, and in particular how you are calculating the multitasking loss? If possible, also the evolution of the Tanimoto dual loss function (even just the segmentation value, without the boundary and the distance transform), and the numerical values you are seeing. If you don't want to share code publicly, send me a PM on LinkedIn.

Cheers

feevos commented 3 years ago

PS If you can avoid using Keras for the training routine and go to pure TF (version >= 2.0), then I can help more.

feevos commented 3 years ago

@thimabru1010 I just saw this blog post; the code there is in TF and they reproduced our results, so you may find useful information there (disclaimer: I haven't looked at their code, but their results look good!)

https://medium.com/sentinel-hub/parcel-boundary-detection-for-cap-2a316a77d2f6

mustafateke commented 3 years ago

Hello guys, Thanks for the efforts converting to Keras+TF. I enjoyed reading the previous replies.

Kind Regards

thimabru1010 commented 3 years ago

I ended up not using that repository anymore. After I finish my graduation thesis I plan to clean some stuff up and make the code more usable. Sorry it's so messed up.

mustafateke commented 3 years ago

I look forward to it. One thing I notice is that Python DL repositories are quite fragmented. Good luck with your graduation.

mustafateke commented 3 years ago

There is an implementation on SentinelHub: https://github.com/sentinel-hub/field-delineation