bonlime / keras-deeplab-v3-plus

Keras implementation of Deeplab v3+ with pretrained weights
MIT License

Fine tuning this model #56

Open Meight opened 5 years ago

Meight commented 5 years ago

Has anyone been able to successfully fine tune this model at all and, say, from Xception only pretrained on ImageNet?

After three weeks of tweaking and exploring, a good dozen different loss functions, and many more runs with a wide range of hyperparameters (including around those of the original paper), I still can't get the model to even overfit on a small batch from the raw Pascal VOC dataset. Consequently, I haven't been able to reproduce the original paper's results by fine-tuning this repo's model so far.

I triple-checked and unit-tested my preprocessing pipeline, which in turn is just copy/pasted from the original repo, and here is the kind of result I get during the training phase:

Feature maps during training

(The bottom-right picture is just the argmax over all classes.)

The model does converge toward the same loss value when using pixelwise cross-entropy with logits (I tried all the possible variations of that, whether by adding a softmax activation to the model or by using TF's native tf.nn.softmax_cross_entropy_with_logits_v2) with different hyperparameters, but it doesn't even begin to perform proper segmentation. I've also tried @bonlime's cost function as shared in this reply, and several variations of soft Dice loss, but the results aren't any better.

Plotting the different feature maps shows I've successfully loaded the weights of Xception pretrained on ImageNet (the model can totally discriminate objects across images), so this is not a problem.

I'm starting to seriously doubt this model is actually trainable or tunable as is, so I'd be curious to hear whether anyone has managed to train it before I dive into its detailed implementation.

bonlime commented 5 years ago

@Meight You raised a very good point. After implementing this model I also tried very hard to fine-tune it, but the results were unsatisfyingly bad. I stopped trying at the beginning of the summer. Are you aware of the Keras problem with fine-tuning? Maybe that is the reason it's impossible to tune this model: http://blog.datumbox.com/the-batch-normalization-layer-of-keras-is-broken/

bonlime commented 5 years ago

I've managed to successfully fine-tune models from this repo: https://github.com/qubvel/segmentation_models, maybe you can use them as well.

Meight commented 5 years ago

Thank you for the reply! Although I spent so much time on this for no useful result, I'm kind of glad to learn it's not just a stupid mistake I kept missing.

I should have said in my initial post that I came across that story of broken batch normalization (which is kind of crazy, to be honest, but that's another debate), but I wasn't so sure, because this issue hadn't occurred in other Keras models I've tried to fine-tune in the past. That could definitely be at least one of this model's problems, though.

I discovered the repository you linked only a few days ago and still have to adapt it to our workflow. I'm glad to learn you managed to fine-tune these models. On a side note not related to the current repo, I noticed the models are implemented using keras instead of tf.keras. Have you tried, or been able, to run these models on multiple GPUs?

I would suggest you update the readme of this repository to tell people that, as far as we know, the proposed implementation can't be trained or fine-tuned and is only valid for inference for now. Hopefully we can spare people a lot of wasted time if they're not willing to troubleshoot it themselves. I'll submit a pull request for that, if you like.

Thank you again for the reply!

pluniak commented 5 years ago

@Meight @bonlime Have you tried fine-tuning the whole model or just the last couple of layers? The link posted by bonlime says the problem stems from the fact that frozen batch normalization layers in Keras are not really frozen. If that really is the reason, fine-tuning should work when no layers are frozen (see the sketch just below). This actually matches my experience fine-tuning Inception V3 for classification in Keras: poor results when fine-tuning only the last layers, great results when fine-tuning all layers. Of course, this only works if enough training data is available for fine-tuning.
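A minimal sketch of that "no frozen layers" idea; the `Deeplabv3` factory and its argument names are assumed from this repo's model.py, so double-check them against your copy:

```python
from model import Deeplabv3  # this repo's model.py; argument names assumed from there

deeplab_model = Deeplabv3(weights='pascal_voc', input_shape=(512, 512, 3), classes=21)

# Leave nothing frozen: every layer, including BatchNormalization, stays trainable,
# so the frozen-BN behaviour described in the linked blog post cannot apply.
for layer in deeplab_model.layers:
    layer.trainable = True
```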

Meight commented 5 years ago

@pluniak I tried both cases, and each time the results were ridiculously poor. I grabbed a native TensorFlow version of DeepLab v3+, used the exact same preprocessing, and quickly got results close to those of the paper. My conclusion is that there was definitely something wrong with the model in this repo, but I stopped wasting time investigating it as soon as @bonlime confirmed he had been having similar issues.

Besides, the state of the art for semantic segmentation has evolved quite significantly since this model was published, and other alternatives exist that perform about as well. There was virtually no value for my research in investing more time in this.

pluniak commented 5 years ago

@Meight Many thanks for pointing this out. Surely you saved me a lot of time!

May I ask which other models, having evolved since then, performed equally well for you? I'm especially interested in models available in Keras or TF. Based on the Pascal VOC leaderboard, DeepLab V3+-based models still seem to be state of the art:

http://host.robots.ox.ac.uk:8080/leaderboard/displaylb.php?cls=mean&challengeid=11&compid=6&submid=6103#KEY_FCN-8s-heavy

rauldiaz commented 5 years ago

Hi,

I was able to fine-tune this network from pre-trained weights a few months ago. I did nothing special: I just loaded the model with the pre-trained Pascal VOC weights and hit train. The only difference in my case is that the number of classes is 120, so the last layer is definitely different. Other than that, the network trains and converges smoothly with great performance.
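For reference, a rough sketch of that setup using this repo's `Deeplabv3` factory; the argument names (`weights`, `input_shape`, `classes`, `backbone`) are assumed from model.py, and `classes=120` is just the example from the comment above:

```python
from model import Deeplabv3  # this repo's model.py

# Pascal VOC checkpoint, but with a different number of output classes,
# so only the last (logits) layer differs from the published model.
deeplab = Deeplabv3(weights='pascal_voc',
                    input_shape=(512, 512, 3),
                    classes=120,
                    backbone='xception')
deeplab.summary()
```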

trungpham2606 commented 5 years ago

@rdiazgar Can you show me some of your results? I am intending to fine-tune this repo's model but was hesitant after reading the author's readme.

rauldiaz commented 5 years ago

@trungpham2606, sorry but I'm afraid I can't show you any results, as they are currently submitted to a conference and hence I must keep them confidential.

What I meant to say with my post is that I certainly had no problems taking this network with pretrained weights and fine-tuning it on a different dataset (KITTI). In my case, I just loaded the deeplab model with the 'pascal_voc' weights and a different number of categories to classify (120 labels). Then I simply followed standard Keras training with a custom data generator to feed the network and opted for a small learning rate (1e-3), except for the last layer, which had a learning rate 10x larger (1e-2). This was my fine-tuning strategy and it has certainly worked without any problems so far.

I was also surprised to recently see the README section claiming it can't be fine-tuned. Perhaps they refer to other strategies for fine-tuning, like freezing all but the last layers. I can only say that in my experience, I have not encountered any problems using this network, either training from scratch or fine-tuning from pre-trained weights.

Raul

trungpham2606 commented 5 years ago

@rdiazgar Oh. First, thank you for your quick response. I will try to fine-tune this model according to your fine-tuning pipeline. Best, Trungpham

duchengyao commented 5 years ago

Downgrading from TensorFlow 1.11/1.12 to 1.10 might solve the problem, or not using tf.keras.

hfurkanbozkurt commented 5 years ago

@Meight I am having the same problem. I can tune it a little bit but the accuracy is very bad (less than 0.5) even after a good amount of training time. Did you manage to get at least more than 0.5 accuracy?

Meight commented 5 years ago

@hfurkanbozkurt This is about the range I was able to reach too (~0.47-0.48). When fine tuning the pure TF implementation I have now I was able to reach results close to those of the paper. I have no clue what was wrong with my pipeline when I tried using this Keras implementation since it works flawlessly with the TF implementation with no modification whatsoever.

Seeing some people commenting here that they could fine-tune it successfully baffles me, since there also seem to be many people who haven't been able to, and I spent about three weeks on this and probably checked every single line of code 10 times. This will remain a mystery as far as I'm concerned... Good luck if you keep working on it!

kritiyer commented 5 years ago

I was successfully able to retrain on my custom dataset from the pre-loaded weights (I haven't tried fine-tuning the decoder only).

After combing through the issues on here, here is the list of changes I made (a quick sketch follows below):

1) Labels must have shape (image_size, image_size, num_classes), unlike the TF implementation where labels are (image_size, image_size).
2) Use preprocess_input() from the model to scale input images to values in [-1, 1].
3) Add a sigmoid activation to the last layer of the model (I have a binary segmentation problem, but I think softmax should work too?).
4) Don't use any data augmentation from ImageDataGenerator().

I hope this helps someone!
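Here is a quick sketch of steps 1–3; the shapes, the random placeholder data, and the helper name are illustrative, and the scaling function mirrors what preprocess_input() is expected to do (map pixels to [-1, 1]):

```python
import numpy as np
from keras.utils import to_categorical

def scale_to_minus1_1(x):
    # same effect as the repo's preprocess_input(): uint8 pixel values -> [-1, 1]
    return x / 127.5 - 1.0

image = np.random.randint(0, 256, (512, 512, 3)).astype('float32')  # placeholder image
mask = np.random.randint(0, 2, (512, 512))                          # binary mask, values {0, 1}

x = scale_to_minus1_1(image)                 # (512, 512, 3), values in [-1, 1]  (step 2)
y = to_categorical(mask, num_classes=2)      # (512, 512, 2), one-hot labels     (step 1)

# Step 3 would additionally wrap the model output in a sigmoid (or softmax) Activation.
```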

trungpham2606 commented 5 years ago

@kritiyer Can you provide some of the result images you got? :3

kritiyer commented 5 years ago

@trungpham2606 I'm working with medical image data, so I'm not comfortable posting the images here, but I promise it's working! I did have to use the datumbox Keras fork so that frozen BatchNormalization layers (trainable=False) work properly and give decent results: https://github.com/datumbox/keras/tree/fork/keras2.2.4

wave-transmitter commented 5 years ago

@bonlime @Meight Hello, just to make it totally clear: is it possible to train a model end-to-end (without any frozen layers) with VOC-dataset weight initialization? If not, do you have any idea why this is happening?

@kritiyer @rdiazgar Can you please report some results in terms of mIoU, and elaborate a bit more on the steps you followed to train the model? E.g., why should someone not use ImageDataGenerator()?

rauldiaz commented 5 years ago

Hi @wave-transmitter ,

Yes, it is possible to train this model end-to-end without any frozen layers. I have successfully used this model with the mobilenetv2 and xception backbones, from scratch, from the pascal-voc weights, and even from the cityscapes weights (see #67). The dataset I used to train my model is not Pascal VOC, but KITTI.

Unfortunately, I cannot share any results as of now because my work is under a conference confidentiality policy. I will certainly post some results when the conference proceedings become public.

In my personal case, I instantiated the model with or without the pre-trained weights, never froze a layer, and trained the model via a custom image data generator that feeds the images (normalized by 1./255) and their corresponding ground-truth values (a rough generator sketch follows after this comment). I did not use the ImageDataGenerator available in Keras, but I see no reason why this should be the problem.

Best, Raul
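For illustration, a minimal custom generator in the spirit of what is described here (images scaled by 1./255, masks one-hot encoded); the class name and its attributes are made up for this sketch:

```python
import numpy as np
from keras.utils import Sequence, to_categorical

class SegmentationSequence(Sequence):
    """Yields (images / 255., one-hot masks) batches for model.fit_generator()."""

    def __init__(self, images, masks, batch_size, num_classes):
        self.images = images            # list of (H, W, 3) uint8 arrays
        self.masks = masks              # list of (H, W) integer masks
        self.batch_size = batch_size
        self.num_classes = num_classes

    def __len__(self):
        return int(np.ceil(len(self.images) / self.batch_size))

    def __getitem__(self, idx):
        sl = slice(idx * self.batch_size, (idx + 1) * self.batch_size)
        x = np.stack(self.images[sl]).astype('float32') / 255.0
        y = to_categorical(np.stack(self.masks[sl]), self.num_classes)
        return x, y
```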

kritiyer commented 5 years ago

@wave-transmitter Hello, I also successfully trained using both Mobilenet and Xception (from the Pascal weights), and was able to fine-tune the decoder as well as train from scratch with frozen batch normalization layers (I don't have enough GPU memory to train the batch normalization layers). So far the best Dice score I got for a binary classification problem is 0.97.

I used an ImageDataGenerator to feed in my data because it was too large to load into memory, but if I used any of the data augmentation arguments (rotate, shear, flip, etc.) I got garbage results, and I'm not sure why. I listed the steps I took to train in my comments above. I'm using tensorflow-gpu 1.10 and Keras 2.2.4 (datumbox fork, linked above).

Licini commented 5 years ago

hi @rdiazgar ,

Would you mind also sharing which optimizer and loss function you were using? Thanks in advance!

rauldiaz commented 5 years ago

Hi @Licini,

Sure. I simply used SGD with momentum=0.9 and a learning rate of 0.001. The loss is cross-entropy.
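Assuming `deeplab_model` is the already instantiated network, that configuration in Keras 2.2.x terms is roughly:

```python
from keras.optimizers import SGD

# deeplab_model: the instantiated DeepLabv3+ network (see earlier comments)
deeplab_model.compile(optimizer=SGD(lr=0.001, momentum=0.9),
                      loss='categorical_crossentropy')
```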

wave-transmitter commented 5 years ago

Thank you both for your detailed answers.

@rdiazgar Is it possible to share your model's accuracy in terms of IoU? No need to share inference results. Also, for how many epochs did you train the model end-to-end, and what batch size did you use?

@kritiyer Can you please also let us know your choices of optimizer, learning rate, and batch size? Similarly, for how many epochs did you train your model?

rauldiaz commented 5 years ago

Hi @wave-transmitter,

Truth be told, I am not using this model for semantic segmentation, so I don't have any quantitative measure of intersection over union. I am training this model for monocular depth estimation.

I trained the model for about 30 epochs with a batch size of 4, which is about 300k iterations for the KITTI training set. The input images are random crops of 375x513 pixels.

Raul

Licini commented 5 years ago

@rdiazgar Thanks for sharing! I was able to retrain a simple two-class version using mobilenetv2 with no frozen layers, and it worked pretty well. For anyone who's interested: I used binary cross-entropy, with one object class and one background class. My dataset was about 8k images without any augmentation, trained for 10 epochs with a batch size of 8. I don't have any IoU measurements yet, but it at least looks right to my eyes.

pluniak commented 5 years ago

@kritiyer @rdiazgar @Licini Thanks for your input! Can you please tell which versions of TF/Keras you were using?

Philipp

Licini commented 5 years ago

Sure @pluniak, I was using Keras 2.2.4 with tensorflow-gpu 1.8.0.

rauldiaz commented 5 years ago

@pluniak

I used keras 2.2.4, and tensorflow-gpu 1.9.0 in one machine and 1.12.0 in another one.

pluniak commented 5 years ago

I have also successfully fine-tuned this model. I did nothing special: TF 1.13.1 (GPU), Keras 2.2.4, binary_crossentropy, Adamax (default params), label shape (height, width, num_classes). Keras ImageDataGenerator and class_weights also work. I passed in numpy arrays. It converges quickly with reasonable performance.

@kritiyer @rdiazgar @Licini One thing that surprised me, though, is that there is no sigmoid or softmax activation in the last layer, so output values range from -40 to +10 in my case (1 class only). Classifying these values by >0.5 gives me better results than adding a sigmoid activation after the last layer, because hardly any output values are >0.5 after the sigmoid. Did anybody else notice the same tendency towards negative output values? Where does this come from? Training longer on my limited training set doesn't help. Also interesting is the fact that IoU/Jaccard on validation data is at the same level as on training data: the model converges quickly but doesn't overfit at all. Any explanations for this? Is it possibly a model-bias problem? I'd be glad for some comments... :-)

rauldiaz commented 5 years ago

Hi @pluniak ,

Regarding the lack of activation in the last layer, I believe this is just for convenience. For instance, if you want to classify your pixels via a softmax function, all that function does is take the raw output logits and turn them into a probability distribution. However, this is only useful from a training point of view, because those probabilities are used for computing the loss function (e.g., cross-entropy). At test time, you only care about which output logit has the highest value (argmax), and you don't need to call softmax to do that. Plus, by not using softmax at test time, you save some computation, because exponentials and logarithms are quite expensive operations.

If you check Keras' docs and code, you'll see that most of the loss functions defined there have an optional parameter named 'from_logits' that accounts for exactly that: when True, the loss applies a softmax internally before computing the loss; when False, it assumes the network's last layer already includes a softmax (a small sketch follows below).

Best Raul
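A small sketch of the two equivalent setups against the Keras 2.2.x backend API; `deeplab_model` is assumed to output raw logits, and the optimizer choice here is arbitrary:

```python
import keras.backend as K
from keras.layers import Activation
from keras.models import Model

def crossentropy_from_logits(y_true, y_pred):
    # y_pred are raw logits straight out of the last conv layer
    return K.categorical_crossentropy(y_true, y_pred, from_logits=True)

# Option 1: keep the logits and let the loss apply softmax internally.
deeplab_model.compile(optimizer='sgd', loss=crossentropy_from_logits)

# Option 2: append a softmax and use the stock loss (which expects probabilities).
probs_model = Model(deeplab_model.input, Activation('softmax')(deeplab_model.output))
probs_model.compile(optimizer='sgd', loss='categorical_crossentropy')
```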

pluniak commented 5 years ago

@rdiazgar Thanks. Makes sense :-)

trungpham2606 commented 5 years ago

@rdiazgar Hello bro, I want to ask you about the preprocessing part. Did you normalize the images to [-1, 1] or some other range? I tested normalizing images to the range [-1, 1] and the results I got were quite poor.

rauldiaz commented 5 years ago

Hi,

The range [0, 1] worked best for me. You're right, [-1, 1] gave me worse results.
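For concreteness, the two input scalings being compared (placeholder image; the [-1, 1] mapping is the usual "tf-mode" preprocessing):

```python
import numpy as np

image = np.random.randint(0, 256, (512, 512, 3)).astype('float32')  # placeholder image

x_01  = image / 255.0         # range [0, 1]  -- worked best in this case
x_m11 = image / 127.5 - 1.0   # range [-1, 1] -- gave worse results above
```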

trungpham2606 commented 5 years ago

@rdiazgar
Hello bro, I saw that in train.py (old version) the author provided code to load .npy weights. I don't know what the difference is between setting weights='pascal_voc' and using that code?

rauldiaz commented 5 years ago

I never used the train.py script from this repo. I have my own training script and simply instantiate the DeepLabv3+ model. The 'pascal_voc' weights are simply a model checkpoint that has learned to segment images from the Pascal VOC dataset. You can also use weights='cityscapes' to start your training from a pre-trained checkpoint oriented to autonomous driving.

trungpham2606 commented 5 years ago

@rdiazgar Oh, thanks rdiazgar. I will try.

FreedomGu commented 5 years ago

Hi @rauldiaz, I have a problem with the labels: should I feed the labels into model.fit() with shape (number of images, image shape, classes)? I am confused because the result I get from .predict() is all 0, yet the accuracy is still very high.

rauldiaz commented 5 years ago

If I understand your question right, you're asking what shape your ground-truth labels should be, right?

That depends on what the loss function needs. For instance, sparse_categorical_crossentropy expects the labels to be simply the number associated with each class, while categorical_crossentropy expects the labels to be one-hot coded vectors for each class.

In a segmentation scenario like this, if you are using categorical_crossentropy as a loss function, the shape of your labels should be (batch_size, image_height, image_width, classes). If you choose the sparse loss version, the shape should be (batch_size, image_height, image_width, 1).
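A shape-only sketch of the two label formats just described (all numbers arbitrary):

```python
import numpy as np

batch_size, height, width, classes = 4, 512, 512, 21

# sparse_categorical_crossentropy: one integer class id per pixel
y_sparse = np.random.randint(0, classes, (batch_size, height, width, 1))

# categorical_crossentropy: one-hot vector per pixel
y_onehot = np.eye(classes, dtype='float32')[y_sparse[..., 0]]

print(y_sparse.shape, y_onehot.shape)  # (4, 512, 512, 1) (4, 512, 512, 21)
```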

pissw2016 commented 5 years ago

Hi, has anyone managed to fine-tune on VOC successfully? I am trying to reproduce the result of DeepLabv3+ without the decoder; according to the paper it is 81.34 (with train OS 16, eval OS 8). I just froze the whole encoder, and the tail of the model looks like this:

end (Dropout)            (None, 64, 64, 256)    0       activation_76[0][0]
conv_upsample (Conv2D)   (None, 64, 64, 21)     5397    end[0][0]
lambda_1 (Lambda)        (None, 512, 512, 21)   0       conv_upsample[0][0]
reshape_1 (Reshape)      (None, 262144, 21)     0       lambda_1[0][0]
pred_mask (Activation)   (None, 262144, 21)     0       reshape_1[0][0]

Total params: 41,093,045
Trainable params: 5,397
Non-trainable params: 41,087,648

First epoch result:

1464/1464 [==============================] - 1594s 1s/step - loss: 1.4667 - Jaccard: 0.4825 - sparse_accuracy_ignoring_last_label: 0.7597 - val_loss: 0.4599 - val_Jaccard: 0.7528 - val_sparse_accuracy_ignoring_last_label: 0.9455

which is really weird. Jaccard is roughly equivalent to mIoU (details from Golbstein).

BN depends on the data, so when the data distribution changes, the BN parameters should change too. Whether to freeze or not is not really the question: if the data is changing, as in transfer learning, I believe the BN layers should not be frozen, so they can catch the distribution of the new data.

So I think fine-tuning might need the data and data augmentation to be exactly the same as in the first-stage training.

zhangbo2008 commented 4 years ago

May I ask which paper you want to reproduce?

lauraset commented 4 years ago

Hi @rauldiaz, I successfully fine-tuned this model on my own dataset. But when I checked the detailed network structure, I found obvious differences between this model and the original Xception (in keras.applications). They are as follows:

1) In the entry flow, all max-pooling layers are replaced with separable convolutions.
2) The number of blocks in the middle flow is changed to 16, while the original is 8.
3) In the exit flow, average pooling is dropped.

I am not sure what effect these changes have on the final results.

rauldiaz commented 4 years ago

Hi @lauraset ,

Your question seems better targeted at the original author of this repository (@bonlime) than at me.

pimonteiro commented 3 years ago

@rauldiaz Hello! Sorry for digging up such an old thread, but I'm having really bad results with the mobilenetv2 version and the cityscapes weights. The Xception version works amazingly well, but mobilenetv2 returns a very blurry segmentation. The dataset I'm using is KITTI-360.

The only thing I modified in the model was line 172 of model.py: I changed

in_channels = inputs.shape[-1].value # inputs._keras_shape[-1]

to

in_channels = inputs.shape.as_list()[-1]

because it was causing an error while creating the model (the change was suggested in issue #125).

Did you go through something similar?

Thunder003 commented 3 years ago

Hey @Meight, would you please share the steps you followed in fine-tuning the official DeepLabv3+ repo? I was tuning it for two classes (background + one foreground), but after some iterations all of my image pixels start to take a single value (1 in my case). Also, please let me know which TensorFlow version you used.