aimerykong / Recurrent-Scene-Parsing-with-Perspective-Understanding-in-the-loop

CVPR2018 - scene parsing network regulated by geometric prior
https://www.cs.cmu.edu/~shuk/recurrentDepthSeg.html

TensorFlow reimplementation (Issue #8)

Closed · hondaathma closed this 6 years ago

hondaathma commented 7 years ago

Hi @aimerykong, thanks a lot for all your support. I am reimplementing your MatConvNet code in TensorFlow and was also able to add the mask gating layer. I will post it to GitHub as soon as possible.

However, I am not able to reach the same accuracies as you do in the paper (with or without the recurrent loop). More specifically, I am talking about these two accuracies:

[image: image] https://user-images.githubusercontent.com/26149657/29737317-58ae364e-89c0-11e7-8dd0-c2928310ecdc.png

There is only an improvement of 0.2 percent, as opposed to your 2% improvement. Unlike you, I am not using any data augmentation techniques beyond random cropping. To be more specific, could you please look at the following two steps I followed:

[image: image] https://user-images.githubusercontent.com/26149657/29737344-ae372030-89c0-11e7-9044-0edf1a3f74f6.png

after this:

[image: image] https://user-images.githubusercontent.com/26149657/29737360-080789ce-89c1-11e7-820d-525280b7e5db.png

Note that in Step 2, after freezing the weights, I apply learning rate multipliers of 10, 20, and 1 to the convolutional weights, biases, and batch normalization layers, respectively. Am I missing something? Could you please help me figure out my mistake? I used a batch size of 1 (800x800 image and quantized depth map). I also used a training policy similar to DeepLab, and even tried without the additional res5d block, but to no avail. For testing, I take the final res7_conv layer and bilinearly upsample it to 1024x2048. Maybe I am testing wrong? Please do let me know what you think.
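In TensorFlow 1.x, one way to mimic such Caffe/DeepLab-style learning-rate multipliers is to scale the gradients per variable group before applying them. A minimal sketch, assuming illustrative variable names and a dummy loss:

```python
import tensorflow as tf  # TF1.x graph API

# Dummy parameters standing in for conv weights, biases, and BN parameters;
# the real loss would be the segmentation cross-entropy built elsewhere.
w = tf.get_variable('conv/weights', shape=[3, 3, 64, 64])
b = tf.get_variable('conv/biases', shape=[64])
gamma = tf.get_variable('bn/gamma', shape=[64])
loss = tf.reduce_sum(w ** 2) + tf.reduce_sum(b ** 2) + tf.reduce_sum(gamma ** 2)

base_lr = 2.5e-4  # assumed base rate; a DeepLab-style poly schedule could replace it
opt = tf.train.MomentumOptimizer(base_lr, 0.9)

scaled_grads = []
for grad, var in opt.compute_gradients(loss):
    if grad is None:
        continue
    if 'gamma' in var.name or 'beta' in var.name:
        mult = 1.0    # batch-normalization parameters
    elif 'biases' in var.name:
        mult = 20.0   # biases
    else:
        mult = 10.0   # convolution weights
    scaled_grads.append((grad * mult, var))
train_op = opt.apply_gradients(scaled_grads)
```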

Some result examples: [image: image] https://user-images.githubusercontent.com/26149657/29737392-aa82715a-89c1-11e7-9991-21856aac6aad.png

aimerykong commented 7 years ago

Hi, Athma,

Good pictures!

Let me start with the base model and describe the following steps --

  1. Over the base model, specifically over the last layer of resblock5 (2048 dimensions), I train two new layers just for depth classification while freezing all the layers below. When I freeze the layers, I use the global mean and variance in all batch normalization layers. I'm not sure how to do this in TensorFlow (see the sketch after this list).
  2. After Step-1, I freeze the depth branch and fine-tune the layers for segmentation only. This means I didn't use multiple losses, only the softmax loss for semantic segmentation. Here I use the layers from the base model and just apply atrous convolution with various dilation rates according to the quantized depth bin. I only fine-tune the layers that belong to semantic segmentation. Moreover, I didn't let them share weights -- each scale has its own convolution kernels, just initialized from the base model.
  3. After Step-2, I fine-tune more layers, covering both segmentation and depth classification, still using only one loss (for semantic segmentation). Furthermore, I fine-tune more layers, including resblock4 and resblock5.
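Since the question of how to freeze layers and batch-norm statistics in TensorFlow comes up here, the following is a minimal TF1.x-style sketch of Step 1: build the backbone's batch-norm layers in inference mode so they use the stored moving mean/variance, and hand the optimizer only the new depth-head variables. All names, shapes, and the bin count are illustrative assumptions:

```python
import tensorflow as tf  # TF1.x graph API; an illustrative sketch, not the authors' code

def backbone_block(x, is_training):
    """Stand-in for one piece of the ResNet base model. The key detail:
    with training=False the batch-norm layer normalizes with its stored
    global (moving) mean/variance; trainable=False also freezes gamma/beta."""
    x = tf.layers.conv2d(x, 64, 3, padding='same', name='base/conv1')
    x = tf.layers.batch_normalization(x, training=is_training,
                                      trainable=is_training, name='base/bn1')
    return tf.nn.relu(x)

images = tf.placeholder(tf.float32, [None, 800, 800, 3])
feats = backbone_block(images, is_training=False)  # frozen base with global BN stats

# Two new layers trained for depth classification only (num_depth_bins is assumed).
num_depth_bins = 5
x = tf.layers.conv2d(feats, 512, 3, padding='same',
                     activation=tf.nn.relu, name='depth/conv1')
depth_logits = tf.layers.conv2d(x, num_depth_bins, 1, name='depth/logits')

# Hand the optimizer only the new head's variables so everything below stays fixed.
depth_vars = [v for v in tf.trainable_variables() if v.name.startswith('depth/')]
```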

Perhaps one major difference from your experiment is that I use a single scale to train the base model, rather than the multiple atrous convolution scales of DeepLab. To train the base model I follow the pipeline of PSPNet, which inserts intermediate supervision in resblock4 and includes pyramid pooling in the very top layers. I suspect that including multi-scale atrous convolution in your base model decreases the power of the depth gating module. I'm not sure, but it's worth thinking about, because this is the major difference I see.
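To make Step 2 above concrete, here is a hedged TF1.x sketch of the depth-gated atrous idea: one set of convolution kernels per dilation rate, blended per pixel by the predicted depth-bin probabilities. The dilation rates, bin count, and shapes are illustrative assumptions, not the paper's exact configuration:

```python
import tensorflow as tf  # TF1.x graph API

def depth_gated_atrous(feats, depth_probs, num_out, rates=(1, 2, 4, 8, 16)):
    """One 3x3 conv per dilation rate -- each scale has its own kernels --
    with the outputs blended per pixel by the depth-bin probabilities."""
    outs = []
    for i, rate in enumerate(rates):
        out = tf.layers.conv2d(feats, num_out, 3, padding='same',
                               dilation_rate=rate, name='seg/conv_rate%d' % rate)
        gate = depth_probs[..., i:i + 1]  # [N, H, W, 1] weight for this scale
        outs.append(out * gate)
    return tf.add_n(outs)

feats = tf.placeholder(tf.float32, [None, 100, 100, 2048])      # res5 features
depth_logits = tf.placeholder(tf.float32, [None, 100, 100, 5])  # from the depth branch
depth_probs = tf.nn.softmax(depth_logits)                       # soft gating mask
seg_feats = depth_gated_atrous(feats, depth_probs, num_out=512)
```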

The way you run testing is the same as mine -- feed in the original image, upsample the softmax score maps to 1024x2048, and take the argmax as the prediction.
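A minimal sketch of that test-time procedure (the class count and logit resolution below are assumptions):

```python
import tensorflow as tf  # TF1.x graph API

# Raw scores from the final layer (e.g. res7_conv), at feature resolution.
logits = tf.placeholder(tf.float32, [1, 128, 256, 19])  # 19 Cityscapes classes assumed
probs = tf.nn.softmax(logits)
probs_full = tf.image.resize_bilinear(probs, [1024, 2048])  # full Cityscapes resolution
prediction = tf.argmax(probs_full, axis=3)                  # [1, 1024, 2048] label map
```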

Thanks for the update, and I'm happy to discuss further!

Regards, Shu

hondaathma commented 7 years ago

Hi @aimerykong, thanks a lot for the reply. What I understood is that your alternating loss scheme is different from what I tried: I just had multiple losses propagating from depth and segmentation, which might not be the right thing to do. I just have a few more final questions:

  1. "Over the base model, specifically over the last layer of resblock5 (2048 dimension), I train two more new layers just for depth classification while freezing all the layers below. When I freeze the layers, I use the global mean and variance in all batch normalization layers." Q) You mean that after training the base network for segmentation you stored the global mean and variance somewhere, and when training for depth you applied these mean and variance values to all the batch-norm layers in the base network only, i.e. the res1-res5 blocks but not the depth branch?

  2. "Here I use the layers from base model, just apply atrous convolution with various dilate rate according to the quantized depth bin. Moreover, I didn't let them share weights -- each scale has its own convolution kernels, just initialized from the base model." Q) I didn't quite understand this part. Are you talking about the scale as the dilation parameter of the atrous convolution? These atrous convolutions are the ones in the pyramid pooling block of my diagram, right?

  3. "After Step-2, I fine-tune more layers including segmentation and depth classification. I also use only one loss for semantic segmentation. Furthermore, I fine-tune more layers including resblock4 and resblock5." Q) Do you mean you added more layers and fine-tuned them? Are you talking about the recurrent loop models?

  4. "Perhaps one major difference from your experiment is that, I use a single scale to train the base model, other than multiple atrous convolution scales in DeepLab. To train the base model I turn to the pipeline of PSPNet, which inserts intermediate supervision in resblock4 and includes pyramid pooling at very top layers. I guess including multi-scale atrous in your base model decreases the power of the depth gating module?" Q) I think you are right. Maybe the res5c layer I used is not good enough to begin with, compared to the res5c in the PSPNet pipeline. So what scale did you use for training the base model? Also, for PSPNet's deep supervision you said you pulled out the res4b22 layer for an additional auxiliary loss, but the numbers of channels don't match, right? I mean, how did you deal with the res4b22 layer outputting, say, 1024 channels, as opposed to num_classes as in the final res5c layer? (See the aux-head sketch after this list.)

  5. In your paper you mention that you unroll the recurrent part of the network one loop at a time. Does that mean you trained BASEMODEL+RecurrentLAYER1 alone, then froze those weights and trained BASEMODEL+RecurrentLAYER1+RecurrentLAYER2, and so on?
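For reference on the channel question in point 4: PSPNet's auxiliary head attaches its own small classifier to the res4b22 features, mapping the 1024 channels down to num_classes itself, so the channel counts need not match the main head. A hedged TF1.x sketch of that pattern (the 256-channel width and all names are assumptions; the 0.4 loss weight is the one reported in the PSPNet paper):

```python
import tensorflow as tf  # TF1.x graph API

def aux_head(res4_feats, num_classes, is_training):
    """Auxiliary classifier on intermediate features: its own convolutions
    map the 1024 input channels down to num_classes, so no channel match
    with the main res5c head is needed."""
    x = tf.layers.conv2d(res4_feats, 256, 3, padding='same', name='aux/conv')
    x = tf.layers.batch_normalization(x, training=is_training, name='aux/bn')
    x = tf.nn.relu(x)
    return tf.layers.conv2d(x, num_classes, 1, name='aux/logits')

res4 = tf.placeholder(tf.float32, [None, 100, 100, 1024])  # e.g. res4b22 output
aux_logits = aux_head(res4, num_classes=19, is_training=True)
# total_loss = seg_loss + 0.4 * aux_loss  # 0.4 is the aux weight reported by PSPNet
```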

May I ask whether you have the training script for the base model in MatConvNet? I see you have provided only the script for the recurrent part.

Thanks a lot !

aimerykong commented 7 years ago

Hi, Athma,

First of all, when we train the models for semantic segmentation, our thought is that the major, or perhaps the only, task is segmentation, not depth estimation. We train the depth estimation branch just for better segmentation. Therefore, we didn't adopt a multi-task setup to take care of both; we focus on the segmentation.

  1. Yes. I freeze the batch normalization layers and use the global statistics (mean and variance) from the base model. I don't know how TensorFlow supports this, but MatConvNet enables me to modify the code to train new layers while fixing all layers below [http://www.vlfeat.org/matconvnet/mfiles/vl_nnbnorm/]. I'm sure that TensorFlow also has some mechanism for storing the global variance and mean; the only thing that depends on the batch is the local, batch-level mean and variance.

As for the depth branch, I include it when fine-tuning for segmentation. This means we care more about segmentation.

  2. For atrous convolution, you are right. The scale means the dilation rate of the atrous convolution.

  3. This step is still in the feed-forward pathway, not yet including the recurrent module. Once I get good new layers trained, I further fine-tune more layers to improve the model. All of this is done before training the recurrent module.

  4. For the output scale of the base model, I use the same trick as DeepLab. Specifically, I increase the dilation rate from 1 to 2 and 4 for the convolution layers in resBlock4 and resBlock5, respectively; at the same time, I don't allow them to downsample. This is the same as what DeepLab does (see the sketch after this list).

  5. Once we train a good feed-forward model, we start to train the recurrent model. The trick to successfully training the recurrent model is to train the first loop first, then train the 2nd loop starting with the same weights as trained in the 1st loop. NOTE the handling of the last layer: we fix the last classification layer for all the loops and only train the layers in between. Our thought is that, since all loops should predict segmentation for the pixels and the prediction on most pixels is deterministic, the statistics of the pixel features should not change too much. This also means the last classification layer can be fixed for all loops without a big problem; in practice, this also guarantees a good training procedure. When training more loops, we add a loss to each of the loops, but on our end we only tried training the loops one by one. It would be worth comparing that to training all loops simultaneously, but we don't have enough computational resources to do so. Anyway, after training the loops, we fine-tune all loops again, including ALL classification layers.
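A hedged sketch of the DeepLab-style output-stride trick from point 4 above: keep stride 1 in the deeper blocks and dilate their 3x3 convolutions instead (rate 2 in resBlock4, rate 4 in resBlock5). The residual unit below is deliberately simplified; real bottleneck blocks have more structure:

```python
import tensorflow as tf  # TF1.x graph API

def res_unit(x, rate, name):
    """Simplified residual unit with stride 1 and a given dilation rate."""
    channels = x.get_shape().as_list()[-1]
    with tf.variable_scope(name):
        y = tf.layers.conv2d(x, channels, 3, padding='same',
                             dilation_rate=rate, activation=tf.nn.relu)
        y = tf.layers.conv2d(y, channels, 3, padding='same', dilation_rate=rate)
        return tf.nn.relu(x + y)  # identity shortcut, no downsampling

x = tf.placeholder(tf.float32, [None, 100, 100, 1024])
x = res_unit(x, rate=2, name='block4_unit')  # resBlock4: stride 1, dilation 2
x = res_unit(x, rate=4, name='block5_unit')  # resBlock5: stride 1, dilation 4
```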

For training the base model, I don't have a MatConvNet script... Previously I used the PSPNet Caffe script and converted the result to a MatConvNet model. But I have seen people use TensorFlow to train PSPNet.

Regards, Shu

aimerykong commented 7 years ago

Sorry, I found I had messed up some details in answering your question 5 --

To train loop-1, I randomly initialize most parameters in loop-1, BUT I initialize the classification layer (last layer) in loop-1 using the trained layer of the feed-forward model. We essentially freeze that last layer when training loop-1, loop-2, or more. The reason we do this is that the "shared" last layer in all loops enables us to roll the loops back into a truly recurrent mechanism -- NOT just an unrolled version.

After loop-2/3/4 (we tried different numbers of loops), we fine-tune the whole network using a very small learning rate, but we didn't observe noticeable improvement. We believe this means some other part of the pipeline is worth studying, as we don't want to blame the data size...
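In TF1.x terms, the loop-wise schedule described above might be sketched by selecting trainable variables by scope: the shared classification layer stays frozen while each new loop's in-between layers train, and a final pass fine-tunes everything with a very small learning rate. The scope names and dummy variables below are assumptions for illustration:

```python
import tensorflow as tf  # TF1.x graph API

# Dummy parameters standing in for the real network (scope names assumed).
w_base = tf.get_variable('base/conv/w', shape=[3, 3, 3, 8])
w_loop2 = tf.get_variable('loop2/conv/w', shape=[3, 3, 8, 8])
w_cls = tf.get_variable('loop2/classifier/w', shape=[1, 1, 8, 19])
loss = (tf.reduce_sum(w_base ** 2) + tf.reduce_sum(w_loop2 ** 2)
        + tf.reduce_sum(w_cls ** 2))

# Stage: train loop-2 only -- exclude the base model and the shared classifier.
loop2_vars = [v for v in tf.trainable_variables()
              if v.name.startswith('loop2/')
              and not v.name.startswith('loop2/classifier/')]
train_loop2 = tf.train.MomentumOptimizer(1e-3, 0.9).minimize(
    loss, var_list=loop2_vars)

# Final stage: fine-tune everything, including ALL classification layers,
# with a very small learning rate.
train_all = tf.train.MomentumOptimizer(1e-5, 0.9).minimize(loss)
```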

Hope this helps also.

dyz-zju commented 7 years ago

Hi, Athma. Recently I have also been working on semantic segmentation issues. Could you share the code of your TensorFlow implementation? I would like to know the details of the TensorFlow reimplementation.

Thanks!

myhooo commented 6 years ago

@hondaathma Hello, Athma. I am also wondering whether you have finished the implementation. Could you share your code? Thanks in advance. ^_^

hondaathma commented 6 years ago

@myhooo @dyz-zju I unfortunately cannot share the code, but I can help you try it out easily. First, start from https://github.com/DrSleep/tensorflow-deeplab-resnet . Then you need to change model.py in deeplab_resnet to support depth. For