DrSleep / tensorflow-deeplab-resnet

DeepLab-ResNet rebuilt in TensorFlow
MIT License

Finetuning on new dataset / Modify input images on the fly #12

Closed: mgarbade closed this issue 7 years ago

mgarbade commented 7 years ago

How can I modify input images on the fly? Say I would like to set a certain region of the input images to 0. Where in your code would I need to do the surgery for that?
Rather in the ImageReader function, where the image is loaded?
Or rather in the network graph itself, say by adding a layer after the data layer in DeepLabResNetModel that multiplies elementwise with some mask?

Sorry for bothering you with this stupid question; I'm new to TensorFlow. Also sorry for asking a usage question, but since your code differs quite a lot from the TensorFlow tutorial code, I don't really know where else to turn... Thanks a lot for providing the deeplab-resnet model for TensorFlow!

DrSleep commented 7 years ago

You can modify the image either directly, where the ImageReader methods are defined - https://github.com/DrSleep/tensorflow-deeplab-resnet/blob/master/deeplab_resnet/image_reader.py#L30 - or in the main script (modifying a batch of images), e.g. after this line: https://github.com/DrSleep/tensorflow-deeplab-resnet/blob/master/train.py#L113. In both cases you can simply multiply the image (or a batch of images) by some mask.
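For illustration, a minimal sketch of the second option; the name image_batch follows train.py, while the mask shape and region coordinates are made-up example values:

import numpy as np
import tensorflow as tf

# Zero out a fixed rectangular region of every image in the batch by
# element-wise multiplication with a binary mask (321x321 inputs assumed).
mask = np.ones((1, 321, 321, 1), dtype=np.float32)
mask[:, 100:200, 100:200, :] = 0.0             # example region to blank out
image_batch = image_batch * tf.constant(mask)  # broadcasts over batch and channels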

mgarbade commented 7 years ago

Thanks a lot for your reply! I guess it is working. One more question, if you don't mind: I'd like to finetune the model on a custom dataset, so I need to randomly reinitialize the last layer and reshape it to match the number of classes in the new dataset. It seemed to me that this is already done in your code in these lines:


    # Create network.
    net = DeepLabResNetModel({'data': image_batch})

    # Predictions.
    raw_output = net.layers['fc1_voc12']

    prediction = tf.reshape(raw_output, [-1, n_classes])
    label_proc = prepare_label(label_batch, tf.pack(raw_output.get_shape()[1:3]))
    gt = tf.reshape(label_proc, [-1, n_classes])

But if I simply change n_classes it throws an error:

ValueError: Dimension size must be evenly divisible by 11 but is 141204 for 'Reshape' (op: 'Reshape') with input shapes: [4,41,41,21], [2].

DrSleep commented 7 years ago

You will also need to change the model definition here: https://github.com/DrSleep/tensorflow-deeplab-resnet/blob/master/deeplab_resnet/model.py#L391. In particular, you need to replace 21 with your number of classes in the calls to atrous convolution. (That is what your error is telling you: the network output still has 21 channels, so its 4*41*41*21 = 141204 values cannot be reshaped into rows of 11.) Note also that if you change the number of classes but keep the names of the layers intact, restoring the original model parameters will not be possible, since the shapes of the layers differ. Besides renaming, you can overcome this issue by loading the weights for all layers except the last ones; please refer to another issue for that: #11.
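For reference, each of the four classifier branches in model.py takes the number of output channels as its third argument; a sketch of one branch with a generic n_classes in place of the hard-coded 21 (the other branches use dilations 6, 18 and 24):

(self.feed('res5c_relu')
     .atrous_conv(3, 3, n_classes, 12, padding='SAME', relu=False, name='fc1_voc12_c1'))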

mgarbade commented 7 years ago

How would I have to call the saver object in that case?

Right now it's called like this:

saver.restore(sess, 'ckpt_path/deeplab_resnet.ckpt')

It presumably fails since the saver object tries to restore the weights in ckpt_path/deeplab_resnet.ckpt for the exact network structure they were saved for. In the link you posted, you show how to get a list of the layers that need to be reinitialized:

not_restore = ['fc1_voc12_c0', 'fc1_voc12_c1', 'fc1_voc12_c2', 'fc1_voc12_c3']
restore_var = [v for v in tf.all_variables() if v.name not in not_restore] 

But I'm not sure how to combine this list (restore_var) with the saver.restore method, which loads the weights stored in the checkpoint file.

DrSleep commented 7 years ago

When you initialise an instance of the Saver class, you can pass the var_list argument, which specifies the variables that will be saved and restored. Then you can call the restore method as usual. (All the variables in the restore_var list must be present in the checkpoint file, otherwise it will raise an error; the converse does not need to hold: the checkpoint file can contain other variables that you don't want to restore.)
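A minimal sketch of that pattern, assuming an existing session sess and the checkpoint path used above:

# Restore all variables except the four classifier layers. Note that v.name
# carries a scope and a ':0' suffix (e.g. 'fc1_voc12_c0/weights:0'), so prefix
# matching is needed rather than exact equality.
not_restore = ['fc1_voc12_c0', 'fc1_voc12_c1', 'fc1_voc12_c2', 'fc1_voc12_c3']
restore_var = [v for v in tf.all_variables()
               if not any(v.name.startswith(p) for p in not_restore)]
saver = tf.train.Saver(var_list=restore_var)
saver.restore(sess, 'ckpt_path/deeplab_resnet.ckpt')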

mgarbade commented 7 years ago

Thanks to your help, I modified the initialization part as follows:

trainable = tf.trainable_variables() 
optim = optimiser.minimize(reduced_loss, var_list=trainable)

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
init = tf.initialize_all_variables()

sess.run(init) # All tensors in the graph are initialized with their initial values -> not clear what they are? All 0?

# Restore everything but the last layer
restore_var = [v for v in trainable if not v.name.startswith('fc1_voc12')] # Only excluding ['fc1_voc12_c0', 'fc1_voc12_c1', 'fc1_voc12_c2', 'fc1_voc12_c3'] is apparently not enough here
saver = tf.train.Saver(var_list=restore_var, max_to_keep=40) 
saver.restore(sess, 'ckpt_path/deeplab_resnet.ckpt')

Now the network compiles and starts to train; however, it is not converging (not even on VOC12 itself, without changing the number of classes). The loss decreases a bit in the beginning but then remains at a high level. Looking at the output pictures, one can see that the network first predicts noise and then only predicts the background class for all images (which is probably the dominant class in VOC12).

1. Maybe the last layer is not initialized with random noise?
2. Any idea how to go about the random initialization?
3. Maybe tf.train.AdamOptimizer is the wrong optimizer here? (I remember that in DeepLab-Caffe it was some gradient descent with momentum.)
4. Did you ever successfully finetune deeplab-resnet on a different dataset, or did you just use it for inference so far?

Thanks a lot for your help so far. Again, sorry for annoying you with this problem, but I'm close to giving up :-/

mgarbade commented 7 years ago

In the file network.py there is a function def make_var which is apparently called to create all the variables in the network. I tried to add an initialization parameter there, but no luck so far, unfortunately. Still, it looks like this is the place to look at...

    def make_var(self, name, shape):
        '''Creates a new TensorFlow variable.'''
        # Old version:
        # return tf.get_variable(name, shape, trainable=self.trainable)
        # New version with explicit variable initialization:
        return tf.get_variable(name, shape, trainable=self.trainable,
                               initializer=tf.contrib.layers.xavier_initializer())

mgarbade commented 7 years ago

Ok, I found a way to make the model at least converge. I set the optimizer to update only the last layer:

restore_var = [v for v in trainable if not v.name.startswith('fc1_voc12_c')]
not_restore_var = [v for v in trainable if v.name.startswith('fc1_voc12_c')]

optim = optimiser.minimize(reduced_loss, var_list=not_restore_var)

saver = tf.train.Saver(var_list=restore_var, max_to_keep=40)
if args.restore_from is not None:
    load(saver, sess, args.restore_from)

I'll try to run it with two different optimizers: one with a low learning rate for the earlier layers and one with a higher learning rate for the last layer. Hopefully that solves my problem. I will report back on the outcome...
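A sketch of that two-optimizer idea, emulating Caffe's lr_mult; the variable lists follow the snippet above, base_lr follows the solver settings quoted later in this thread, and the 0.9 momentum is an assumption based on the original Caffe solver:

# One optimizer with a small learning rate for the pretrained body and one
# with a 10x larger rate for the freshly initialized classifier head.
base_lr = 2.5e-4
opt_body = tf.train.MomentumOptimizer(base_lr, 0.9)
opt_head = tf.train.MomentumOptimizer(base_lr * 10, 0.9)

grads = tf.gradients(reduced_loss, restore_var + not_restore_var)
grads_body = grads[:len(restore_var)]
grads_head = grads[len(restore_var):]

train_op = tf.group(
    opt_body.apply_gradients(zip(grads_body, restore_var)),
    opt_head.apply_gradients(zip(grads_head, not_restore_var)))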

DrSleep commented 7 years ago

Hi @mgarbade.

The divergence is happening due to the fact that the batch normalisation layer used in kaffe-tensorflow is only tested for inference. It has been reported before: #5. For now, it is better to use another branch with correct batch normalisation: https://github.com/DrSleep/tensorflow-deeplab-resnet/tree/batch-norm. The script for fine-tuning is also provided there.

Let me know if the problem persists.

mgarbade commented 7 years ago

Well, I'm happy to see that you are evolving your code :-). I'm still struggling to get the same performance on my dataset (CamVid, 11 classes) as with the Caffe model of DeepLab-ResNet. At the moment I'm still 20% off (IoU).
Differences are:

Here are their learning rate parameters:

base_lr: 2.5e-4        # base learning rate
weight_decay: 0.0005   # weights are regularized by adding weight_decay * L2-norm to the loss function, afaik

learning rate scale factor for convolutions (everything but last layer):

lr_mult: 1
decay_mult: 1

learning rate scale factor for the atrous convolutions (classifier layers):

lr_mult: 10      
decay_mult: 1 

--> This is not implemented yet / not sure if it would work: tf.select(condition, TensorA, TensorB) could check, for every entry of the loss, whether it corresponds to an ignore_label value in the ground truth (gt_label) and set it to 0, so that it doesn't contribute to the loss.
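A sketch of that masking idea with the old tf.select API; the names label_batch, loss and ignore_label are assumptions following the earlier snippets:

# Keep only the losses of pixels whose ground-truth label is not ignore_label.
raw_gt = tf.reshape(label_batch, [-1])              # flattened labels
valid = tf.not_equal(tf.cast(raw_gt, tf.int32), ignore_label)
loss = tf.select(valid, loss, tf.zeros_like(loss))  # zero out ignored pixels
reduced_loss = tf.reduce_sum(loss) / tf.reduce_sum(tf.cast(valid, tf.float32))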

Sorry for the long post. I will keep trying to push the fine-tuning performance and will let you know if I make it... Thanks

DrSleep commented 7 years ago

Thank you for your description.

I will look closely at ignore_label during training, and will try to provide a training script that better resembles the original procedure.

mgarbade commented 7 years ago

Thanks a lot for providing the ignore_label feature. I'm still looking forward to closing the 10% performance gap compared to the Caffe version of deeplab-resnet.

I identified some more differences:

I will try to convert the caffemodel. Hopefully that will allow me to close the performance gap...

mgarbade commented 7 years ago

I just saw that the model pretrained on MS-COCO is exactly the same as the one you provided as the init model. I further tried to monitor the development of the network variables by adding tf.summary.histogram loggers to all trainable variables, like this (based on your train.py in the train-orig branch):

    for v in conv_trainable + fc_w_trainable + fc_b_trainable: # Add histogram to all variables
        tf.summary.histogram(v.name.replace(":","_"),v)
    merged_summary_op = tf.summary.merge_all() 
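For completeness, a sketch of how those summaries would be written out during training; logdir, num_steps and optim are assumptions:

# Write the merged summaries periodically so they show up in TensorBoard.
summary_writer = tf.summary.FileWriter(logdir, graph=tf.get_default_graph())
for step in range(num_steps):
    _, summary = sess.run([optim, merged_summary_op])
    summary_writer.add_summary(summary, step)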

It looks like only the last layer is learning something; the earlier layers do not seem to change at all:

(screenshot from 2017-02-08: TensorBoard histograms of the trainable variables over training)

conv1/weights_0 is the first convolution layer. The other layers look the same.

fc1_voc12_c0/biases_0 and fc1_voc12_c0/weights_0 are the convolution weights and biases of the last layer. Here, at least the bias is changing; the weights are again almost unchanged.

This pattern stays the same for more iterations... I will play around with the learning rate, but it seems like the optimization is not working correctly...

mgarbade commented 7 years ago

It might also be that the loss computed in the Caffe version is much higher, since they use an accumulated loss. Apparently they add up the loss over 10 iterations (iter_size = 10) while using a batch_size of 3, and only after that do they perform the backpropagation. So their effective batch size may be 30, which produces a higher loss than the batch_size of 10 used here. This could be the reason why the earlier layers have trouble learning...
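A sketch of Caffe-style iter_size gradient accumulation in TF; optimiser and reduced_loss follow the earlier snippets, the rest is an assumption:

# Accumulate gradients over iter_size mini-batches before one weight update,
# emulating Caffe's iter_size for a larger effective batch size.
iter_size = 10
tvars = tf.trainable_variables()
accum = [tf.Variable(tf.zeros_like(v.initialized_value()), trainable=False)
         for v in tvars]
grads = tf.gradients(reduced_loss, tvars)

zero_op  = [a.assign(tf.zeros_like(a)) for a in accum]
accum_op = [a.assign_add(g / iter_size) for a, g in zip(accum, grads)]
apply_op = optimiser.apply_gradients(zip(accum, tvars))

# Per outer step: run zero_op once, then accum_op iter_size times, then apply_op.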

DrSleep commented 7 years ago

@mgarbade, where is batch_size=3 coming from? In the train.prototxt provided by the authors, it is 1, isn't it?

In the original implementation they also use 4 losses, as I mentioned here, which should improve the gradient flow as well.

Besides tracking the raw variable values, also try to track the ratio between gradient updates and parameter values (in my opinion, it is a better indicator of whether a layer is learning or not). Here is some pseudo code from Karpathy's class on CNNs:

import numpy as np

# assume parameter vector W and its gradient vector dW
param_scale = np.linalg.norm(W.ravel())
update = -learning_rate * dW                 # simple SGD update
update_scale = np.linalg.norm(update.ravel())
W += update                                  # the actual update
print(update_scale / param_scale)            # want ~1e-3

mgarbade commented 7 years ago

You are right. In the original code with the multiscale fusion they have a batch_size of 1. In my version I had removed the multiscale part, so I could use a batch_size of 3. Sorry for the confusion.

I'm not so sure about the individual losses for the different branches. Do they simply add them to the gradients during backpropagation, or how do they combine them?

Good idea with the update_scale / param_scale! I'll check that. By the way: I updated my preprocessing to random cropping and 0-padding (images) / ignore_label-padding (labels). It gave me a huge boost (+10% accuracy) on my other datasets (CamVid and Cityscapes). So although this might not be very important for Pascal VOC12, it apparently is for other datasets.

Here is how I preprocess images at the moment:

def read_images_from_disk(input_queue, img_type, phase, input_size=(321, 321), ignore_label=255):
    img_contents = tf.read_file(input_queue[0])
    label_contents = tf.read_file(input_queue[1])

    if img_type == 1:
        img = tf.image.decode_jpeg(img_contents, channels=3)  # VOC12
    else:
        img = tf.image.decode_png(img_contents, channels=3)   # CamVid

    label = tf.image.decode_png(label_contents, channels=1)

    # Change RGB to BGR
    img_r, img_g, img_b = tf.split(split_dim=2, num_split=3, value=img)
    img = tf.cast(tf.concat(2, [img_b, img_g, img_r]), dtype=tf.float32)

    # Mean subtraction (BGR means; shape [1, 1, 3] broadcasts over height and width)
    IMG_MEAN = tf.constant([104.00698793, 116.66876762, 122.67891434],
                           shape=[1, 1, 3], dtype=tf.float32)
    img = img - IMG_MEAN

    # Optional preprocessing for training phase
    if phase == 'train':
        img, label = preprocess_input_train(img, label, ignore_label)
    elif phase == 'valid':
        # TODO: Perform only a central crop -> size should be the same as during training
        pass
    elif phase == 'test':
        pass

    return img, label

using

def preprocess_input_train(img, label, ignore_label):
    # Random scaling between 0.5 and 1.5
    scale = tf.random_uniform([1], minval=0.5, maxval=1.5, dtype=tf.float32, seed=None)
    h_new = tf.to_int32(tf.mul(tf.to_float(tf.shape(img)[0]), scale))
    w_new = tf.to_int32(tf.mul(tf.to_float(tf.shape(img)[1]), scale))
    new_shape = tf.squeeze(tf.pack([h_new, w_new]), squeeze_dims=[1])
    img = tf.image.resize_images(img, new_shape)
    label = tf.image.resize_nearest_neighbor(tf.expand_dims(label, 0), new_shape)
    label = tf.squeeze(label, squeeze_dims=[0])

    # Random mirroring (same decision applied to image and label)
    random_number = tf.random_uniform([2], 0, 1.0, dtype=tf.float32)
    img = image_mirroring(img, random_number)
    label = image_mirroring(label, random_number)

    # Crop and pad image; ignore_label needs to be subtracted and later added
    # back, because the padding below is a 0-padding
    label = tf.cast(label, dtype=tf.float32)
    label = label - ignore_label
    crop_h, crop_w = [321, 321]
    img_crop, label_crop = random_crop_and_pad_image_and_labels(img, label, crop_h, crop_w)
    label_crop = label_crop + ignore_label
    label_crop = tf.cast(label_crop, dtype=tf.uint8)

    # Set static shape so that tensorflow knows the shape at compile time
    img_crop.set_shape((crop_h, crop_w, 3))
    label_crop.set_shape((crop_h, crop_w, 1))
    return img_crop, label_crop

def image_mirroring(image, random_number):
    # Flip along the width axis (dim 1) with probability 0.5; dims 0 and 2 are
    # never reversed, since 1.0 < 0.5 is always False
    distort_left_right_random = random_number[0]
    mirror = tf.less(tf.pack([1.0, distort_left_right_random, 1.0]), 0.5)
    image = tf.reverse(image, mirror)
    return image

and for cropping with padding

def random_crop_and_pad_image_and_labels(image, labels, crop_h, crop_w):
    # Stack image and label so that a single random crop applies to both
    combined = tf.concat(2, [image, labels])
    image_shape = tf.shape(image)
    combined_pad = tf.image.pad_to_bounding_box(
        combined, 0, 0,
        tf.maximum(crop_h, image_shape[0]),
        tf.maximum(crop_w, image_shape[1]))

    last_image_dim = tf.shape(image)[-1]
    # 4 = 3 image channels + 1 label channel. TODO: Make cropping size a variable
    combined_crop = tf.random_crop(combined_pad, [crop_h, crop_w, 4])

    return (combined_crop[:, :, :last_image_dim],
            combined_crop[:, :, last_image_dim:])

Mind that the padding for the labels has to be done with ignore_label. Since TF only performs 0-padding, I subtract ignore_label from the label before padding and add it back afterwards.

DrSleep commented 7 years ago

> I'm not so sure about the individual losses for the different branches. Do they simply add them to the gradients during backpropagation, or how do they combine them?

Yes, the Caffe mechanism takes care of that and adds all the gradients.
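In TF, the analogue is simply to sum the branch losses into a single scalar before minimising; backpropagation then adds up the gradients automatically. A sketch with assumed loss names:

# Gradients from each branch loss are accumulated automatically by backprop.
total_loss = loss_scale100 + loss_scale075 + loss_scale05 + loss_fused
optim = optimiser.minimize(total_loss, var_list=trainable)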

Nice work with pre-processing! Would be great if you could wrap it up as a PR :)

mgarbade commented 7 years ago

Thanks. I'm a bit busy at the moment, so I just made a dirty PR from the last state of my fork. When I have more time, I will clean up the code and make a better PR. At the very least, the dirty PR contains the random image cropping and padding part in the file image_reader.py (the same functions as the ones I posted above).

petteriTeikari commented 7 years ago

@DrSleep: the batch-norm branch has probably been merged into master, as I did not see any difference between master and batch-norm.

I modified model.py as follows:

n_classes = 6
...
(self.feed('res5b_relu',
           'bn5c_branch2c')
     .add(name='res5c')
     .relu(name='res5c_relu')
     .atrous_conv(3, 3, n_classes, 6, padding='SAME', relu=False, name='fc1_voc12_c0'))

(self.feed('res5c_relu')
     .atrous_conv(3, 3, n_classes, 12, padding='SAME', relu=False, name='fc1_voc12_c1'))

(self.feed('res5c_relu')
     .atrous_conv(3, 3, n_classes, 18, padding='SAME', relu=False, name='fc1_voc12_c2'))

(self.feed('res5c_relu')
     .atrous_conv(3, 3, n_classes, 24, padding='SAME', relu=False, name='fc1_voc12_c3'))

The fine-tuning process starts (though without convergence) with n_classes = 21, but hits the following error when I change it to 6, which is the actual number of classes in my custom dataset:

ValueError: Dimension 0 in both shapes must be equal, but are 6724 and 23534 for 'SoftmaxCrossEntropyWithLogits' (op: 'SoftmaxCrossEntropyWithLogits') with input shapes: [6724,6], [23534,6].

Where is that multiplier of 1120.6667 (6724/6, 23534/21) really coming from? What other layers should I change?

DrSleep commented 7 years ago

You should change n_classes in the training script that you are using as well, and also modify https://github.com/DrSleep/tensorflow-deeplab-resnet/blob/master/deeplab_resnet/utils.py accordingly. (That is where the mismatch comes from: the labels are still one-hot encoded with 21 classes, so the [4, 41, 41, 21] label tensor reshapes into 4*41*41*21 / 6 = 23534 rows of 6, while the [4, 41, 41, 6] logits reshape into 4*41*41 = 6724 rows.)

petteriTeikari commented 7 years ago

Thanks for the help, @DrSleep. Indeed, I did not notice that those were defined there as well.

Now I hit the following, which probably means that the restore part is not functioning properly:

Caused by op u'save_1/Assign_419', defined at:
  File "fine_tune.py", line 201, in <module>
    main()
  File "fine_tune.py", line 179, in main
    loader = tf.train.Saver(var_list=restore_var)
  File "/home/petteri/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/saver.py", line 1056, in __init__
    self.build()

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [6] rhs shape= [21]
     [[Node: save_1/Assign_417 = Assign[T=DT_FLOAT, _class=["loc:@fc1_voc12_c0/biases"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/gpu:0"](fc1_voc12_c0/biases, save_1/RestoreV2_417/_13)]]

I got the fine-tuning to start with the tweaks by @mgarbade, though:

fine_tune_py.txt

DrSleep commented 7 years ago

You are right that the restore part is not functioning properly: the last layers of your network differ from the original ones (6 vs. 21 feature maps), hence the error when restoring. The solution is to restore all the layers but the last ones (fc1): instead of `restore_var = tf.global_variables()`, you should use `restore_var = [v for v in tf.global_variables() if 'fc1' not in v.name]`.

DrSleep commented 7 years ago

https://github.com/DrSleep/tensorflow-deeplab-resnet#using-your-dataset

sunbin1205 commented 6 years ago

@petteriTeikari Have you solved the problem? I have the same problem!

InvalidArgumentError (see above for traceback): Assign requires shapes of both tensors to match. lhs shape= [2] rhs shape= [21] [[Node: save_1/Assign_417 = Assign[T=DT_FLOAT, _class=["loc:@fc1_voc12_c0/biases"], use_locking=true, validate_shape=true, _device="/job:localhost/replica:0/task:0/cpu:0"](fc1_voc12_c0/biases, save_1/RestoreV2_417)]]