anilsathyan7 / Portrait-Segmentation

Real-time portrait segmentation for mobile devices

Question about aspect ratio of the input images #33

Closed haowang1013 closed 3 years ago

haowang1013 commented 3 years ago

Hi,

I'm not entirely sure how the MobilenetV3 models handle input images of different aspect ratio. For example, possible inputs can be of 4:3, 16:9, 3:4 or 9:16.

Does the model work equally well under these different aspect ratios? Or should the image be padded to be squared before sent to the inference (so that the actual content is not distorted due to resizing).

Thanks in advance for any clarification.

anilsathyan7 commented 3 years ago

The original model was actually trained on square images (1:1 aspect ratio). All the original images were resized directly to 256x256 before being fed to the model at training time. So during inference we need to resize the images to the same aspect ratio (1:1) as a preprocessing step.

  1. The idea is as follows: we keep the original image with its original aspect ratio, feed a square image to the model, resize the square output mask (result) of the model to match the aspect ratio of the original image, and finally apply the resized mask to the original image. I have used this approach and it seems to work without any issues so far (see the sketch after this list).

  2. Another approach is to keep the max dimension of the original image (say height) at the maximum size (say 256), reduce the smaller dimension based on the aspect ratio, and pad the image as you mentioned. This approach is used in the deeplab segmentation demo.

  3. Lastly, you can also train the model with images in a different aspect ratio (say, resize all to 4:3) so that the model itself accepts input in that particular aspect ratio.
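
A minimal sketch of the first approach, assuming a model with a 256x256 input that produces a single-channel float mask in [0, 1] (`predict_fn` is a hypothetical stand-in for whatever inference call you use):

```python
import cv2
import numpy as np

def segment_keep_aspect(image, predict_fn, model_size=256):
    # Approach 1: the model sees a distorted square image, but the
    # original image is never resized; only the mask is mapped back.
    h, w = image.shape[:2]
    square = cv2.resize(image, (model_size, model_size))
    mask = predict_fn(square)        # float mask, shape (model_size, model_size)
    mask = cv2.resize(mask, (w, h))  # back to the original aspect ratio
    # Apply the mask to the original (undistorted) image.
    return (image * mask[..., np.newaxis]).astype(image.dtype)
```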

I have not tried and compared all the approaches, anyway. If during the training process we are resizing images of different aspect ratios to a fixed size (along with the masks), then essentially we are teaching the model to predict output for 'the same kind of resized images'. So I guess it should work well with all kinds of images. On the other hand, the resized mask edges would appear smoother if the original aspect ratios are the same (or else use blurring/antialiasing etc.).

haowang1013 commented 3 years ago

Thanks for the clarification, that clears up a few things for me :)

Can you tell me what the aspect ratio of the images used for the pre-trained model is?

anilsathyan7 commented 3 years ago

Which pretrained model? If it was trained on the aisegment dataset, then the original images have resolution 600x800 (WxH).

haowang1013 commented 3 years ago

Right now I'm mainly testing mnv3_seg_float.tflite, I assume it was trained with aisegment?

On a side note, we found that mnv3_seg_quant actually runs slower compared to mnv3_seg_float, with the GPU/Metal delegate on both Android and iOS, which is interesting. Are you seeing the same result?

anilsathyan7 commented 3 years ago

It depends on the CPU and GPU (h/w in general), I guess. It should perform better with int accelerators like DSP/NPU, when all the layers are supported. I had a similar experience during tflite benchmarking:

On a POCO X3 Android phone, the float model takes around 17ms on CPU and 9ms on its GPU (>100 FPS), whereas the quantized model takes around 15ms on CPU (2 threads).
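
For anyone who wants to sanity-check latency before going on-device, here is a rough sketch using the TFLite Python interpreter (assumes TF 2.x; the model path is illustrative, and this measures desktop CPU only, not the GPU/NNAPI delegates, for which the official `benchmark_model` tool is better suited):

```python
import time
import numpy as np
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="mnv3_seg_float.tflite", num_threads=2)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
dummy = np.random.rand(*inp["shape"]).astype(np.float32)

# Warm up so one-time allocation cost doesn't skew the numbers.
for _ in range(5):
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()

# Time repeated invocations and report the average.
runs = 50
start = time.perf_counter()
for _ in range(runs):
    interpreter.set_tensor(inp["index"], dummy)
    interpreter.invoke()
print(f"avg latency: {(time.perf_counter() - start) / runs * 1000:.1f} ms")
```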

haowang1013 commented 3 years ago

Cool.

Currently we're doing our own training based on your setup; it looks like it'll take a while until the quality is acceptable. BTW, the quality of your pre-trained model is awesome :)

In the meantime, we want to do some experiments with CoreML on iOS. Any chance you could share the original model for mnv3_seg_float.tflite so that we can do the CoreML conversion for testing?

anilsathyan7 commented 3 years ago

What kind of dataset are you using? Is it for single images or real-time videos? I don't have expertise in iOS/CoreML ... Maybe you can refer to this site: https://machinethink.net/

haowang1013 commented 3 years ago

> What kind of dataset are you using? Is it for single images or real-time videos? I don't have expertise in iOS/CoreML ... Maybe you can refer to this site: https://machinethink.net/

Currently we're training with aisegment plus some custom dataset.

I was thinking perhaps you could share the original h5 model for mnv3_seg_float.tflite, from which a CoreML model could be converted, so we can test how it performs on iOS?
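
For reference, the Keras-to-CoreML step could look roughly like this with coremltools' unified converter (a sketch only: the h5 path, input shape, and scale are assumptions, not values from this repo):

```python
import coremltools as ct
import tensorflow as tf

# Load the trained Keras model (hypothetical path).
model = tf.keras.models.load_model("mnv3_seg_float.h5")

# Convert via coremltools; declaring an ImageType input lets CoreML
# feed camera frames directly and handle the 1/255 normalization.
mlmodel = ct.convert(
    model,
    inputs=[ct.ImageType(shape=(1, 256, 256, 3), scale=1 / 255.0)],
)
mlmodel.save("mnv3_seg.mlmodel")
```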

anilsathyan7 commented 3 years ago

Actually, I did not save the h5 models while I was experimenting with the mnv3 architecture. You can train your own models in Google Colab using the ipynb notebooks and the aisegment dataset. It would only take 3-4 hours ...

haowang1013 commented 3 years ago

That's cool, we'll do that and let you know how it works on CoreML.

Again, thanks for all the info!

haowang1013 commented 3 years ago

@anilsathyan7 , hope I can bother you with another question.

I noticed that in the DataLoader class used for the MobileNetV3 training, some places use the saved seed when calling tf.random.uniform, and others don't. Is there any particular reason for the two scenarios?

I'm looking into speeding up the training by enabling caching in the data input pipeline (10x - 20x speed up), basically doing this in the DataLoader class:

```python
data = data.cache()

if shuffle:
    # Shuffle, repeat, batch and prefetch
    data = data.shuffle(1000).repeat().batch(batch_size).prefetch(prefetch)
else:
    # Batch and prefetch
    data = data.repeat().batch(batch_size).prefetch(prefetch)
```

This resulted in a less accurate model after the same number of epochs. I think it's because some of the randomization only runs once during the input augmentation process (the cache stores the already-augmented samples), so I want to understand why you chose to use the saved seed in some places and not in others.

Thanks

anilsathyan7 commented 3 years ago

Yes, it's just to make sure we apply the same preprocessing steps to the image and the mask for those operations. In the other cases, the mask stays the same even if the image changes (e.g. brightness).
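
A minimal sketch of that distinction, assuming HxWxC image/mask tensors (function names mirror the DataLoader style but the bodies here are illustrative): geometric transforms must hit both tensors identically, while photometric transforms touch only the image:

```python
import tensorflow as tf

def _flip_left_right(image, mask):
    # Geometric op: draw the random decision once and apply it to
    # BOTH tensors, so the image and its mask stay aligned.
    do_flip = tf.random.uniform([], seed=42) > 0.5
    image = tf.cond(do_flip, lambda: tf.image.flip_left_right(image), lambda: image)
    mask = tf.cond(do_flip, lambda: tf.image.flip_left_right(mask), lambda: mask)
    return image, mask

def _corrupt_brightness(image, mask):
    # Photometric op: only the image changes; the mask needs no
    # synchronization, so no shared seed is required.
    image = tf.image.random_brightness(image, max_delta=0.1)
    return image, mask
```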

I faced some weird behaviour with caching when using older versions of tensorflow. Please refer to the latest tensorflow documentation on improving the performance of data pipelines.

haowang1013 commented 3 years ago

I did a simple change by re-arranging the order of the input augmentation in the DataLoader class.

Previously the order was _corrupt_brightness -> _corrupt_contrast -> _corrupt_saturation -> _flip_left_right -> _crop_random -> _resize_data.

The first 3 steps are the most expensive ones, and since I'm training with 600x800 images, they become the main bottleneck in the pipeline.

Now I'm doing it in this order: _crop_random -> _resize_data -> _corrupt_brightness -> _corrupt_contrast -> _corrupt_saturation -> _flip_left_right.

And it only takes 1/5 of the original epoch time.

I think the latter version should be as effective as the original one in terms of augmenting the input. What do you think?
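
For concreteness, the reordered mapping might look like this in the tf.data pipeline (the map function names come from the thread above; the `num_parallel_calls` usage is an assumption, not the repo's exact code):

```python
# Geometric ops first on the full-resolution pair, then the expensive
# photometric ops on the already-downsized image.
data = (data
        .map(_crop_random, num_parallel_calls=tf.data.experimental.AUTOTUNE)
        .map(_resize_data, num_parallel_calls=tf.data.experimental.AUTOTUNE)
        .map(_corrupt_brightness, num_parallel_calls=tf.data.experimental.AUTOTUNE)
        .map(_corrupt_contrast, num_parallel_calls=tf.data.experimental.AUTOTUNE)
        .map(_corrupt_saturation, num_parallel_calls=tf.data.experimental.AUTOTUNE)
        .map(_flip_left_right, num_parallel_calls=tf.data.experimental.AUTOTUNE))
```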

anilsathyan7 commented 3 years ago

It should be almost equivalent. Maybe you can compare both approaches and see the effect on accuracy. Another approach would be to resize the images beforehand and use tfrecords for the dataset. There may be some tradeoffs in each case, and you will have to try each one and choose the best option as per your requirements.

haowang1013 commented 3 years ago

Hi @anilsathyan7, I have some questions about the details of your mobilenetv3 setup, wondering if you could shed some light.

if (finetune):
    for layer in mnv3.layers[:-3]:
        layer.trainable = False

During the initial training, all but the last 3 layers are frozen. Are you doing this because the training uses the imagenet pre-trained model? Why are the last 3 layers not frozen?

# Decoder
x = mnv3.layers[-4].output

x = bottleneck(x, 288)
# x = Conv2DTranspose(filters=288, kernel_size=3, strides=2, padding = 'same', use_bias=False)(x)
x = UpSampling2D(size=(2, 2), interpolation='bilinear')(x)
# concatenate([x, mnv3.layers[71].output], axis = 3) # 75 l
x = Add()([x, mnv3.layers[74].output])

x = bottleneck(x, 96)
# x = Conv2DTranspose(filters=96, kernel_size=3, strides=2, padding = 'same', use_bias=False)(x)
x = UpSampling2D(size=(2, 2), interpolation='bilinear')(x)
x = Add()([x, mnv3.layers[30].output])  # 32

x = bottleneck(x, 72)
# x = Conv2DTranspose(filters=72, kernel_size=3, strides=2, padding = 'same', use_bias=False)(x)
x = UpSampling2D(size=(2, 2), interpolation='bilinear')(x)
x = Add()([x, mnv3.layers[12].output])  # 13

x = bottleneck(x, 16)
# x = Conv2DTranspose(filters=72, kernel_size=3, strides=2, padding = 'same', use_bias=False)(x)
x = UpSampling2D(size=(2, 2), interpolation='bilinear')(x)

# x = Conv2DTranspose(filters=8, kernel_size=3, strides=2, padding='same', use_bias=False)(x)
x = UpSampling2D(size=(2, 2), interpolation='bilinear')(x)
x = Conv2D(2, (1, 1), padding='same')(x)

This is the decoder setup of your implementation. What's the benefit of using this, as opposed to using the original decoder from MobileNetV3Small?

Thanks

anilsathyan7 commented 3 years ago

Some layers at the end are not used in the architecture. Please refer to topics like fine-tuning, layer freezing, etc. in deep learning for more information.
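
For context, a generic sketch of the transfer-learning pattern the frozen-layers snippet above follows (not the exact training script; `backbone`, `model`, and `train_data` are placeholders): freeze the ImageNet-pretrained backbone first so only the freshly initialized decoder learns, then optionally unfreeze everything at a much lower learning rate:

```python
import tensorflow as tf

# Phase 1: freeze the pretrained backbone, train only the new decoder.
for layer in backbone.layers:
    layer.trainable = False
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy")
model.fit(train_data, epochs=10)

# Phase 2 (optional fine-tuning): unfreeze and continue with a lower
# learning rate so the pretrained weights aren't destroyed.
for layer in backbone.layers:
    layer.trainable = True
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy")
model.fit(train_data, epochs=10)
```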