FilippoAleotti / mobilePydnet

Pydnet on mobile devices
Apache License 2.0
253 stars · 40 forks

Question: choice of input resolution #12

Closed dfrumkin closed 4 years ago

dfrumkin commented 4 years ago

I am wondering how you arrived at the input resolution of 640x384, which is a 5:3 aspect ratio. I saw that other papers work with flexible aspect ratios; for example, in MiDaS the longer side is 384 and the other side is divisible by 32. Sometimes it's a square, or at least the model is trained on squares. What were your considerations for this choice? What if my image is vertical, e.g. 9:16? Do you think the result will be adversely affected by the resizing?

FilippoAleotti commented 4 years ago

That is a good point. Actually, I trained at 640x320 (I used 640x384 on devices just to be a little closer to typical mobile aspect ratios). If I remember correctly, MiDaS was trained at 384x384, but I noticed that for images with a very different aspect ratio the results seem better when changing the prediction shape than when preserving the training one (e.g., I obtained better results on KITTI using 1024x320 than using 384x384). In my experiments I resized images to 640x320 before applying MiDaS at full image resolution (i.e., 640x320, because I was using that shape for other experiments), discarding images with height >= width, but you can apply even stricter constraints if you want to preserve the aspect ratio.
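For readers wondering how to pick a prediction shape for an arbitrary aspect ratio, the MiDaS-style rule mentioned above can be sketched as: scale so the longer side matches the training side, then snap both sides to a multiple of 32. This is a hypothetical helper for illustration, not part of the Pydnet code:

```python
def snap_size(w, h, long_side=384, multiple=32):
    """Scale (w, h) so the longer side equals `long_side`, then round
    both sides to the nearest multiple of `multiple`.
    Hypothetical helper, not from this repo."""
    scale = long_side / max(w, h)

    def snap(x):
        # round to the nearest multiple, but never below one multiple
        return max(multiple, int(round(x * scale / multiple)) * multiple)

    return snap(w), snap(h)

print(snap_size(1920, 1080))  # landscape 16:9 -> (384, 224)
print(snap_size(1080, 1920))  # portrait 9:16 -> (224, 384)
```

A vertical 9:16 image would thus get a 224x384 input instead of being squashed into a landscape shape.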

dfrumkin commented 4 years ago

Thank you for your answer, Filippo! I believe it's possible to create an mlmodel that supports multiple input sizes. Then you could avoid distorting images too much by using square crops for training and proportionate resizing during inference, just as MiDaS did. Wouldn't that be better?

FilippoAleotti commented 4 years ago

Yes, you can try! In my experience, training monocular networks with crops reduces the image context and makes training more difficult, but I haven't tried it in this particular case yet, so it may be beneficial.

dfrumkin commented 4 years ago

Just to clarify. By "training monocular networks with crops", do you mean:

1. cropping the image first, so that both the teacher and the student see only the crop, or
2. running the teacher on the full image and cropping only the student's input?

I am a bit surprised that crops would matter in the first case, because the initial crops are rather arbitrary (or are they not, and are somehow better for segmentation, i.e. contain whole objects, and depth is related to segmentation)?

FilippoAleotti commented 4 years ago

I was thinking about the second one, but actually, in my opinion, the problem is the same. Consider for instance a large image (say 1920x1080): if you take a smaller crop (e.g. 384x384), you reduce the context of the image (e.g. you may end up with a texture-less region, such as a wall, that was only a small part of the larger picture). In the first case this is harder for the teacher too, while in the second case only for the student, because the teacher would see the full image. The second approach may work with absolute depths, but with relative depths you have to pay attention, because the two networks will see different things (the teacher the full image, the student just the crop), so their predictions will differ (I guess it would be better to rescale the teacher's predictions properly). Of course it depends on the crop size, and it would be interesting to see the difference from my approach! Suggestions are welcome!
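For relative-depth predictions, "rescaling the teacher's predictions properly" is commonly done with a least-squares scale-and-shift alignment between the teacher's prediction (restricted to the crop) and the student's prediction. The sketch below shows that standard alignment, not code from this repo:

```python
def scale_shift_align(pred, target):
    """Closed-form least-squares s, t minimizing sum((s * p + t - y)^2),
    the usual alignment for scale-and-shift-invariant (relative) depth.
    `pred` and `target` are flat lists of depth values over the same crop.
    Sketch of the standard technique, not the code used in this repo."""
    n = len(pred)
    sp, st = sum(pred), sum(target)
    spp = sum(p * p for p in pred)
    spt = sum(p * y for p, y in zip(pred, target))
    denom = n * spp - sp * sp  # zero only if `pred` is constant
    s = (n * spt - sp * st) / denom
    t = (st - s * sp) / n
    return s, t

# e.g. the student's crop prediction is a scaled/shifted version of the
# teacher's full-image prediction over that region:
s, t = scale_shift_align([1.0, 2.0, 3.0], [3.0, 5.0, 7.0])
print(s, t)  # 2.0 1.0
```

After alignment, a distillation loss can compare `s * teacher + t` against the student on the shared crop.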

dfrumkin commented 4 years ago

But then you could crop to 1080x1080 and resize that to 384x384. To quote the MiDaS paper:

Images are flipped horizontally with a 50% chance, and randomly cropped and resized to 384×384 to augment the data and maintain the aspect ratio across different input images.

The random part is about augmentation, i.e. instead of taking the central crop they may do something else, but the cropped square is (almost) the size of the original image. Something along the lines of the "standard" approach: https://pytorch.org/tutorials/beginner/transfer_learning_tutorial.html#load-data

Then both the teacher and the student work with the same images which contain most of the information in the original images.
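The recipe quoted from the MiDaS paper can be sketched as: take the largest square that fits in the frame at a random offset, flip it horizontally with 50% probability, then resize to 384x384. This is a pure-Python illustration of that recipe, not MiDaS's actual augmentation code:

```python
import random

def midas_style_crop(w, h, rng):
    """Return a square crop box (left, top, right, bottom) covering the
    largest square that fits in a w x h image, at a random offset, plus a
    50% horizontal-flip flag. The crop would then be resized to 384x384.
    Illustrative sketch, not MiDaS's actual code."""
    side = min(w, h)              # e.g. 1080 for a 1920x1080 frame
    left = rng.randint(0, w - side)
    top = rng.randint(0, h - side)
    flip = rng.random() < 0.5     # flip horizontally with 50% chance
    return (left, top, left + side, top + side), flip

rng = random.Random(0)
box, flip = midas_style_crop(1920, 1080, rng)
print(box, flip)
```

Because the square covers almost the whole frame, both teacher and student keep most of the original context, which addresses the concern about small crops above.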

FilippoAleotti commented 4 years ago

In my opinion resizing should affect the teacher more than the student, because the student "is used to seeing" resized images, so at test time it should be fine; yet the teacher's predictions seem good even with resizing. However, I agree that your approach may help, so I will have to try it and see the impact of this decision. Thank you for your suggestions!

dfrumkin commented 4 years ago

Thank you for your answers, Filippo!