The original scaling from full resolution to input resolution is done in:
which uses two scales. The line you mentioned scales from input resolution to the intermediate inverse-depth resolutions, and those are smaller by factors of 2 in both dimensions, so using a single scale is correct. At least that's my understanding; does that make sense?
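For anyone following along, here is a minimal sketch of how pinhole intrinsics rescale (`scale_intrinsics` is a hypothetical helper for illustration, not the actual `Camera.scaled` implementation). Two scales handle non-uniform resizing; a single scale only matches uniform resizing, like the factor-of-2 pyramid levels above:

```python
import numpy as np

def scale_intrinsics(K, x_scale, y_scale=None):
    """Hypothetical helper: rescale a 3x3 pinhole intrinsic matrix.

    A single scale is only valid when the image is resized uniformly,
    e.g. the factor-of-2 inverse-depth pyramid levels mentioned above.
    """
    if y_scale is None:
        y_scale = x_scale
    K = K.copy()
    K[0, 0] *= x_scale  # fx
    K[0, 2] *= x_scale  # cx (ignoring half-pixel offset conventions)
    K[1, 1] *= y_scale  # fy
    K[1, 2] *= y_scale  # cy
    return K
```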
In any case, I really appreciate that you are going this deep into our codebase. Please let me know if you find any other suspicious things, or if there are any improvements we could try to make our numbers even better!
Ah, thanks, that makes sense. I am just really trying to get the self-supervision for nuScenes to work. But I am increasingly out of ideas.
Have you tried starting from a pretrained model from another dataset? That might give a better starting point for the depth features, so they don't diverge.
Yes, and my semi-supervised learning already works okayish, as can be seen in the image. Even ground-plane removal works okay. But when I take that semi-supervised, fine-tuned model and start self-supervised training, it diverges immediately (4 GPUs, batch size 8, lr 1e-5). The input images are all correct. Thus, I am currently checking the camera intrinsic matrices as the last point of failure I could identify.
Another thing I will try is to visualize the loss masks and maybe even the warping results themselves, to see whether that is working as expected.
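A minimal sketch of such a debug dump, assuming the loss exposes per-batch tensors `warped`, `mask`, and `loss_map` (all names hypothetical):

```python
import os
import torchvision.utils as vutils

def dump_debug_images(warped, mask, loss_map, step):
    """Save intermediate warping/masking results for visual inspection.

    warped:   source image warped into the target view, (B, 3, H, W)
    mask:     binary auto-mask / valid-projection mask,  (B, 1, H, W)
    loss_map: per-pixel photometric loss,                (B, 1, H, W)
    """
    os.makedirs('debug', exist_ok=True)
    vutils.save_image(warped, f'debug/warped_{step}.png')
    vutils.save_image(mask.float(), f'debug/mask_{step}.png')
    # normalize the loss map to [0, 1] so it is visible as an image
    lm = (loss_map - loss_map.min()) / (loss_map.max() - loss_map.min() + 1e-8)
    vutils.save_image(lm, f'debug/loss_{step}.png')
```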
Did you manage to get it working?
Hi @maskedmeerkat, I'm trying to get depth maps on the nuScenes dataset too.
My approach is to use a fully-supervised method (DORN) after generating a dense depth map via depth completion. I wonder if you evaluated the depth prediction performance quantitatively using metrics (RMSE, th_1.25, ...).
Here are the results I got from a front-view image, using sparse GT:

| Method | Abs Rel | Sq Rel | RMSE | RMSE_log | th_1.25 | th_1.25² | th_1.25³ |
|---|---|---|---|---|---|---|---|
| PackNet | 0.187 | 1.852 | 7.636 | 0.289 | 0.742 | – | – |
| DORN | 0.132 | 1.598 | 6.944 | 0.233 | 0.839 | 0.938 | 0.972 |
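For reference, these are the standard Eigen-style depth metrics; a minimal NumPy sketch of how they are computed, assuming sparse GT with zeros at pixels without lidar returns:

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard depth evaluation metrics over valid (gt > 0) pixels."""
    valid = gt > 0                     # sparse GT: skip pixels with no return
    gt, pred = gt[valid], pred[valid]
    thresh = np.maximum(gt / pred, pred / gt)
    return {
        'Abs Rel':  np.mean(np.abs(gt - pred) / gt),
        'Sq Rel':   np.mean((gt - pred) ** 2 / gt),
        'RMSE':     np.sqrt(np.mean((gt - pred) ** 2)),
        'RMSE_log': np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        'th_1.25':  np.mean(thresh < 1.25),
        'th_1.25²': np.mean(thresh < 1.25 ** 2),
        'th_1.25³': np.mean(thresh < 1.25 ** 3),
    }
```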
It would be very helpful if you could share the evaluation results of the semi-supervised method, so I can try the other approach if yours is better.
Thank you!
Sadly, the above result is the best I can achieve. What I tried was
The best Abs Rel I could achieve was 0.149, and you can see the results in my previous comment.
Do you know whether so many pixels should be auto-masked out? Is that to be expected at the beginning of training, with the mask improving over time?
I also tried using the 384x640-resolution pretrained models, since you said something about low quality or noise possibly affecting the training. So I am unsure whether the stretching of the images due to reshaping has some influence on the accuracy...
Soon I am planning to spend some time refactoring the photometric loss to try some new ideas, and then I will be able to introspect that some more. The auto-masking removes pixels whose unwarped photometric loss is smaller than their warped photometric loss, so there might be something wrong with the pose network not learning properly. Can you try turning auto-masking off during training?
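For reference, a minimal sketch of that auto-masking idea (Monodepth2-style, simplified; not the exact code in this repo):

```python
import torch

def automask(warped_loss, unwarped_loss):
    """Keep only pixels where warping with the predicted depth and pose
    actually reduces the photometric error; pixels that look as good
    unwarped (static frames, objects moving with the camera) are dropped.
    """
    mask = (warped_loss < unwarped_loss).detach().float()
    return warped_loss * mask
```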
I also tried that before with no real benefit. And I also believe that the error is more on my side than in your implementation.
Hmm, could you give me a hint on how to verify that my camera intrinsics are provided in a way that fits your framework?
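One generic sanity check (sketched below with hypothetical inputs) is to project 3D points that are already in camera coordinates, e.g. the lidar returns behind the sparse GT, and overlay them on the image; after any resize, K has to be rescaled with the same x/y factors as the image:

```python
import numpy as np

def project_points(K, points_cam):
    """Project Nx3 points (camera coordinates, z > 0) with intrinsics K."""
    uvw = K @ points_cam.T           # (3, N) homogeneous pixel coordinates
    uv = (uvw[:2] / uvw[2]).T        # (N, 2) after perspective division
    return uv

# If K is correct, the projected points should line up with the scene
# structure; systematic offsets or a wrong aspect ratio point to a bad K.
```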
Hmm, anyway, I won't be able to work on improving depth estimation at the current stage of my project. Maybe, in case I find some more time towards the end (highly unlikely XD), I can try some more things.
Hi Vitor,
in https://github.com/TRI-ML/packnet-sfm/blob/f824ffceba46ae1c621e1bf22a35634d8b39207c/packnet_sfm/losses/multiview_photometric_loss.py#L156-L157 you only provide one scale to the camera's scaling function. Wouldn't this mean that, in case my image isn't scaled equally in the x and y directions, the camera intrinsic matrix is scaled incorrectly, or is this addressed somewhere else?
Thanks for your time and patience in explaining your code to all of us ^^