appliedinnovation / fast-depth

ICRA 2019 "FastDepth: Fast Monocular Depth Estimation on Embedded Systems"
MIT License

Training stagnates and cannot predict fine-scale features #2

Open alexbarnett12 opened 3 years ago

alexbarnett12 commented 3 years ago

After over a month of training experiments, moderately good results have been achieved. Below are some examples of good predictions (attached images: decent_results_train_image_0, decent_results_val_image_0, decent_results_val_image_1).

However, across all images the model is unable to predict fine-scale features, particularly in the depth range of 0 to 1.0 meters. This is a problem, since fine-scale features are the most important data to predict for obstacle navigation. For example, in the image below, the model predicts nearly a single value across all pixels and still achieves a low RMSE, since the per-pixel error stays small (attached image: decent_results_train_image_1).
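To make that RMSE insensitivity concrete, here is a small numerical sketch (the depth range and values are illustrative, not taken from our data) showing that a constant prediction over a narrow close-range scene already scores a low RMSE:

```python
import numpy as np

# Illustrative only: ground-truth depths clustered in the 0.5-1.0 m range,
# as in the close-range scenes above (values are made up for this example).
gt = np.random.uniform(0.5, 1.0, size=(224, 224))

# A "lazy" prediction: a single constant value for every pixel.
pred = np.full_like(gt, gt.mean())

rmse = np.sqrt(np.mean((pred - gt) ** 2))
print(f"RMSE of a constant prediction: {rmse:.3f} m")  # ~0.14 m, already "low"
```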

As a baseline, I trained on the NYU Depth dataset to try to replicate the results in the FastDepth paper. Quantitatively, the results were very good, with a test RMSE of 0.75 and delta1 of 0.7. However, the model still couldn't predict fine-scale objects (attached image: nyu_train_image_0).

My intuition from these results is that more simulation data by itself won't solve the problem. I see a few options going forward:

My first steps going forward will be to increase our training dataset and try some different loss functions. If that leads nowhere, then I will revisit some of the paths above.

finger563 commented 3 years ago

Thanks for the detailed writeup @alexbarnett12 👍

finger563 commented 3 years ago

Based on our discussion today:

These two options (which should be roughly equivalent, though perhaps the first is better numerically) should help us considerably with close depth values.

However, it may run into issues, since inverse depths larger than 1 meter map to values less than 1 and therefore contribute very little to the loss. We may not want to use the inverse strictly as 1 / x, but instead apply some scalar so that it is more like 10 / x, where 0-10 m is the depth range we are primarily interested in.
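A minimal sketch of how such a scaled inverse-depth loss could look in PyTorch; the scale factor, epsilon guard, and L1 base are assumptions for illustration, not something taken from the repo:

```python
import torch
import torch.nn.functional as F

def scaled_inverse_depth_l1(pred, target, scale=10.0, eps=1e-3):
    """L1 loss computed in scaled inverse-depth space.

    pred, target: depth maps in meters, shape (N, 1, H, W).
    scale:        keeps the 0-10 m range of interest above 1 in inverse space,
                  so close-range errors are weighted more heavily.
    eps:          guards against division by zero at (near-)zero depth.
    """
    inv_pred = scale / pred.clamp(min=eps)
    inv_target = scale / target.clamp(min=eps)
    return F.l1_loss(inv_pred, inv_target)

# Example: a 0.1 m error at 0.5 m depth costs far more than the same error at 5 m.
t = torch.tensor([[[[0.5, 5.0]]]])
p = torch.tensor([[[[0.6, 5.1]]]])
print(scaled_inverse_depth_l1(p, t))  # dominated by the close-range pixel
```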

alexbarnett12 commented 3 years ago

Another idea would be to add a component to the loss function based on the delta1 metric, i.e. the percentage of pixels whose prediction is within a factor of 1.25 of ground truth. It's a good measure of whether the error is spread across the entire image (as it currently is) or concentrated in a few outlier pixels (not ideal, but better than the alternative). I don't know if this would work as a loss function by itself, but it could potentially be computed in conjunction with L1.
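A rough sketch of what that might look like: delta1 itself is a hard count and not differentiable, so the loss term below uses a smooth surrogate (a sigmoid on the absolute log depth ratio) added to L1. The threshold, weight, sharpness, and surrogate choice are all assumptions for illustration, not the repo's method:

```python
import torch
import torch.nn.functional as F

def delta1_metric(pred, target, ratio=1.25, eps=1e-3):
    """Standard delta1: fraction of pixels with max(pred/gt, gt/pred) < ratio."""
    pred = pred.clamp(min=eps)
    target = target.clamp(min=eps)
    r = torch.max(pred / target, target / pred)
    return (r < ratio).float().mean()

def l1_plus_soft_delta1(pred, target, ratio=1.25, weight=0.5, sharpness=10.0, eps=1e-3):
    """L1 loss plus a smooth penalty on pixels falling outside the delta1 ratio.

    The sigmoid term approaches 1 for pixels whose absolute log depth ratio
    exceeds log(ratio), approximating "1 - delta1" in a differentiable way.
    """
    pred_c = pred.clamp(min=eps)
    target_c = target.clamp(min=eps)
    log_ratio = (torch.log(pred_c) - torch.log(target_c)).abs()
    soft_outliers = torch.sigmoid(
        sharpness * (log_ratio - torch.log(torch.tensor(ratio)))
    ).mean()
    return F.l1_loss(pred, target) + weight * soft_outliers
```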