RaymondWang987 / NVDS

ICCV 2023 "Neural Video Depth Stabilizer" (NVDS) & TPAMI 2024 "NVDS+: Towards Efficient and Versatile Neural Stabilizer for Video Depth Estimation" (NVDS+)
MIT License

Finetune on real direct depth dataset and find artifacts #14

Closed: wzfsjtu closed this issue 1 year ago

wzfsjtu commented 1 year ago

Thanks for your great work. I tried to finetune (or retrain) NVDS on other datasets such as TartanAir and MIPI. I used the direct real depth as the input and found that the test results have many artifacts, especially on the ground. I also used a predefined max depth (e.g., 20 meters) to normalize the inputs, but it did not help. Do the inputs to NVDS have to be relative inverse depths, or are there tricks or details I need to pay attention to? Thanks, and I look forward to your reply!

RaymondWang987 commented 1 year ago

Thanks for your attention to our work.

Please note that the input of NVDS is relative inverse depth, i.e., relative disparity (concatenated with the RGB frames). Normalization is required to handle the varied ranges of predictions from different depth predictors. If you want to use NVDS or finetune it from our checkpoint, the inputs have to be relative inverse depth, consistent with our training setting. You can simply convert real depth to relative inverse depth (relative disparity) by taking the reciprocal and then normalizing. If you want to feed real depth directly into NVDS, you need to train it from scratch without our checkpoint, since the official checkpoint is trained with relative inverse depth as input.
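For concreteness, here is a minimal sketch of that conversion in PyTorch. The per-frame min-max normalization, validity mask, and epsilon guard are my own assumptions for illustration, not the exact NVDS preprocessing:

```python
import torch

def depth_to_relative_disparity(depth: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Convert a metric depth map (H, W) to relative inverse depth in [0, 1].

    Assumes valid depths are > 0 and that at least one pixel is valid;
    invalid pixels are left at 0.
    """
    disp = torch.zeros_like(depth)
    valid = depth > eps
    disp[valid] = 1.0 / depth[valid]                                # inverse depth (disparity up to scale)
    d_min, d_max = disp[valid].min(), disp[valid].max()
    disp[valid] = (disp[valid] - d_min) / (d_max - d_min + eps)     # per-frame min-max normalization
    return disp
```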

We conducted experiments on the TartanAir and IRS datasets (both are CG-rendered data), and NVDS did not produce any artifacts. You should modify the input and carefully check your training scripts to improve the results.

wzfsjtu commented 1 year ago

Thanks for your reply. I normalized 4 consecutive depth frames as inputs. How should I handle the output for loss calculation? I tried two methods: (1) re-normalize the output with the min and max of the 4 input frames; (2) compute the scale and shift between the output and the target frame of the 4 input frames using the function compute_scale_and_shift, and apply scale * output + shift. After that I used the Charbonnier loss between the output and the ground truth, but got poor results.

Another confusion is the dataset split. Should I split the train/val/test sets by videos (different videos in different sets) or by frame sequences (sequences from the same video may end up in the same set)? Would the choice influence the results or generalization? I split MIPI and TartanAir by frame sequences, and the model performed poorly on real scenes.

RaymondWang987 commented 1 year ago

For the normalization of the input disparity, please refer to our inference code; the normalization is the same for training and inference.

For the loss functions, you can refer to our paper. During training, we utilized the affinity-invariant loss as in MiDaS, along with the temporal loss. The affinity-invariant loss effectively handles the varied scale and shift of model predictions across multi-scene datasets. You can find the affinity-invariant loss with scale and shift alignment in the MiDaS GitHub repo or its issues.
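For reference, a minimal sketch of the MiDaS-style scale-and-shift alignment and an affinity-invariant residual (a per-image least-squares fit; this is a generic reimplementation with a plain masked L1 residual, not the exact NVDS/MiDaS training code, which also includes terms such as gradient matching):

```python
import torch

def compute_scale_and_shift(prediction, target, mask):
    """Least-squares scale s and shift t minimizing ||mask * (s * prediction + t - target)||^2.

    prediction, target: (B, H, W); mask: float tensor of 0/1 with the same shape.
    """
    a_00 = torch.sum(mask * prediction * prediction, (1, 2))
    a_01 = torch.sum(mask * prediction, (1, 2))
    a_11 = torch.sum(mask, (1, 2))
    b_0 = torch.sum(mask * prediction * target, (1, 2))
    b_1 = torch.sum(mask * target, (1, 2))

    det = a_00 * a_11 - a_01 * a_01
    valid = det > 0
    scale = torch.zeros_like(b_0)
    shift = torch.zeros_like(b_1)
    scale[valid] = (a_11[valid] * b_0[valid] - a_01[valid] * b_1[valid]) / det[valid]
    shift[valid] = (-a_01[valid] * b_0[valid] + a_00[valid] * b_1[valid]) / det[valid]
    return scale, shift

def affinity_invariant_loss(prediction, target, mask):
    """Align prediction to target per image, then compute a masked L1 residual."""
    scale, shift = compute_scale_and_shift(prediction, target, mask)
    aligned = scale.view(-1, 1, 1) * prediction + shift.view(-1, 1, 1)
    return torch.sum(mask * torch.abs(aligned - target)) / (torch.sum(mask) + 1e-6)
```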

Data splitting should be based on videos (different videos in different sets). For example, some videos are used for training, while the remaining ones are allocated for testing. Our data partitioning on NYUDV2 follows previous methods such as ST-CLSTM.
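To illustrate a video-level split, a small sketch under an assumed (hypothetical) directory layout where each video has its own folder of frame sequences; all sequences of a video stay in one split:

```python
import random
from pathlib import Path

def split_by_video(root: str, train_ratio: float = 0.8, seed: int = 0):
    """Assign whole videos (not individual frame sequences) to train or test."""
    videos = sorted(p.name for p in Path(root).iterdir() if p.is_dir())
    random.Random(seed).shuffle(videos)
    n_train = int(len(videos) * train_ratio)
    return videos[:n_train], videos[n_train:]
```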

It is not the model that determines generalization, but rather the training data that has the greatest impact. For instance, our model trained on the VDW dataset exhibits strong generalization because VDW contains large-scale and diverse scenes. We have conducted zero-shot testing on many high-quality natural-scene datasets, including videos captured with our own smartphones, and NVDS achieved satisfactory results.

If you are getting bad results on TartanAir: (1) There may be bugs in your current training code, considering that our results on TartanAir and IRS are normal and satisfactory; I suspect your training code is not entirely correct yet. (2) As for generalization, TartanAir is a synthetic dataset with limited data volume and scene diversity, which differs from real videos. For instance, TartanAir lacks moving subjects such as people or vehicles, and its frame intervals are much larger than those of regular real videos. Therefore, models trained on TartanAir may not generalize well to real scenes (see the experiments in our paper), which is attributable to the training data rather than the models.

If you require our training code, you can send me an email and specify your purpose. I can provide you with a raw version of the training code for NYUDV2. Many people have previously requested the training code by email, and I have shared a raw version with them. Since I have not yet organized the parameter interfaces for release, I do not currently plan to publish the training code in our GitHub repo.

Additionally, if you need access to our dataset, you can apply through the dataset's official email.