ardaduz / deep-video-mvs

Code for "DeepVideoMVS: Multi-View Stereo on Video with Recurrent Spatio-Temporal Fusion" (CVPR 2021)
MIT License

Data preparation #3

Closed phongnhhn92 closed 3 years ago

phongnhhn92 commented 3 years ago

Hi, thanks for your code! I think the results are very good.

I want to test the performance on the ScanNet dataset using the training code. Can you provide a script to prepare the ScanNet dataset for training?

ardaduz commented 3 years ago

Hi, I should definitely mention this in the README. I provide scripts to parse the datasets in the repo under "dataset". Except for the ScanNet export, they might not work out of the box due to the naming and folder conventions used while downloading the datasets. The script for exporting ScanNet .sens files should work with very little effort. Here is the script; it is a slightly modified version of the official export script, so it requires Python 2.

Overall, the difference between the training set structure and the test set structure is that the color image and the depth image for a timestep are packed into one .npz file to save some space. While exporting the test data I set frame_skip=1, and while exporting the training data frame_skip=4, since the amount of data is quite large. When exporting the training data, the script randomly generates a train/validation split of the unique scenes. I also added the splits that were used during the project to the repo: "train.txt", "validation.txt". You can replace the automatically generated ones with these two if you want.
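
For reference, here is a minimal sketch of the per-timestep packing described above. The key names ("color", "depth") and the use of compression are assumptions for illustration; please check the export script in the repo for the actual format.

```python
# Hypothetical sketch of packing one color/depth pair into a single .npz file.
# Key names and compression are assumptions; the repo's export script is authoritative.
import numpy as np
import cv2

def pack_frame(color_path, depth_path, out_path):
    color = cv2.imread(color_path, cv2.IMREAD_COLOR)      # H x W x 3, uint8
    depth = cv2.imread(depth_path, cv2.IMREAD_ANYDEPTH)   # H x W, uint16 (ScanNet stores millimeters)
    np.savez_compressed(out_path, color=color, depth=depth)

def unpack_frame(npz_path):
    data = np.load(npz_path)
    return data["color"], data["depth"]
```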

ardaduz commented 3 years ago

I've updated the fusionnet training script and also added the script for training the pairnet. If you want to train the networks from scratch yourself, please see the detailed explanation of the procedure we follow in the supplementary of the paper.

phongnhhn92 commented 3 years ago

Thanks for your quick reply! I will try to train your model.

phongnhhn92 commented 3 years ago

Just a quick question: after preparing the ScanNet dataset, I can train your fusion model from scratch using the file run-training.py in the dvmvs/fusionnet folder, right?

ardaduz commented 3 years ago

If your goal is to reproduce my training exactly, or at least get close to the results, you need to first train the pairnet from scratch for one epoch. Take the weights only for the feature extractor, feature pyramid and encoder, and place them under dvmvs/fusionnet/weights. Then you can run the training for the fusionnet up to 1000K iterations while warping the hidden state with the ground-truth depth maps. This should already give good results. For a further slight improvement, you need to finetune the cell starting from the best checkpoint while warping the hidden states with the predictions.
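
Copying the three sets of weights could look something like the sketch below; the checkpoint file names are assumptions and should be matched to whatever your pairnet training run actually produced.

```python
# Hypothetical sketch: place only the shared sub-module weights under dvmvs/fusionnet/weights.
import shutil
from pathlib import Path

pairnet_weights = Path("dvmvs/pairnet/weights")
fusionnet_weights = Path("dvmvs/fusionnet/weights")
shared_modules = ("feature_extractor", "feature_pyramid", "encoder")

fusionnet_weights.mkdir(parents=True, exist_ok=True)
for checkpoint in pairnet_weights.iterdir():
    # copy only the checkpoints belonging to the three shared sub-modules
    if any(module in checkpoint.name for module in shared_modules):
        shutil.copy(checkpoint, fusionnet_weights / checkpoint.name)
```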

phongnhhn92 commented 3 years ago

Okay, I see! So basically, there are 3 steps to reproduce your training: 1) Train the pairnet from scratch for one epoch. 2) Finetune the trained feature extractor, feature pyramid and encoder with the fusionnet until the end. 3) Finetune the cell states of the ConvLSTM while warping the hidden states. Is that correct?

ardaduz commented 3 years ago
  1. Train pairnet from scratch for one epoch.
  2. Load the pairnet weights of ONLY the feature extractor, feature pyramid and encoder into the fusionnet, i.e., CLEAR the weights folder of the fusionnet, put only these three there and run the fusionnet training script. The script is already prepared to first train ONLY the LSTM and the decoder, then gradually add the rest. You should expect to run the training for at least several hundred thousand iterations (a week or so depending on the GPU), up to 1000K iterations. Assuming that you're using the splits that I provide, you can expect a validation L1 loss of around ~0.125 meters and a validation L1-inv loss of around ~0.36, possibly better. I can't give you an exact number due to an inherent race condition in the multi-process crawling of the data folders and a random seed I had to introduce due to some infrastructural requirements.
  3. If you further want to improve as we discuss in the paper, load all of the weights from the checkpoint with the best validation score. Freeze all weights except the LSTM. Then adjust the training code so that you .detach() the previous prediction from autograd, and use the previous depth prediction while warping the hidden states, like during testing.
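
An illustrative sketch of the step-3 adjustments is below (not the actual training script). It assumes the LSTM cell's parameter names contain "lstm" and that a hidden-state warping helper already exists in the codebase; adapt the names to the real code.

```python
# Freeze everything except the LSTM cell before the finetuning stage.
def freeze_all_but_lstm(model):
    for name, parameter in model.named_parameters():
        parameter.requires_grad = "lstm" in name  # assumption: LSTM params contain "lstm"

# Inside the training loop, the key change is to detach the previous prediction
# before using it to warp the hidden state, mirroring what happens at test time:
#
#   previous_depth = previous_depth.detach()
#   hidden_state = warp_hidden_state(hidden_state, previous_depth, pose, intrinsics)  # hypothetical helper
#
# The optimizer should then only receive the trainable (LSTM) parameters, e.g.:
#   optimizer = torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```
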
phongnhhn92 commented 3 years ago

Thanks, @ardaduz, I am in the process of the 1st step, training from scratch. Is it possible for you to provide your fully trained model for comparison (at least for the ScanNet dataset)?

ardaduz commented 3 years ago

All models are trained only on the ScanNet dataset; this is the only training set we use. I am not sure what you mean by a fully trained model, because the provided weights are the weights obtained after performing all the steps.

phongnhhn92 commented 3 years ago

@ardaduz I didn't notice that you have updated the trained weights in the repo. Sorry about that.

It seems like it might take a while for my computer to process all the training and testing scenes of ScanNet. In the meantime, I am reading your code to understand more about your model. I have a quick question related to the Cost Volume Construction.

In this subsection, you perform pixel-wise correlation between the reference feature map F and the warped measurement feature map F^m. However, I found it confusing when I look at Figure 2. The output of the feature pyramid network is F with size H/2 x W/2 x CH. Then the output of the Plane Sweep Warp is a volume of size H/2 x W/2 x M. I guess M is the number of depth planes. But how did you get rid of the CH channels?

In most cases with a plane sweep volume, the output volume should be H/2 x W/2 x CH x M. What am I missing here?

phongnhhn92 commented 3 years ago

Also in this section, why is d_near larger than d_far? Do you mean inverse depth (disparity) in this case? If these are disparities, then 0.25 and 0.2 form a very narrow depth range. How does this range work on multiple datasets?

(screenshot from the paper showing the d_near and d_far values)

ardaduz commented 3 years ago

What you point out in your last comment is just an unfortunate typo, thank you for catching it. It should be d_near = 0.25 meters and d_far = 20 meters. The code is correct, of course, and I will correct the paper as well.

About the cost volume calculation: this is actually a research question, and you may have missed the related work discussed in the paper. To repeat the paper a bit, we directly compute a 3D cost volume with a predetermined cost metric based on the dot product < . , . > of the feature vectors (of length CH), more like traditional computer vision approaches. This volume can be processed with 2D convolutions. What you give as an example is building a 4D feature volume without decimating the feature dimension, which in learning-based MVS is usually processed with 3D convolutions; that is computationally demanding and often causes high inference times. Of course, 4D feature volumes have advantages and have been shown to produce highly accurate depth predictions in many cases. However, we aim for a potentially real-time system, so the design choices are made accordingly.
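
To make the shapes concrete, here is a minimal PyTorch sketch of the correlation-based cost volume described above (illustrative only, not the repository's exact implementation):

```python
import torch

def correlation_cost_volume(ref_features, warped_features):
    """
    ref_features:    B x CH x H x W      reference feature map
    warped_features: B x M x CH x H x W  measurement features warped to the
                                         reference view at M depth hypotheses
    returns:         B x M x H x W       one correlation score per pixel per plane
    """
    ch = ref_features.shape[1]
    # dot product over the channel dimension, averaged by CH,
    # collapsing the feature dimension into a single score per depth plane
    cost = (ref_features.unsqueeze(1) * warped_features).sum(dim=2) / ch
    return cost
```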

phongnhhn92 commented 3 years ago

Hi, I think the range [0.25, 20] is reasonable to me. Btw, can you quickly point me to where you implemented this cost volume construction in your codebase?

ardaduz commented 3 years ago

Yes, that is the range. Cost volume calculation function: https://github.com/ardaduz/deep-video-mvs/blob/master/dvmvs/utils.py#L45

phongnhhn92 commented 3 years ago

I understand it now. So you compute the dot product between the feature map of the reference frame and each warped feature map of the measurement frame, then sum over the channels and divide by the number of channels. Anyway, this is a nice trick, and I guess it performs really well compared to 4D feature volumes. Has anyone used this way of constructing the cost volume before, or is your method the first to do this?

I have been able to train your model from scratch now. It is not yet finished, but I think it works :D Merry Christmas!

ardaduz commented 3 years ago

Glad to hear that training works.

Using correlation (dot product) for matching extracted features is among the well-known, standard similarity measures like L1 distance, L2 distance, cosine distance, etc. In terms of accuracy, as I said in my previous comment, 4D feature volumes may produce more accurate depth predictions but are generally slow to process with 3D convolutions. Since we aim for real-time applications and a lightweight warping operation at the bottleneck while processing video streams, we opt for such 3D cost volumes. If you're interested in an overview of the different approaches, this paper is a survey of several methods: https://arxiv.org/pdf/1906.06113.pdf
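
As a toy illustration of the similarity measures mentioned above, applied to two per-pixel feature vectors (correlation, i.e. the channel-averaged dot product, is the one used for the cost volume):

```python
import torch
import torch.nn.functional as F

f_ref = torch.randn(64)  # reference feature vector, CH = 64
f_src = torch.randn(64)  # warped measurement feature vector

correlation = torch.dot(f_ref, f_src) / f_ref.numel()  # dot product averaged over channels
l1_distance = (f_ref - f_src).abs().sum()
l2_distance = (f_ref - f_src).norm()
cosine = F.cosine_similarity(f_ref.unsqueeze(0), f_src.unsqueeze(0)).item()
```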