NagabhushanSN95 / DeCOMPnet

Official code release for the ISMAR 2022 paper "Temporal View Synthesis of Dynamic Scenes through 3D Object Motion Estimation with Multi-Plane Images"
MIT License

Question about training and testing dataset format #4

Closed jimmyfyx closed 10 months ago

jimmyfyx commented 11 months ago

Hi,

I want to ask a few questions about the training and testing dataset format, as I'm now trying to use my own dataset. Suppose I just want to use the dataloader for the VeedDynamic dataset; what should the dataset format look like? Specifically:

  1. I notice that for each video there is a PoseData.json, and for each sequence of a video there is also a PoseData.json. I wonder what the difference between these two is? Also, how can I interpret the json file? For example, what are Base, Base.001, and so on?
  2. For TransformationMatrices.csv and CameraIntrinsicMatrix.json, how are matrices represented? For example, is a matrix flattened by row or by column?
  3. What is the use of InterpolationData.json?
  4. From src/data_loaders/VeedDynamic01.py it seems that I only need to provide RGB images (.png), depth information (.exr), and transformation matrices (.csv) to train the model? But in src/flow_estimation/data_loaders/VeedDynamic01.py I see we also need RGB and depth images in .npy format?
  5. Optical flow is not needed if we don't want to load ground-truth flow, right? How about surface normals?
  6. Can the model be trained with different frame resolutions? I see there is a get_frame_resolution function in the dataloader, but I'm not sure whether that means it can read frames of any resolution?

Thanks so much for the reply!

NagabhushanSN95 commented 11 months ago

Hi,

  1. PoseData.json contains the pose information of all the objects in the scene (including the camera). It is not used anywhere in our code. We released additional data with our dataset, such as PoseData.json, surface normals, optical flow, etc., in case somebody wants to use the dataset for some other purpose. If you want to run our model on a new dataset, you don't need this.
  2. They're flattened by row, i.e. the first four elements of a row in TransformationMatrices.csv correspond to the first row of the matrix. You can verify this by checking the last 4 elements of TransformationMatrices.csv and the last 3 elements of CameraIntrinsicMatrix.json: they should be 0, 0, 0, 1 and 0, 0, 1 respectively (see the first sketch after this list).
  3. InterpolationData.json records how the object motions were interpolated between keyframes. Again, it is not needed to train our model.
  4. We train our models on small patches from the images, not on full images. So, instead of loading a full image and then cropping it, we use a numpy memory map to load only the cropped data. For this, the images/depth need to be saved as .npy files. You can either additionally save the images as .npy files or modify the data loader to load the .png images and crop them (a sketch follows this list).
  5. Nope, neither of them is needed.
  6. Yes, training the model with different resolutions should be possible. In continuation with point 4, we train our model on patches; as long as the frame size is larger than the patch size, it should work properly. Training with different patch sizes should also be possible, since all our networks are fully convolutional. However, all the patches in a batch should be of the same resolution, because PyTorch does not support processing a batch with mixed resolutions. The same holds at test time. Our model should work for arbitrary resolutions.
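
A minimal sketch of point 2, assuming each row of TransformationMatrices.csv ends with the 16 matrix entries (the exact column layout is an assumption; adapt it to the actual file):

```python
# Hypothetical illustration of point 2: the 4x4 transformation is stored row-major,
# so a plain numpy reshape recovers it. Adjust the slicing to the actual CSV layout.
import numpy
import pandas

poses = pandas.read_csv('TransformationMatrices.csv')      # path is an example
flat = poses.iloc[0].to_numpy()[-16:]                       # assumes the last 16 columns hold the matrix
transformation = flat.reshape(4, 4)                         # row-major, matching the answer above
assert numpy.allclose(transformation[3], [0, 0, 0, 1])      # last row should be 0, 0, 0, 1
```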
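
And a minimal sketch of the memory-mapped patch loading described in point 4; the file name and patch coordinates are made up for illustration:

```python
# Load a single patch from a frame saved as .npy without reading the whole array,
# using numpy's memory mapping (this mirrors the idea in point 4, not the repo's exact code).
import numpy

frame = numpy.load('1000.npy', mmap_mode='r')                    # memory-mapped, nothing read yet
y, x, patch_size = 128, 256, 256                                 # example crop location and size
patch = numpy.array(frame[y:y + patch_size, x:x + patch_size])   # only this slice is read from disk
```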

Hope my answers resolved your queries. Please let me know if you have any follow-up questions.

PS: I appreciate the detailed questions and the work you've done before asking them.

jimmyfyx commented 11 months ago

Thank you! A quick follow-up question on point 4: what are the .exr files used for in the VeedDynamic dataset, then?

By the way, is there a recommendation for how many video sequences should be used for training? And is there an estimate of how long the model takes to train?

NagabhushanSN95 commented 11 months ago
  1. Need for .exr files: We have three data loaders: one each to train the flow estimation and infilling networks, and one for final testing. The first two use .npy files, while the latter uses .png and .exr files. You can easily modify either of them to use a consistent format. The reason the tester data loader reads .png and .exr files instead of .npy files is to allow testing on any given video without having to convert it to .npy first (see the sketch after this list).

  2. How many videos? We use the pre-trained ARFlow model and only fine-tune it to estimate flow between MPIs. This is a relatively easy task, so about 500 full HD frames should be sufficient to learn it well. However, the inpainting network may need much more data. We didn't experiment further with the inpainting network, so it's not clear how much data is sufficient; it learnt well on our dataset, so I think 1500 full HD frames should be enough.

  3. Training time: Both models took about a day to train on our GPU. I've forgotten the details of the GPU, but it was 2-3 times slower than an NVIDIA RTX 2080.
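
A rough sketch of reading a .png/.exr pair and saving .npy copies for the training data loaders; the OpenCV flags and environment variable are assumptions about the reading library, not necessarily what the repository itself uses:

```python
# Convert a frame (.png) and its depth map (.exr) to .npy so the flow/infilling
# data loaders can memory-map them. Paths and library choice are illustrative only.
import os
os.environ['OPENCV_IO_ENABLE_OPENEXR'] = '1'  # some OpenCV builds disable EXR reading by default
import cv2
import numpy

rgb = cv2.cvtColor(cv2.imread('rgb/0000.png'), cv2.COLOR_BGR2RGB)
depth = cv2.imread('depth/0000.exr', cv2.IMREAD_ANYCOLOR | cv2.IMREAD_ANYDEPTH)
numpy.save('rgb/0000.npy', rgb)
numpy.save('depth/0000.npy', depth)
```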

jimmyfyx commented 11 months ago

OK thanks!

NagabhushanSN95 commented 10 months ago

Closing this issue. Please reopen it if required.