autonomousvision / differentiable_volumetric_rendering

This repository contains the code for the CVPR 2020 paper "Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision"
http://www.cvlibs.net/publications/Niemeyer2020CVPR.pdf
MIT License

Multi GPU Training #42

Closed tharinduk90 closed 3 years ago

tharinduk90 commented 3 years ago

I want to do multi-view reconstruction for real images, and so far I have got good results. Now I want to speed up training using multiple GPUs, because for complex models training takes more than 10 hours on my local PC (6 GB of GPU memory).

For multi-GPU training I added `multi_gpu: true` to the config file (`ours_depth_mvs.yaml`). I used a p3.8xlarge instance from AWS (4 GPUs, each with 16 GB of memory) for the multi-GPU test. The config file is as follows:

```yaml
data:
  path: data/DTU
  ignore_image_idx: []
  classes: ['scan244']
  dataset_name: DTU
  n_views: 51
  input_type: null
  train_split: null
  val_split: null
  test_split: null
  cache_fields: True
  split_model_for_images: true
  depth_range: [0., 1400.]
  img_extension: png
  img_extension_input: jpg
  depth_extension: png
  mask_extension: png
model:
  c_dim: 0
  encoder: null
  patch_size: 2
  lambda_image_gradients: 1.
  lambda_depth: 1.
  lambda_normal: 0.1
training:
  out_dir: out/multi_view_reconstruction/angel/ours_depth_mvs
  n_training_points: 2048
  n_eval_points: 8000
  model_selection_metric: mask_intersection
  model_selection_mode: maximize
  batch_size: 1
  batch_size_val: 1
  scheduler_milestones: [3000, 5000]
  scheduler_gamma: 0.5
  depth_loss_on_world_points: True
  validate_every: 5000
  visualize_every: 10000
  multi_gpu: true
generation:
  upsampling_steps: 4
  refinement_step: 30
```

But when I check GPU utilization, only GPU 0 is used:

[screenshot of GPU utilization showing only GPU 0 in use]

I have checked https://github.com/autonomousvision/differentiable_volumetric_rendering/issues/9.

1) Can you help me with multi-GPU training and provide guidance on how I can achieve it?
2) Can we increase `batch_size` and `batch_size_val` beyond one for the multi-view reconstruction?

m-niemeyer commented 3 years ago

Hi @tharinduk90, thanks for your interest in the project!

Unfortunately, we have not thoroughly tested multi-GPU training as we never used it; we always trained on a single GPU. As you already mentioned, this issue might be interesting for achieving results faster with less memory consumption. Regarding the batch sizes: in the multi-view reconstruction experiments, they indicate the number of images that are sampled, whereas in the single-view reconstruction experiments the batch size defines the number of objects to sample. This is set with the `split_model_for_images` argument, e.g. here.
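For what it's worth, the usual way to try multi-GPU training in PyTorch is to wrap the model in `torch.nn.DataParallel`, which replicates the model and splits each input batch across the available devices. A minimal sketch below; the `nn.Linear` model is a hypothetical stand-in for illustration, not the actual DVR model, and whether the DVR forward pass splits cleanly along the batch dimension has not been tested here:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in model; the real DVR model lives in im2mesh/dvr/models.
model = nn.Linear(8, 4)

# Wrap with DataParallel only when more than one GPU is available.
# Inputs are scattered across devices along dimension 0 (the batch dim).
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

# A batch of 4 samples would be split across the GPUs.
x = torch.randn(4, 8, device=device)
out = model(x)
print(out.shape)  # torch.Size([4, 4])
```

Note that `DataParallel` only helps if the effective batch size is greater than 1, which is exactly what fails in this repo's multi-view setting (see below in this thread); `DistributedDataParallel` has the same requirement.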

Good luck with your research!

tharinduk90 commented 3 years ago

@m-niemeyer, thank you very much for your reply.

In the multi-view experiment, if I set `batch_size: 2` and `batch_size_val: 2`, it gives the following error. (Since the batch size refers to the number of images sampled, I expected this to work.) Is there anything I am doing wrong? Can you help me with this?

```
Traceback (most recent call last):
  File "train.py", line 129, in <module>
    loss = trainer.train_step(batch, it)
  File "/home/liveroom/3d_reconstruction/dvr/differentiable_volumetric_rendering/im2mesh/dvr/training.py", line 112, in train_step
    loss = self.compute_loss(data, it=it)
  File "/home/liveroom/3d_reconstruction/dvr/differentiable_volumetric_rendering/im2mesh/dvr/training.py", line 405, in compute_loss
    p_world_hat_sparse, mask_pred_sparse, normals) = self.model(
  File "/home/liveroom/anaconda3/envs/dvr/lib/python3.8/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/liveroom/3d_reconstruction/dvr/differentiable_volumetric_rendering/im2mesh/dvr/models/__init__.py", line 83, in forward
    normals = self.get_normals(p_world.detach(), mask_pred, c=c)
  File "/home/liveroom/3d_reconstruction/dvr/differentiable_volumetric_rendering/im2mesh/dvr/models/__init__.py", line 117, in get_normals
    c = c.unsqueeze(1).repeat(1, points.shape[1], 1)[mask]
IndexError: The shape of the mask [2, 1024] at index 0 does not match the shape of the indexed tensor [1, 1024, 0] at index 0
```
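The error message suggests a batch-dimension mismatch: the prediction mask has batch size 2 (from the new `batch_size`), while the latent code `c` still has batch size 1 (and zero channels, since `c_dim: 0`). A minimal reproduction of the same class of boolean-indexing error, sketched in NumPy rather than PyTorch (the semantics of the leading-dimension check are the same):

```python
import numpy as np

# With batch_size: 2, the prediction mask has batch dimension 2.
mask = np.ones((2, 1024), dtype=bool)

# The (empty) latent code is still built with batch dimension 1,
# since c_dim is 0 in the config.
c = np.zeros((1, 1024, 0))

# A boolean mask must match the leading dimensions of the indexed array,
# so this raises an IndexError just like the traceback above.
try:
    c[mask]
except IndexError as e:
    print("IndexError:", e)
```

This points at the conditioning code path, not at multi-GPU handling itself: the mask and `c` would need the same batch dimension before the `c[mask]` indexing in `get_normals` can work with `batch_size > 1`.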