NVIDIA / flownet2-pytorch

Pytorch implementation of FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks

Multi-GPU seems to skip data #224

Closed Senwang98 closed 2 years ago

Senwang98 commented 3 years ago

When I use multiple GPUs for inference, the tqdm output seems to be wrong!

tqdm output for a single GPU:
`Inference Averages for Epoch 0: L1: 8.475, EPE: 14.679: 100%|█████| 50/50.0 [00:22<00:00, 2.53it/s]`

tqdm output for 2 GPUs:
`Inference Averages for Epoch 0: L1: 8.475, EPE: 14.679: 100%|█████| 25/25.0 [00:25<00:00, 2.15it/s]`

How can I solve this problem when using multiple GPUs?

WormPartner commented 3 years ago

Hello, have you solved this problem? I'm seeing the same thing.

Senwang98 commented 3 years ago

Sorry, I have no idea....

kyle-sama commented 3 years ago

The problem stems from this line:

`inference_loader = DataLoader(inference_dataset, batch_size=args.effective_inference_batch_size, shuffle=False, **inf_gpuargs)`

where `args.effective_inference_batch_size = args.inference_batch_size * args.number_gpus`.

If I have 59 usable frames and `number_gpus` is set to 2, the length of `inference_loader` is 29; with `number_gpus` set to 1 it comes out as 59. The workaround is to use only 1 GPU. I'm still not sure where the bug is, since it seems to go through a Torch class, but at the very least this is why you're seeing data skipped.
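The loader lengths above follow directly from how `DataLoader` computes its batch count: floor division of the dataset size by the batch size when `drop_last=True`, ceiling division otherwise. A minimal sketch of that arithmetic (plain Python, no torch; the per-GPU batch size of 1 is an assumption for illustration) reproduces the 59-vs-29 numbers:

```python
import math

def num_batches(num_samples, batch_size, drop_last=False):
    # Mirrors len(DataLoader): floor division drops the trailing
    # partial batch when drop_last=True, ceiling keeps it.
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

frames = 59
for gpus in (1, 2):
    # Hypothetical per-GPU batch size of 1, scaled by GPU count the
    # way effective_inference_batch_size is computed.
    effective_batch = 1 * gpus
    print(gpus, num_batches(frames, effective_batch, drop_last=True))
# 1 GPU  -> 59 batches
# 2 GPUs -> 29 batches (the 59th frame is silently dropped)
```

So with `drop_last=True` the odd frame out never reaches the model at all, whereas with `drop_last=False` you would get 30 batches and no skipped data; either way the iteration count halves, which is what tqdm is reporting.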

JosephKKim commented 2 years ago

Solved this issue by doing something similar to what @kyle-sama did, but I still can't clearly understand the mechanism... Thanks @kyle-sama

scpsc commented 2 years ago

See this fix: https://github.com/NVIDIA/flownet2-pytorch/issues/106#issuecomment-670935247