NVIDIA / flownet2-pytorch

Pytorch implementation of FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks

Training error: TypeError: rsub() received an invalid combination of arguments - got (Tensor, tuple) #191

Open kashyap92 opened 4 years ago

kashyap92 commented 4 years ago

Hello all, I am training FlowNet2C on the MPI Sintel data and I am running into this issue: TypeError: rsub() received an invalid combination of arguments - got (Tensor, tuple), but expected one of:

The complete error is:

  File "main.py", line 426, in <module>
    train_loss, iterations = train(args=args, epoch=epoch, start_iteration=global_iteration, data_loader=train_loader, model=model_and_loss, optimizer=optimizer, logger , offset=offset)
  0%| | 0/130.0 [00:00<?, ?it/s]
  File "main.py", line 267, in train
    losses = model(data[0], target[0])
  File "/datasets/Mahesh_kashyap/Opical_flow/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/datasets/Mahesh_kashyap/project/Opical_flow/env/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 141, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/datasets/Mahesh_kashyap/project/Opical_flow/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "main.py", line 172, in forward
    loss_values = self.loss(output, target)
  File "/datasets/Mahesh_kashyap/Optical_flow/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/datasets/Mahesh_kashyap/Optical_flow/flownet2-pytorch/losses.py", line 37, in forward
    lossvalue = self.loss(output, target)
  File "/datasets/Mahesh_kashyap/Optical_flow/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 489, in __call__
    result = self.forward(*input, **kwargs)
  File "/datasets/Mahesh_kashyap/Optical_flow/flownet2-pytorch/losses.py", line 18, in forward
    lossvalue = torch.abs(output - target).mean()
  File "/datasets/Mahesh_kashyap/Optical_flow/env/lib/python3.6/site-packages/torch/tensor.py", line 363, in __rsub__
    return _C._VariableFunctions.rsub(self, other)
TypeError: rsub() received an invalid combination of arguments - got (Tensor, tuple), but expected one of:

The command I am using is as follows:

CUDA_VISIBLE_DEVICES=1 python main.py --batch_size 8 --model FlowNet2C --loss=L1Loss --optimizer=Adam --optimizer_lr=1e-10 \
--training_dataset MpiSintelFinal --training_dataset_root /datasets/Mahesh_kashyap/Optical_flow/flownet2-pytorch/MPI-Sintel/flow/training \
--validation_dataset MpiSintelClean --validation_dataset_root /datasets/Mahesh_kashyap/Optical_flow/flownet2-pytorch/MPI-Sintel/flow/training

Any help would be much appreciated.

Thank you.

Queenyy commented 3 years ago

Hello, I get the same error when training FlowNet2C. How did you solve this problem?

Ahleroy commented 3 years ago

Hi, with FlowNet2C you should use the MultiScale loss with the L1 norm (--loss=MultiScale --loss_norm=L1). Using L1Loss directly is meant for the full FlowNet2 network.
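
For context on the original traceback: the message `got (Tensor, tuple)` is consistent with the network returning a tuple of flow predictions rather than a single tensor, so that `output - target` falls through to the target's reflected subtraction. The sketch below reproduces that dispatch in plain Python. `FakeTensor`, `l1_single`, and `l1_multiscale` are hypothetical names for illustration, not part of PyTorch or this repository, and the idea that FlowNet2C returns one prediction per scale during training is an assumption inferred from the error message.

```python
class FakeTensor:
    """Hypothetical stand-in for torch.Tensor; it only models subtraction."""

    def __init__(self, value):
        self.value = value

    def __sub__(self, other):
        # Handles `tensor - other` when `other` is also a FakeTensor.
        if not isinstance(other, FakeTensor):
            return NotImplemented
        return FakeTensor(self.value - other.value)

    def __rsub__(self, other):
        # Handles `other - tensor`; Python only gets here when type(other)
        # cannot perform the subtraction itself (a tuple cannot).
        raise TypeError(
            "rsub() received an invalid combination of arguments - got "
            f"({type(self).__name__}, {type(other).__name__})"
        )


def l1_single(output, target):
    # Same shape as the failing line: abs(output - target).
    return abs((output - target).value)


def l1_multiscale(outputs, target):
    # Apply the norm to each prediction in the tuple instead of the tuple itself.
    return [l1_single(output, target) for output in outputs]


target = FakeTensor(1.0)
single_output = FakeTensor(3.0)
multi_output = (FakeTensor(3.0), FakeTensor(2.0))  # assumed: one prediction per scale

print(l1_single(single_output, target))

try:
    l1_single(multi_output, target)  # tuple - FakeTensor
except TypeError as err:
    error_message = str(err)
    print(error_message)

print(l1_multiscale(multi_output, target))
```

This only illustrates why a plain L1 loss fails on a tuple. A real multi-scale loss would typically also resize the target to each prediction's resolution and weight the scales, which is presumably what options such as `loss_numScales`, `loss_startScale`, and `loss_l_weight` in the argument dump later in this thread control.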

ryngworks commented 3 years ago

What command should we use for FlowNet2S?

I ran this command and got the following error:

Command:

!CUDA_AVAILABLE_DEVICES=0 python main.py --total_epochs 100 --batch_size 8 --model FlowNet2S --loss=MultiScale \
--loss_norm=L1 --optimizer=Adam --optimizer_lr=1e-4 --skip_validation --crop_size 128 128 \
--training_dataset MpiSintelFinal \
--training_dataset_root /content/drive/MyDrive/fyp/MPI-Sintel-complete/training \
--resume /content/drive/MyDrive/fyp/FlowNet2-S_checkpoint.pth.tar \
--save /content/drive/MyDrive/checkpoints

Error:

Parsing Arguments
  [0.033s] batch_size: 8
  [0.033s] crop_size: [128, 128]
  [0.033s] fp16: False
  [0.033s] fp16_scale: 1024.0
  [0.033s] gradient_clip: None
  [0.033s] inference: False
  [0.033s] inference_batch_size: 1
  [0.033s] inference_dataset: MpiSintelClean
  [0.033s] inference_dataset_replicates: 1
  [0.033s] inference_dataset_root: ./MPI-Sintel/flow/training
  [0.033s] inference_n_batches: -1
  [0.033s] inference_size: [-1, -1]
  [0.033s] inference_visualize: False
  [0.033s] log_frequency: 1
  [0.033s] loss: MultiScale
  [0.033s] loss_l_weight: 0.32
  [0.033s] loss_norm: L1
  [0.033s] loss_numScales: 5
  [0.033s] loss_startScale: 4
  [0.033s] model: FlowNet2S
  [0.033s] model_batchNorm: False
  [0.033s] model_div_flow: 20
  [0.033s] name: run
  [0.033s] no_cuda: False
  [0.033s] number_gpus: 1
  [0.033s] number_workers: 8
  [0.033s] optimizer: Adam
  [0.033s] optimizer_amsgrad: False
  [0.033s] optimizer_betas: (0.9, 0.999)
  [0.033s] optimizer_eps: 1e-08
  [0.033s] optimizer_lr: 0.0001
  [0.033s] optimizer_weight_decay: 0
  [0.033s] render_validation: False
  [0.033s] resume: /content/drive/MyDrive/fyp/FlowNet2-S_checkpoint.pth.tar
  [0.033s] rgb_max: 255.0
  [0.033s] save: /content/drive/MyDrive/checkpoints
  [0.033s] save_flow: False
  [0.033s] schedule_lr_fraction: 10
  [0.033s] schedule_lr_frequency: 0
  [0.033s] seed: 1
  [0.033s] skip_training: False
  [0.033s] skip_validation: True
  [0.033s] start_epoch: 1
  [0.033s] total_epochs: 100
  [0.033s] train_n_batches: -1
  [0.034s] training_dataset: MpiSintelFinal
  [0.034s] training_dataset_replicates: 1
  [0.034s] training_dataset_root: /content/drive/MyDrive/fyp/MPI-Sintel-complete/training
  [0.034s] validation_dataset: MpiSintelClean
  [0.034s] validation_dataset_replicates: 1
  [0.034s] validation_dataset_root: ./MPI-Sintel/flow/training
  [0.034s] validation_frequency: 5
  [0.034s] validation_n_batches: -1
  [0.037s] Operation finished

Source Code
  Current Git Hash: b'00cff7e3c07547ecdfa1b3314252963a36e705ec'

Initializing Datasets
  [0.367s] Training Dataset: MpiSintelFinal
  [0.413s] Training Input: [3, 2, 128, 128]
  [0.454s] Training Targets: [2, 128, 128]
  [0.454s] Operation finished

Building FlowNet2S model
  [0.471s] Effective Batch Size: 8
  [0.471s] Number of parameters: 38676506
  [0.471s] Initializing CUDA
  [4.738s] Parallelizing
  [4.738s] Loading checkpoint '/content/drive/MyDrive/fyp/FlowNet2-S_checkpoint.pth.tar'
  [4.863s] Loaded checkpoint '/content/drive/MyDrive/fyp/FlowNet2-S_checkpoint.pth.tar' (at epoch 0)
  [4.863s] Initializing save directory: /content/drive/MyDrive/checkpoints
  [4.866s] Operation finished

Initializing Adam Optimizer
  [0.000s] amsgrad = False (<class 'bool'>)
  [0.000s] weight_decay = 0 (<class 'int'>)
  [0.000s] eps = 1e-08 (<class 'float'>)
  [0.000s] betas = (0.9, 0.999) (<class 'tuple'>)
  [0.000s] lr = 0.0001 (<class 'float'>)
  [0.000s] Operation finished

Overall Progress:   0%|                                                     | 0/101 [00:00<?, ?it/s]
Training Epoch 0:   0%|                                                                       | 0/130.0 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 439, in <module>
    train_loss, iterations = train(args=args, epoch=epoch, start_iteration=global_iteration, data_loader=train_loader, model=model_and_loss, optimizer=optimizer, logger=train_logger, offset=offset)
  File "main.py", line 294, in train
    loss_val.backward()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: Function MulBackward0 returned an invalid gradient at index 1 - expected type torch.cuda.FloatTensor but got torch.FloatTensor
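
One detail in the command above may be worth checking, separately from the gradient error: it sets `CUDA_AVAILABLE_DEVICES`, whereas the variable the CUDA runtime reads to restrict visible GPUs is `CUDA_VISIBLE_DEVICES` (the name used in the first post). On a single-GPU machine this likely makes no difference, and nothing here is claimed to resolve the `MulBackward0` error. A minimal pre-flight check using only the standard library could look like the sketch below; `check_cuda_env` is a hypothetical helper, not part of this repository.

```python
import os

# Variable the CUDA runtime reads to restrict which GPUs a process can see.
EXPECTED_NAME = "CUDA_VISIBLE_DEVICES"
# Name that appears in the command above.
SUSPECT_NAME = "CUDA_AVAILABLE_DEVICES"


def check_cuda_env(env=None):
    """Return a list of warnings about a likely misnamed device variable.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    if SUSPECT_NAME in env and EXPECTED_NAME not in env:
        return [
            f"{SUSPECT_NAME}={env[SUSPECT_NAME]!r} is set, but GPU selection "
            f"is controlled by {EXPECTED_NAME}"
        ]
    return []


# Examples with explicit dictionaries instead of the real environment.
print(check_cuda_env({"CUDA_AVAILABLE_DEVICES": "0"}))  # one warning
print(check_cuda_env({"CUDA_VISIBLE_DEVICES": "0"}))    # no warnings
```

Calling `check_cuda_env()` with no argument at the top of a training script would inspect the real environment instead.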

Bigfishering commented 1 year ago

I ran into the same problem. Have you managed to fix it? I look forward to your reply.