Training on own dataset?

ryngworks commented 3 years ago

Hi there, I am trying to use your notebook to train on my own dataset. Each sample in the dataset consists of a single image pair and its ground-truth flow. I have been trying to make this work in the notebook, but without success. Please advise.

Gauravv97 commented 3 years ago

Hi @ryannggy, the official repo gives the following command for training on Sintel data:

python main.py --batch_size 8 --model FlowNet2 --loss=L1Loss --optimizer=Adam --optimizer_lr=1e-4 \
--training_dataset MpiSintelFinal --training_dataset_root /path/to/mpi-sintel/final/dataset \
--validation_dataset MpiSintelClean --validation_dataset_root /path/to/mpi-sintel/clean/dataset

I guess the best way to achieve custom training would be to replace the Sintel data with your own images and .flo files.

Unlike inference, training has no ready-made option for custom datasets. The ImagesFromFolder class in datasets.py, which can be passed as the dataset argument for inference, does not fetch flow files.

You can try modifying the __getitem__() method of that class (taking the other classes above it as a reference) so that it also fetches flow files, and hope it works.
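As a rough illustration of what "fetching flow files" could involve, here is a minimal sketch that reads a Middlebury-style .flo file using only the standard library and pairs it with two images. The names `read_flo` and `FlowPairs`, and the caller-supplied `load_image` function, are hypothetical and not part of datasets.py. The real classes there return tensors in their own layout, so this would still need adapting.

```python
import struct
import sys
from array import array

FLO_MAGIC = 202021.25  # sanity-check float stored at the start of a .flo file


def read_flo(path):
    """Read a Middlebury .flo file and return (width, height, values).

    values is a flat float32 array in row-major order with u and v
    interleaved per pixel, so len(values) == 2 * width * height.
    """
    with open(path, "rb") as f:
        (magic,) = struct.unpack("<f", f.read(4))
        if magic != FLO_MAGIC:
            raise ValueError(f"{path}: bad .flo magic number {magic!r}")
        width, height = struct.unpack("<ii", f.read(8))
        values = array("f")
        values.frombytes(f.read(4 * 2 * width * height))
    if sys.byteorder == "big":
        values.byteswap()  # .flo data is stored little-endian
    if len(values) != 2 * width * height:
        raise ValueError(f"{path}: truncated flow data")
    return width, height, values


class FlowPairs:
    """Hypothetical dataset over (image1, image2, flow) path triples."""

    def __init__(self, samples, load_image):
        self.samples = samples        # list of (img1_path, img2_path, flo_path)
        self.load_image = load_image  # caller-supplied image loader (assumption)

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        img1_path, img2_path, flo_path = self.samples[index]
        images = [self.load_image(img1_path), self.load_image(img2_path)]
        flow = read_flo(flo_path)  # (width, height, values)
        return images, flow
```

Checking that the width and height returned by `read_flo` match the image size is a cheap way to catch mismatched samples before they reach the network.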

Please post your solution below if something works for you.

ryngworks commented 3 years ago

Actually, that is exactly what I did: I replaced the img and flo files in MpiSintel with my own. I managed to train the network with MpiSintel, but somehow it does not work when I replace the data with my own (error in the tensor dimensions). I will try to modify the __getitem__() method. Will update. Thank you.

Gauravv97 commented 3 years ago

@ryannggy, also make sure that your images and flow files have the same dimensions. The preferred size (height and width) is usually a multiple of 64, because the generated flow will also have dimensions that are multiples of 64. The MPI Sintel class already handles this by cropping.
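The multiple-of-64 rule can be sketched as a small helper that computes a centered crop box. The function names are hypothetical and this is not the repo's own cropping code. The same box would have to be applied to both images and to the flow so that they stay aligned.

```python
def largest_multiple(size, divisor=64):
    """Largest multiple of divisor that is <= size."""
    return (size // divisor) * divisor


def center_crop_box(height, width, divisor=64):
    """Return (top, left, crop_h, crop_w) for a centered crop whose
    height and width are both multiples of divisor."""
    crop_h = largest_multiple(height, divisor)
    crop_w = largest_multiple(width, divisor)
    if crop_h == 0 or crop_w == 0:
        raise ValueError("input is smaller than the divisor in some dimension")
    top = (height - crop_h) // 2
    left = (width - crop_w) // 2
    return top, left, crop_h, crop_w


# A 436 x 1024 frame (the Sintel resolution) is cropped to 384 x 1024.
print(center_crop_box(436, 1024))  # (26, 0, 384, 1024)
```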

ryngworks commented 3 years ago

I have managed to solve this. But I have another issue: I have been trying to save checkpoints after training for, e.g., 100 epochs, but the checkpoint is nowhere to be found (neither in Google Drive nor on the Colab disk). How should I run the command to save the checkpoint? Another thing: I can't seem to make the notebook work for the FlowNet2S version. Is it possible to use just FlowNet2S?

For the FlowNet2S issue, this is what I tried:

!CUDA_AVAILABLE_DEVICES=0 python main.py --total_epochs 15 --batch_size 8 --model FlowNet2S --loss=L1Loss \
--optimizer=Adam --optimizer_lr=1e-4 --skip_validation --crop_size 128 128 --training_dataset MpiSintelFinal \
--training_dataset_root /content/drive/MyDrive/fyp/MPI-Sintel-complete/training \
--resume /content/drive/MyDrive/fyp/FlowNet2-S_checkpoint.pth.tar \
--save ./content/drive/MyDrive/checkpoints

This is the error I got:

Parsing Arguments
  [0.034s] batch_size: 8
  [0.034s] crop_size: [128, 128]
  [0.034s] fp16: False
  [0.034s] fp16_scale: 1024.0
  [0.034s] gradient_clip: None
  [0.034s] inference: False
  [0.034s] inference_batch_size: 1
  [0.034s] inference_dataset: MpiSintelClean
  [0.034s] inference_dataset_replicates: 1
  [0.034s] inference_dataset_root: ./MPI-Sintel/flow/training
  [0.034s] inference_n_batches: -1
  [0.034s] inference_size: [-1, -1]
  [0.034s] inference_visualize: False
  [0.034s] log_frequency: 1
  [0.034s] loss: L1Loss
  [0.034s] model: FlowNet2S
  [0.034s] model_batchNorm: False
  [0.034s] model_div_flow: 20
  [0.034s] name: run
  [0.034s] no_cuda: False
  [0.034s] number_gpus: 1
  [0.034s] number_workers: 8
  [0.034s] optimizer: Adam
  [0.034s] optimizer_amsgrad: False
  [0.034s] optimizer_betas: (0.9, 0.999)
  [0.034s] optimizer_eps: 1e-08
  [0.034s] optimizer_lr: 0.0001
  [0.034s] optimizer_weight_decay: 0
  [0.034s] render_validation: False
  [0.034s] resume: /content/drive/MyDrive/fyp/FlowNet2-S_checkpoint.pth.tar
  [0.034s] rgb_max: 255.0
  [0.034s] save: ./content/drive/MyDrive/checkpoints
  [0.034s] save_flow: False
  [0.034s] schedule_lr_fraction: 10
  [0.034s] schedule_lr_frequency: 0
  [0.034s] seed: 1
  [0.034s] skip_training: False
  [0.034s] skip_validation: True
  [0.034s] start_epoch: 1
  [0.034s] total_epochs: 15
  [0.034s] train_n_batches: -1
  [0.034s] training_dataset: MpiSintelFinal
  [0.034s] training_dataset_replicates: 1
  [0.034s] training_dataset_root: /content/drive/MyDrive/fyp/MPI-Sintel-complete/training
  [0.034s] validation_dataset: MpiSintelClean
  [0.034s] validation_dataset_replicates: 1
  [0.034s] validation_dataset_root: ./MPI-Sintel/flow/training
  [0.034s] validation_frequency: 5
  [0.034s] validation_n_batches: -1
  [0.037s] Operation finished

Source Code
  Current Git Hash: b'00cff7e3c07547ecdfa1b3314252963a36e705ec'

Initializing Datasets
  [0.342s] Training Dataset: MpiSintelFinal
  [0.385s] Training Input: [3, 2, 128, 128]
  [0.424s] Training Targets: [2, 128, 128]
  [0.424s] Operation finished

Building FlowNet2S model
  [0.457s] Effective Batch Size: 8
  [0.457s] Number of parameters: 38676506
  [0.457s] Initializing CUDA
  [4.721s] Parallelizing
  [4.721s] Loading checkpoint '/content/drive/MyDrive/fyp/FlowNet2-S_checkpoint.pth.tar'
  [4.848s] Loaded checkpoint '/content/drive/MyDrive/fyp/FlowNet2-S_checkpoint.pth.tar' (at epoch 0)
  [4.848s] Initializing save directory: ./content/drive/MyDrive/checkpoints
  [4.851s] Operation finished

Initializing Adam Optimizer
  [0.000s] amsgrad = False (<class 'bool'>)
  [0.000s] weight_decay = 0 (<class 'int'>)
  [0.000s] eps = 1e-08 (<class 'float'>)
  [0.000s] betas = (0.9, 0.999) (<class 'tuple'>)
  [0.000s] lr = 0.0001 (<class 'float'>)
  [0.000s] Operation finished

Overall Progress:   0%|                                                      | 0/16 [00:00<?, ?it/s]
Training Epoch 0:   0%|                                                                       | 0/130.0 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 439, in <module>
    train_loss, iterations = train(args=args, epoch=epoch, start_iteration=global_iteration, data_loader=train_loader, model=model_and_loss, optimizer=optimizer, logger=train_logger, offset=offset)
  File "main.py", line 269, in train
    losses = model(data[0], target[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "main.py", line 174, in forward
    loss_values = self.loss(output, target)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/flownet2pytorch/losses.py", line 36, in forward
    lossvalue = self.loss(output, target)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/flownet2pytorch/losses.py", line 18, in forward
    lossvalue = torch.abs(output - target).mean()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 320, in __rsub__
    return _C._VariableFunctions.rsub(self, other)
TypeError: rsub() received an invalid combination of arguments - got (Tensor, tuple), but expected one of:
 * (Tensor input, Tensor other, Number alpha)
 * (Tensor input, Number other, Number alpha)

Once again, thank you so much for your help!

Gauravv97 commented 3 years ago

Hi @ryannggy, on the FlowNet2S error, I could only find this possible solution. Hope this helps.
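For context, the last lines of the traceback say that rsub() received (Tensor, tuple). So when the loss evaluated output - target, output was a tuple rather than a single tensor. Below is a minimal sketch of that failure and one possible guard, using plain Python lists as stand-ins for tensors. It assumes the model returns several predictions in training mode and that the first one is the one to compare against the ground truth. Neither is confirmed in this thread, and `first_prediction` is a hypothetical helper.

```python
def l1_loss(output, target):
    """Mean absolute error between two equal-length sequences of numbers."""
    if len(output) != len(target):
        raise ValueError("output and target must have the same length")
    return sum(abs(o - t) for o, t in zip(output, target)) / len(target)


def first_prediction(output):
    """If the model returned a tuple of predictions, keep only the first."""
    return output[0] if isinstance(output, tuple) else output


target = [1.0, 2.0, 3.0]
# Stand-in for a model that returns several predictions at once (assumption).
multi_output = ([1.5, 2.0, 2.5], [0.0, 0.0])

# Subtracting directly fails, much like the traceback in this thread.
try:
    multi_output - target
except TypeError as err:
    print("TypeError:", err)

# Unwrapping first gives the loss a single prediction to work with.
loss = l1_loss(first_prediction(multi_output), target)
print(loss)
```

A loss that is written to accept several predictions at once would be the other way around this. Which option fits depends on what the model actually returns.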

As for the missing checkpoints, I really don't have a clue what could be missing. As long as you are not using the skip_training and skip_validation flags together, they should be saved as .pth.tar files in the --save location (./work by default).

Also, I am curious: what was the original issue that you encountered, and how did you get it resolved?
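One thing that may be worth checking: the command above passes --save ./content/drive/MyDrive/checkpoints, and the log echoes the same path with its leading "./". A path that starts with "./" is resolved relative to the current working directory, not from the filesystem root. The sketch below shows the difference. The helper `resolve_save_dir` is hypothetical, and the working directory is assumed to be /content/flownet2pytorch based on the paths in the traceback.

```python
import posixpath


def resolve_save_dir(save_arg, cwd):
    """Resolve a --save style path the way a POSIX process would."""
    if posixpath.isabs(save_arg):
        return posixpath.normpath(save_arg)
    return posixpath.normpath(posixpath.join(cwd, save_arg))


cwd = "/content/flownet2pytorch"  # assumed working directory of main.py

# Relative path: ends up inside the working directory.
print(resolve_save_dir("./content/drive/MyDrive/checkpoints", cwd))
# /content/flownet2pytorch/content/drive/MyDrive/checkpoints

# Absolute path: points at the mounted Drive folder.
print(resolve_save_dir("/content/drive/MyDrive/checkpoints", cwd))
# /content/drive/MyDrive/checkpoints
```

If that assumption holds, the checkpoints may be sitting under the repo folder on the Colab disk, and dropping the leading dot from --save would send them to Drive instead.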

ryngworks commented 3 years ago

Hi @gauravv97, you are correct. The issue was solved by handling the flow file dimensions. Your advice has been a great help. Thank you.

Gauravv97 / flownet2-Colab

Training on own dataset? #3