Closed ryngworks closed 3 years ago
Hi @ryannggy, the official repo has the following command for training on Sintel data
python main.py --batch_size 8 --model FlowNet2 --loss=L1Loss --optimizer=Adam --optimizer_lr=1e-4 \
--training_dataset MpiSintelFinal --training_dataset_root /path/to/mpi-sintel/final/dataset \
--validation_dataset MpiSintelClean --validation_dataset_root /path/to/mpi-sintel/clean/dataset
I guess the best way to achieve custom training would be to replace the Sintel data with your own data and flo files.
Unlike inference, there is no option for training on your own dataset with the ImagesFromFolder argument, because the ImagesFromFolder class in datasets.py does not fetch flow files.
You can try modifying its __getitem__() method (taking the other classes above it as a reference) so that it also fetches the flow files, and hope it works.
Please post your solution below if something works for you.
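As a starting point for such a modification, here is a minimal sketch of the file-pairing step only. It assumes a hypothetical naming scheme (NAME_img1.png, NAME_img2.png and NAME_flow.flo in one folder); the function name list_flow_samples and this layout are illustrative and not part of the repo. A custom dataset class could build this list once and index into it from __getitem__().

```python
import os
from glob import glob


def list_flow_samples(root, image_ext=".png"):
    """Return [(img1, img2, flow), ...] for every complete sample under root.

    Assumes a hypothetical naming scheme: NAME_flow.flo sits next to
    NAME_img1<ext> and NAME_img2<ext>. Adjust this to match your own data.
    """
    samples = []
    for flow_path in sorted(glob(os.path.join(root, "*_flow.flo"))):
        stem = flow_path[: -len("_flow.flo")]
        img1 = stem + "_img1" + image_ext
        img2 = stem + "_img2" + image_ext
        # Skip incomplete samples here instead of failing later during training.
        if os.path.isfile(img1) and os.path.isfile(img2):
            samples.append((img1, img2, flow_path))
    return samples
```

Samples with a missing image are skipped on purpose, so a broken pair shows up as a shorter list rather than as an error in the middle of an epoch.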
Actually, that is exactly what I did: I replaced the img and flo files in MpiSintel with my own. I managed to train the network with MpiSintel, but somehow it does not work when I replace the data with my own (error in the tensor dimensions). I will try to modify the __getitem__() method. Will update. Thank you.
@ryannggy, also make sure that your images and flow files have the same dimensions. Usually the preferred size (h & w) is a multiple of 64, because the generated flow will also be a multiple of 64. This is already handled in the MPI Sintel class using crop.
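A quick way to check the dimensions before training is to read the header of each flow file. The sketch below assumes the flow files use the Middlebury .flo layout (a float32 magic value 202021.25, then int32 width and int32 height, all little-endian). The helper names read_flo_size and largest_crop are illustrative and do not come from the repo.

```python
import struct

FLO_MAGIC = 202021.25  # sanity-check value at the start of a Middlebury .flo file


def read_flo_size(path):
    """Return (height, width) as stored in the 12-byte header of a .flo file."""
    with open(path, "rb") as f:
        header = f.read(12)
    if len(header) != 12:
        raise ValueError("file too short to be a .flo file: %s" % path)
    # Header layout: float32 magic, int32 width, int32 height (little-endian).
    magic, width, height = struct.unpack("<fii", header)
    if magic != FLO_MAGIC:
        raise ValueError("unexpected magic number in %s" % path)
    return height, width


def largest_crop(height, width, multiple=64):
    """Largest (h, w) that fits inside the input with both sides divisible by `multiple`."""
    return (height // multiple) * multiple, (width // multiple) * multiple
```

For a Sintel-sized frame, largest_crop(436, 1024) gives (384, 1024). Comparing read_flo_size() against the size of the matching images catches mismatched pairs before they surface as a tensor dimension error.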
I have managed to solve this. But I have another issue: I have been trying to save the checkpoints after training for, e.g., 100 epochs, but the checkpoint is nowhere to be found (neither in Google Drive nor in the Colab drive). How should I run the command to save the checkpoint? Another thing: I can't seem to make the notebook work for the FlowNet2S version. Is it possible to use just FlowNet2S?
For the FlowNet2S issue, this is what I tried:
!CUDA_AVAILABLE_DEVICES=0 python main.py --total_epochs 15 --batch_size 8 --model FlowNet2S --loss=L1Loss \
--optimizer=Adam --optimizer_lr=1e-4 --skip_validation --crop_size 128 128 --training_dataset MpiSintelFinal \
--training_dataset_root /content/drive/MyDrive/fyp/MPI-Sintel-complete/training \
--resume /content/drive/MyDrive/fyp/FlowNet2-S_checkpoint.pth.tar \
--save ./content/drive/MyDrive/checkpoints
This is the error I got:
Parsing Arguments
[0.034s] batch_size: 8
[0.034s] crop_size: [128, 128]
[0.034s] fp16: False
[0.034s] fp16_scale: 1024.0
[0.034s] gradient_clip: None
[0.034s] inference: False
[0.034s] inference_batch_size: 1
[0.034s] inference_dataset: MpiSintelClean
[0.034s] inference_dataset_replicates: 1
[0.034s] inference_dataset_root: ./MPI-Sintel/flow/training
[0.034s] inference_n_batches: -1
[0.034s] inference_size: [-1, -1]
[0.034s] inference_visualize: False
[0.034s] log_frequency: 1
[0.034s] loss: L1Loss
[0.034s] model: FlowNet2S
[0.034s] model_batchNorm: False
[0.034s] model_div_flow: 20
[0.034s] name: run
[0.034s] no_cuda: False
[0.034s] number_gpus: 1
[0.034s] number_workers: 8
[0.034s] optimizer: Adam
[0.034s] optimizer_amsgrad: False
[0.034s] optimizer_betas: (0.9, 0.999)
[0.034s] optimizer_eps: 1e-08
[0.034s] optimizer_lr: 0.0001
[0.034s] optimizer_weight_decay: 0
[0.034s] render_validation: False
[0.034s] resume: /content/drive/MyDrive/fyp/FlowNet2-S_checkpoint.pth.tar
[0.034s] rgb_max: 255.0
[0.034s] save: ./content/drive/MyDrive/checkpoints
[0.034s] save_flow: False
[0.034s] schedule_lr_fraction: 10
[0.034s] schedule_lr_frequency: 0
[0.034s] seed: 1
[0.034s] skip_training: False
[0.034s] skip_validation: True
[0.034s] start_epoch: 1
[0.034s] total_epochs: 15
[0.034s] train_n_batches: -1
[0.034s] training_dataset: MpiSintelFinal
[0.034s] training_dataset_replicates: 1
[0.034s] training_dataset_root: /content/drive/MyDrive/fyp/MPI-Sintel-complete/training
[0.034s] validation_dataset: MpiSintelClean
[0.034s] validation_dataset_replicates: 1
[0.034s] validation_dataset_root: ./MPI-Sintel/flow/training
[0.034s] validation_frequency: 5
[0.034s] validation_n_batches: -1
[0.037s] Operation finished
Source Code
Current Git Hash: b'00cff7e3c07547ecdfa1b3314252963a36e705ec'
Initializing Datasets
[0.342s] Training Dataset: MpiSintelFinal
[0.385s] Training Input: [3, 2, 128, 128]
[0.424s] Training Targets: [2, 128, 128]
[0.424s] Operation finished
Building FlowNet2S model
[0.457s] Effective Batch Size: 8
[0.457s] Number of parameters: 38676506
[0.457s] Initializing CUDA
[4.721s] Parallelizing
[4.721s] Loading checkpoint '/content/drive/MyDrive/fyp/FlowNet2-S_checkpoint.pth.tar'
[4.848s] Loaded checkpoint '/content/drive/MyDrive/fyp/FlowNet2-S_checkpoint.pth.tar' (at epoch 0)
[4.848s] Initializing save directory: ./content/drive/MyDrive/checkpoints
[4.851s] Operation finished
Initializing Adam Optimizer
[0.000s] amsgrad = False (<class 'bool'>)
[0.000s] weight_decay = 0 (<class 'int'>)
[0.000s] eps = 1e-08 (<class 'float'>)
[0.000s] betas = (0.9, 0.999) (<class 'tuple'>)
[0.000s] lr = 0.0001 (<class 'float'>)
[0.000s] Operation finished
Overall Progress: 0%| | 0/16 [00:00<?, ?it/s]
Training Epoch 0: 0%| | 0/130.0 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 439, in <module>
    train_loss, iterations = train(args=args, epoch=epoch, start_iteration=global_iteration, data_loader=train_loader, model=model_and_loss, optimizer=optimizer, logger=train_logger, offset=offset)
  File "main.py", line 269, in train
    losses = model(data[0], target[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/data_parallel.py", line 150, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "main.py", line 174, in forward
    loss_values = self.loss(output, target)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/flownet2pytorch/losses.py", line 36, in forward
    lossvalue = self.loss(output, target)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "/content/flownet2pytorch/losses.py", line 18, in forward
    lossvalue = torch.abs(output - target).mean()
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 320, in __rsub__
    return _C._VariableFunctions.rsub(self, other)
TypeError: rsub() received an invalid combination of arguments - got (Tensor, tuple), but expected one of:
 * (Tensor input, Tensor other, Number alpha)
 * (Tensor input, Number other, Number alpha)
Once again, thank you so much for your help!
Hi @ryannggy, on the FlowNet2S error, I could only find this possible solution. Hope this helps.
As for the missing checkpoints, I really don't have a clue what could be wrong. As long as you are not using the skip_training and skip_validation flags together, they should be saved in the --save location (./work by default) as .pth.tar files.
Also, I am curious: what was the original issue that you encountered, and how did you get it resolved?
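One thing that may be worth checking for the missing checkpoints: the log above reports "Initializing save directory: ./content/drive/MyDrive/checkpoints", which is a relative path (note the leading dot). If main.py was run from /content/flownet2pytorch (an assumption based on the paths in the traceback), that directory would sit inside the repo folder rather than on the mounted Drive. The sketch below shows where such a value resolves and how to search for .pth.tar files. Both helper names are illustrative.

```python
import os
import posixpath


def resolve_save_dir(save_arg, cwd):
    """Absolute directory a --save value points to when the script runs from cwd.

    Uses POSIX path rules because Colab paths are POSIX paths.
    """
    return posixpath.normpath(posixpath.join(cwd, save_arg))


def find_checkpoints(root):
    """Recursively list every *.pth.tar file below root, sorted by path."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.endswith(".pth.tar"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Under that assumption, resolve_save_dir("./content/drive/MyDrive/checkpoints", "/content/flownet2pytorch") returns "/content/flownet2pytorch/content/drive/MyDrive/checkpoints", while an absolute --save value such as /content/drive/MyDrive/checkpoints is returned unchanged. Running find_checkpoints("/content") in the notebook would list any checkpoint files that were written.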
Hi @gauravv97, you are correct. The issue was solved by handling the flow file dimensions. Your advice has been a great help. Thank you.
Hi there, I am trying to use your notebook to train on my own dataset. Each sample in the dataset consists of a single image pair and its ground-truth flow. I have been trying to make this work in the notebook, but have not been able to. Please advise.