Closed margokhokhlova closed 3 years ago
Thanks for raising the issue, @margokhokhlova
Was this during following the steps in the README for the PROBA-V competition dataset? Or can you expand a bit more on your context?
Thank you for the quick answer! I just cloned the repo and would like to train the model so I can test it on new data. I followed the readme and ran save_clearance. The data notebook works fine. Then I tried to train using the provided config file, i.e. with patch_size = 32. At train.py L174, the shape of src is (B, 1, 96, 96), where 96 is patch_size * 3. However, in the next step, inside the register_batch function, I get an error because I am trying to concatenate incompatible shapes. The lrs and reference inputs to register_batch have shapes (64, 1, 16, 16) and (1, 1, 128, 128).
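A minimal sketch (plain PyTorch, with tensor contents faked; only the shapes come from the log) of why these two shapes cannot be combined:

```python
import torch

# Shapes reported at the input of register_batch:
lrs = torch.zeros(64, 1, 16, 16)         # (batch, views, W, H): low-res views
reference = torch.zeros(1, 1, 128, 128)  # reference crop, hard-coded to 128x128

# torch.cat requires every dimension except the concatenation dim to match,
# so both the batch (64 vs 1) and spatial (16 vs 128) sizes make this fail:
try:
    torch.cat([lrs, reference], dim=1)
except RuntimeError as err:
    print("cat failed:", err)
```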
new dataset, as in different from the PROBA-V competition data?
In which case, can you please provide more info about this other dataset?
Please also provide a snippet of the code where the error occurs.
I am sorry, I didn't write it clearly.
I would love to test the algorithm on the new data, but so far for the training I use the dataset from here, as advised in the readme:
https://kelvins.esa.int/proba-v-super-resolution/data/
Here is the code snippet:
python src/train.py --config config/config.json
/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
return torch._C._cuda_getDeviceCount() > 0
0%| | 0/400 [00:00<?, ?it/s]
0%| | 0/17 [00:00<?, ?it/s]
torch.Size([64, 1, 16, 16])
torch.Size([1, 1, 128, 128])
0%| | 0/17 [00:13<?, ?it/s]
0%| | 0/400 [00:13<?, ?it/s]
Traceback (most recent call last):
File "src/train.py", line 310, in
Have you made any progress in debugging this yourself?
It will be a while till I get to reproduce your error.
No. I can see that the problem is in the sizes, but when I fix it manually, more errors follow further down the pipeline. Overall, my feeling is that the problem comes from the hard-coded offset + 128 in the input to the register_batch function inside the training loop.
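In case it helps: with patch_size = 32 and the PROBA-V upscaling factor of 3, the HR patch is 96x96, so a hard-coded 128-pixel crop overruns it. A hypothetical sketch of the arithmetic (the variable names are assumptions, not taken from the repo):

```python
# With the provided config, the HR patch is smaller than the hard-coded crop.
patch_size = 32
scale = 3                     # PROBA-V LR -> HR upscaling factor
hr_size = patch_size * scale  # 96, matching the (B, 1, 96, 96) shape above

hardcoded_crop = 128
print(hardcoded_crop > hr_size)  # True: a 128-pixel crop overruns the patch

# A possible fix would be to derive the crop from the config instead, e.g.:
# reference = hrs[:, offset:offset + hr_size, offset:offset + hr_size]
```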
I am closing the issue: it works on a GPU, but the problem comes from running on a CPU, and I have not been able to find out why. Thank you very much for your answers!
Thanks for resolving this. It wasn't obvious to me at all that CPU vs GPU was the issue.
Hello! I have a problem running the script. I use your docker-compose and am getting this: File "src/train.py", line 309, in <module>
main(config)
File "src/train.py", line 294, in main
trainAndGetBestModel(fusion_model, regis_model, optimizer, dataloaders, baseline_cpsnrs, config)
File "src/train.py", line 179, in trainAndGetBestModel
reference=hrs[:, offset:(offset + 128), offset:(offset + 128)].view(-1, 1, 128, 128))
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead
If I change it as the error suggests, I then get tensors of incompatible sizes here: lrs: tensor (batch size, views, W, H), the images to shift; reference: tensor (batch size, W, H), the reference images to shift. They end up as (64, 1, 16, 16) and (1, 1, 128, 128).
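For context, the .view vs .reshape part of the error can be reproduced in isolation (plain PyTorch, unrelated to the repo's tensors): .view requires a contiguous tensor, while .reshape copies when necessary.

```python
import torch

x = torch.arange(12).reshape(3, 4)
y = x.t()                 # transposing makes the tensor non-contiguous
print(y.is_contiguous())  # False

try:
    y.view(12)            # raises the "view size is not compatible" RuntimeError
except RuntimeError:
    print("view failed")

z = y.reshape(12)         # works: copies into contiguous memory if needed
print(z.shape)            # torch.Size([12])
```

Note that switching to .reshape only silences this particular error; the underlying shape mismatch between lrs and reference remains.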
Thank you for your code!