training problem because of the different sizes

margokhokhlova commented 3 years ago

Hello! I have a problem running the script. I am using your docker-compose and get this:

    File "src/train.py", line 309, in <module>
      main(config)
    File "src/train.py", line 294, in main
      trainAndGetBestModel(fusion_model, regis_model, optimizer, dataloaders, baseline_cpsnrs, config)
    File "src/train.py", line 179, in trainAndGetBestModel
      reference=hrs[:, offset:(offset + 128), offset:(offset + 128)].view(-1, 1, 128, 128))
    RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead
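For readers wondering why `.view` is rejected here, the following is a minimal pure-Python sketch of the stride rule behind the message, not PyTorch's actual implementation. It assumes a row-major layout and illustrative shapes (a `(64, 96, 96)` batch cropped to `(64, 16, 16)`), and the helper names are hypothetical. A crop keeps the strides of its parent tensor, so its dimensions can no longer be merged into one flat run without copying, which is what `.view(-1, 1, 128, 128)` would need.

```python
from typing import Sequence, Tuple


def contiguous_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Element strides of a freshly allocated row-major tensor of this shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def can_flatten_without_copy(shape: Sequence[int], strides: Sequence[int]) -> bool:
    """True if every pair of adjacent dimensions can be merged without a copy.

    Dimensions i and i + 1 merge only when
    strides[i] == strides[i + 1] * shape[i + 1].
    Simplification: size-1 dimensions are not special-cased here.
    """
    return all(
        strides[i] == strides[i + 1] * shape[i + 1]
        for i in range(len(shape) - 1)
    )


# Illustrative numbers only (assumption): a (64, 96, 96) batch and a 16x16 crop.
full_shape = (64, 96, 96)
full_strides = contiguous_strides(full_shape)  # (9216, 96, 1)
crop_shape = (64, 16, 16)                      # a slice keeps the parent strides

print(can_flatten_without_copy(full_shape, full_strides))  # True
print(can_flatten_without_copy(crop_shape, full_strides))  # False
```

`.reshape` sidesteps the check by copying the data when needed, so it silences this error without confirming that the resulting shape is the intended one.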

If I change it as the error suggests, I get tensors of incompatible sizes here:

    lrs: tensor (batch size, views, W, H), images to shift
    reference: tensor (batch size, W, H), reference images to shift

They end up as 64,1,16,16 and 1,1,128,128 respectively.

Thank you for your code!

alkalait commented 3 years ago

Thanks for raising the issue, @margokhokhlova

Was this during following the steps in the README for the PROBA-V competition dataset? Or can you expand a bit more on your context?

margokhokhlova commented 3 years ago

Thank you for the quick answer! I just cloned the repo and would like to train the model and then test it on new data. I followed the README and ran save_clearance. The data notebook works fine. Then I tried to train using the provided config file, so with patch_size = 32. At train.py L174, the shape of the src is B,1,96,96, where 96 is patch_size*3. However, in the next step, in the register_batch function, I get the error because I am trying to concatenate incompatible shapes. The lrs and reference shapes at the input of register_batch are 64,1,16,16 and 1,1,128,128 respectively.
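The reported shapes can be reproduced with plain slice arithmetic. The sketch below is an illustration under stated assumptions, not the repository's code: the thread does not show how `offset` is computed, so a centered crop, `(96 - 128) // 2 = -16`, is assumed, and both helper names are hypothetical. It also relies on PyTorch basic slicing clamping out-of-range bounds the way Python sequences do.

```python
from functools import reduce
from operator import mul


def sliced_len(size, start, stop):
    """Length of seq[start:stop] for a sequence of `size` items (Python slicing rules)."""
    return len(range(*slice(start, stop).indices(size)))


def reshape_leading_dim(shape, tail):
    """Leading dimension that reshape(-1, *tail) would infer for `shape`."""
    total = reduce(mul, shape, 1)
    tail_total = reduce(mul, tail, 1)
    if total % tail_total:
        raise ValueError(f"cannot reshape {shape} into (-1, {tail})")
    return total // tail_total


batch, hr_size, crop = 64, 96, 128  # hr_size = patch_size * 3 with patch_size = 32
offset = (hr_size - crop) // 2      # ASSUMPTION: centered crop, gives -16 here

# A 128-pixel window cannot fit in 96 pixels; the slice is silently clamped.
side = sliced_len(hr_size, offset, offset + crop)
cropped = (batch, side, side)

# 64 * 16 * 16 == 128 * 128, so reshape folds the whole batch into one image.
reference = (reshape_leading_dim(cropped, (1, crop, crop)), 1, crop, crop)

print(cropped)    # (64, 16, 16)
print(reference)  # (1, 1, 128, 128)
```

Under that assumption, the 16-pixel crop and the batch of 64 collapsing into a single 128x128 reference both match the sizes quoted above, which would point at the combination of patch_size = 32 and the fixed 128-pixel window.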

alkalait commented 3 years ago

New dataset, as in different from the PROBA-V competition data?

In which case, can you please provide more info about this other dataset?

Please also provide a snippet of the code where the error occurs.

margokhokhlova commented 3 years ago

I am sorry, I didn't write it clearly. I would love to test the algorithm on new data, but so far for training I am using the dataset from here, as advised in the README: https://kelvins.esa.int/proba-v-super-resolution/data/

Here is the code snippet:

    python src/train.py --config config/config.json
    /usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
      return torch._C._cuda_getDeviceCount() > 0
    0%| | 0/400 [00:00<?, ?it/s
    torch.Size([64, 1, 16, 16]) | 0/17 [00:00<?, ?it/s]
    torch.Size([1, 1, 128, 128])
    0%| | 0/17 [00:13<?, ?it/s]
    0%| | 0/400 [00:13<?, ?it/s]
    Traceback (most recent call last):
      File "src/train.py", line 310, in <module>
        main(config)
      File "src/train.py", line 295, in main
        trainAndGetBestModel(fusion_model, regis_model, optimizer, dataloaders, baseline_cpsnrs, config)
      File "src/train.py", line 180, in trainAndGetBestModel
        reference=hrs[:, offset:(offset + 128), offset:(offset + 128)].reshape(-1, 1, 128, 128))
      File "src/train.py", line 41, in register_batch
        theta = shiftNet(torch.cat([reference, lrs[:, i : i + 1]], 1))
    RuntimeError: Sizes of tensors must match except in dimension 1. Got 1 and 64 in dimension 0 (The offending index is 1)
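The final RuntimeError follows from the concatenation rule: every dimension except the one being concatenated must match. Here is a small pure-Python sketch of that rule applied to the two printed shapes. `cat_shape` is a hypothetical helper written for illustration, not a PyTorch function, and its error text only paraphrases the real message.

```python
def cat_shape(shapes, dim):
    """Shape that concatenating tensors of `shapes` along `dim` would produce.

    Raises ValueError when any other dimension differs between inputs.
    """
    first = shapes[0]
    out = list(first)
    for shape in shapes[1:]:
        if len(shape) != len(first):
            raise ValueError(f"rank mismatch: {first} vs {shape}")
        for axis, (a, b) in enumerate(zip(first, shape)):
            if axis != dim and a != b:
                raise ValueError(
                    f"sizes must match except in dimension {dim}: "
                    f"got {a} and {b} in dimension {axis}"
                )
        out[dim] += shape[dim]
    return tuple(out)


# What register_batch presumably expects: one reference per low-res view.
print(cat_shape([(64, 1, 16, 16), (64, 1, 16, 16)], dim=1))  # (64, 2, 16, 16)

# The shapes printed in the log above: reference first, then lrs[:, i : i + 1].
try:
    cat_shape([(1, 1, 128, 128), (64, 1, 16, 16)], dim=1)
except ValueError as err:
    print(err)
```

The batch dimension is checked first, which is why the log reports "Got 1 and 64 in dimension 0" even though the spatial dimensions differ as well.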

alkalait commented 3 years ago

Have you made any progress in debugging this yourself?

It will be a while till I get to reproduce your error.

margokhokhlova commented 3 years ago

No. I can see that the problem is in the sizes, but if I fix it manually, more errors follow further down the pipeline. Overall, my feeling is that the problem comes from the hard-coded offset + 128 in the input to the register_batch function inside the training loop.
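One way to test the hard-coded-window hypothesis is to fail fast when the crop does not fit, instead of letting the slice clamp silently. The helper below is a hypothetical sketch, not part of the repository, and the example sizes (192 and 96 pixels, i.e. patch_size 64 and 32 times 3) are assumptions.

```python
def center_crop_bounds(size, crop):
    """(start, stop) of a centered window of `crop` pixels inside `size` pixels.

    Raises ValueError when the window is larger than the image side, rather
    than returning bounds that a slice would silently clamp.
    """
    if crop > size:
        raise ValueError(
            f"crop window {crop} is larger than the image side {size}"
        )
    start = (size - crop) // 2
    return start, start + crop


print(center_crop_bounds(192, 128))  # (32, 160)

try:
    center_crop_bounds(96, 128)
except ValueError as err:
    print(err)  # crop window 128 is larger than the image side 96
```

A check like this before the call to register_batch would turn the downstream shape mismatch into an immediate, readable error.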

margokhokhlova commented 3 years ago

I am closing the issue. It works on a GPU; the problem only occurs when running on a CPU, and I have not been able to find out why. Thank you very much for your answers!

alkalait commented 3 years ago

Thanks for resolving this. It wasn't obvious to me at all that CPU vs. GPU was the issue.

ServiceNow / HighRes-net

training problem because of the different sizes #7