ServiceNow / HighRes-net

PyTorch implementation of HighRes-net, a neural network for multi-frame super-resolution, trained and tested on the European Space Agency’s Kelvin competition. This is a ServiceNow Research project that was started at Element AI.
https://www.elementai.com/news/2019/computer-enhance-please

training problem because of the different sizes #7

Closed margokhokhlova closed 3 years ago

margokhokhlova commented 3 years ago

Hello! I have a problem running the script. I use your docker-compose and get this:

  File "src/train.py", line 309, in <module>
    main(config)
  File "src/train.py", line 294, in main
    trainAndGetBestModel(fusion_model, regis_model, optimizer, dataloaders, baseline_cpsnrs, config)
  File "src/train.py", line 179, in trainAndGetBestModel
    reference=hrs[:, offset:(offset + 128), offset:(offset + 128)].view(-1, 1, 128, 128))
RuntimeError: view size is not compatible with input tensor's size and stride (at least one dimension spans across two contiguous subspaces). Use .reshape(...) instead

If I change it as the error suggests, I get tensors of incompatible sizes at this point. The expected inputs are:

lrs: tensor (batch size, views, W, H), images to shift
reference: tensor (batch size, W, H), reference images to shift

Instead, they are 64,1,16,16 and 1,1,128,128.
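On the `.view` error quoted above: `.view` never copies data, so it only works when the tensor's memory layout can be described under the new shape, and a cropped tensor is usually not contiguous. The sketch below is plain Python (no PyTorch required) with hypothetical helper names. It models only the contiguity check, which is a sufficient condition for `.view` rather than PyTorch's exact rule, and the (64, 96, 96) and (64, 16, 16) shapes are assumptions chosen for illustration.

```python
def contiguous_strides(shape):
    """Row-major (C-order) strides, in elements, for a dense tensor of `shape`."""
    strides = []
    step = 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def is_contiguous(shape, strides):
    """True if `strides` describe a dense row-major layout for `shape`.

    Dimensions of size 1 are skipped because their stride is never used.
    """
    expected = contiguous_strides(shape)
    return all(s == e for size, s, e in zip(shape, strides, expected) if size != 1)


# Assumed example: a batch of 64 high-resolution patches of 96 x 96 pixels.
full_shape = (64, 96, 96)
full_strides = contiguous_strides(full_shape)

# Cropping rows and columns changes the shape but keeps the parent's strides.
crop_shape = (64, 16, 16)
crop_strides = full_strides

print(is_contiguous(full_shape, full_strides))
print(is_contiguous(crop_shape, crop_strides))
```

In a case like the crop here, `.reshape` falls back to copying the data, which is why it gets past this line even though `.view` does not.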

Thank you for your code!

alkalait commented 3 years ago

Thanks for raising the issue, @margokhokhlova

Was this during following the steps in the README for the PROBA-V competition dataset? Or can you expand a bit more on your context?

margokhokhlova commented 3 years ago

Thank you for the quick answer! I just cloned the repo and would like to train the model so I can test it on new data. I followed the readme and ran save_clearance. The data notebook works fine. Then I tried to train the model using the provided config file, so with patch_size = 32. At train.py L174, the shape of src is B,1,96,96, where 96 is patch_size*3. In the next step, however, the register_batch function fails because it tries to concatenate incompatible shapes. The shapes of lrs and reference at the input of register_batch are:

64,1,16,16
1,1,128,128

alkalait commented 3 years ago

New dataset, as in different from the PROBA-V competition data?

In which case, can you please provide more info about this other dataset?

Please also provide a snippet of the code where the error occurs.

margokhokhlova commented 3 years ago

I am sorry, I didn't write it clearly. I would love to test the algorithm on new data, but so far for training I use the dataset from here, as advised in the readme: https://kelvins.esa.int/proba-v-super-resolution/data/

Here is the code snippet:

python src/train.py --config config/config.json
/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at  /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
  0%|          | 0/400 [00:00<?, ?it/s]
torch.Size([64, 1, 16, 16])
  0%|          | 0/17 [00:00<?, ?it/s]
torch.Size([1, 1, 128, 128])
  0%|          | 0/17 [00:13<?, ?it/s]
  0%|          | 0/400 [00:13<?, ?it/s]
Traceback (most recent call last):
  File "src/train.py", line 310, in <module>
    main(config)
  File "src/train.py", line 295, in main
    trainAndGetBestModel(fusion_model, regis_model, optimizer, dataloaders, baseline_cpsnrs, config)
  File "src/train.py", line 180, in trainAndGetBestModel
    reference=hrs[:, offset:(offset + 128), offset:(offset + 128)].reshape(-1, 1, 128, 128))
  File "src/train.py", line 41, in register_batch
    theta = shiftNet(torch.cat([reference, lrs[:, i : i + 1]], 1))
RuntimeError: Sizes of tensors must match except in dimension 1. Got 1 and 64 in dimension 0 (The offending index is 1)
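The final error in this log follows from the concatenation rule: every dimension except the one being concatenated must match. The sketch below restates that rule in plain Python so the two printed shapes can be checked without PyTorch. `cat_shape` is a hypothetical helper written for this illustration, not part of the repository or of PyTorch.

```python
def cat_shape(shapes, dim):
    """Return the shape that concatenating `shapes` along `dim` would produce.

    Raises ValueError if any other dimension differs, mirroring the rule
    stated in the RuntimeError message.
    """
    first = tuple(shapes[0])
    for shape in shapes[1:]:
        if len(shape) != len(first):
            raise ValueError("all tensors must have the same number of dimensions")
        for axis, (a, b) in enumerate(zip(first, shape)):
            if axis != dim and a != b:
                raise ValueError(
                    f"sizes must match except in dimension {dim}: "
                    f"got {a} and {b} in dimension {axis}"
                )
    total = sum(shape[dim] for shape in shapes)
    return first[:dim] + (total,) + first[dim + 1:]


# Shapes printed in the log: reference first, then one view of lrs.
reference = (1, 1, 128, 128)
lr_view = (64, 1, 16, 16)

try:
    cat_shape([reference, lr_view], dim=1)
except ValueError as err:
    print(err)

# Two tensors that agree on batch and spatial size concatenate along channels.
print(cat_shape([(64, 1, 128, 128), (64, 1, 128, 128)], dim=1))
```

The check stops at dimension 0 (1 versus 64), the same dimension the traceback reports, although the spatial sizes (128 versus 16) would fail as well.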

alkalait commented 3 years ago

Have you made any progress in debugging this yourself?

It will be a while till I get to reproduce your error.

margokhokhlova commented 3 years ago

No. I can see that the problem is in the sizes, but when I fix it manually, more errors follow further down the pipeline. Overall, my feeling is that the problem comes from the hard-coded offset + 128 in the input to the register_batch function inside the training loop.
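As an illustration of how a hard-coded 128-pixel window could produce exactly the shapes reported above, the sketch below works through the slice arithmetic in plain Python. It assumes the offset is computed as `(3 * patch_size - 128) // 2` to centre the window; that formula is an assumption and is not quoted anywhere in this thread, and `crop_len` is a hypothetical helper. Whether this is what actually happened here is not confirmed by the thread.

```python
WINDOW = 128  # the hard-coded crop size from the traceback


def crop_len(size, offset, window=WINDOW):
    """Number of elements selected by [offset:offset + window] on an axis of `size`.

    Uses Python slice semantics: a negative start counts from the end and an
    out-of-range stop is clamped.
    """
    return len(range(size)[offset:offset + window])


def crop_for(patch_size, scale=3):
    """Return (offset, cropped side length) under the assumed centring formula."""
    hr_size = scale * patch_size
    offset = (hr_size - WINDOW) // 2  # assumed formula, see the note above
    return offset, crop_len(hr_size, offset)


for patch_size in (64, 32):
    offset, side = crop_for(patch_size)
    print(patch_size, offset, side)

# With a 16 x 16 crop, a batch of 64 holds as many elements as a single
# 128 x 128 image, so reshape(-1, 1, 128, 128) yields a batch of size 1.
batch = 64
_, side = crop_for(32)
print((batch * side * side) // (WINDOW * WINDOW))
```

Under these assumptions, patch_size = 64 gives a full 128-pixel crop, while patch_size = 32 gives a negative offset and a 16-pixel crop, which matches the 64,1,16,16 and 1,1,128,128 shapes in the report.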

margokhokhlova commented 3 years ago

I am closing the issue. It works on a GPU; the problem only appears when running on a CPU, and I have not been able to find out why. Thank you very much for your answers!

alkalait commented 3 years ago

Thanks for resolving this. It wasn’t obvious to me at all that CPU vs GPU was the issue.