allenai / satlas-super-resolution

Apache License 2.0
Using multiple GPUs to train HighResNet on the PROBA-V dataset #8

Closed Laymanpython closed 6 months ago

Laymanpython commented 7 months ago

Hi! Dear authors, when I use multiple GPUs to train HighResNet on the PROBA-V dataset, I hit this error: "RuntimeError: Function BroadcastBackward returned an invalid gradient at index 6 - got [0] but expected shape compatible with [1, 0, 3, 3]". I'm looking for a way to fix it. If you have a solution, please tell me. Thanks~

piperwolters commented 6 months ago

Can you please share the config file you're using and the structure of your downloaded PROBA-V dataset?

Laymanpython commented 6 months ago

Thanks for your reply. The config file I used was probav_highresnet.yml, and I only changed the data path and the dataset file. The PROBA-V dataset I used is from RAMS [https://github.com/EscVM/RAMS/tree/master/probav_data], and I only use the train and validation sets. There was no problem when I ran ESRGAN. Thanks again!

[Three attached screenshots]

Laymanpython commented 6 months ago

I noticed that your paper doesn't give results for the PROBA-V and MuS2 datasets. Will you release them in the future? I also noticed that you use opencv to read the PROBA-V images, so every pixel is an int in [0, 255]. But when I read them with skimage, every pixel is an int in [0, 65535]. If you convert an int in [0, 65535] to [0, 255], it becomes a float, not an int. Will processing the data this way cause a loss of accuracy? Thanks~
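To make the precision question concrete, here is a minimal sketch of what is lost when a 16-bit sample is squeezed into 8 bits. It assumes a plain linear rescale followed by rounding, which may not be exactly what opencv does internally, and the function names are made up for illustration.

```python
def to_uint8(value_16bit: int) -> int:
    """Linearly rescale a 16-bit sample to [0, 255] and round to an integer.

    Assumption: a simple linear rescale; a real image reader may convert differently.
    """
    return round(value_16bit * 255 / 65535)


def to_unit_float(value_16bit: int) -> float:
    """Rescale a 16-bit sample to a float in [0, 1] without rounding to 8 bits."""
    return value_16bit / 65535


# Count how many distinct levels survive each conversion.
levels_8bit = len({to_uint8(v) for v in range(65536)})
levels_float = len({to_unit_float(v) for v in range(65536)})
print(levels_8bit, levels_float)

# Two different 16-bit values that become the same 8-bit value.
print(to_uint8(1000), to_uint8(1100))
```

Under this assumption roughly 257 neighbouring 16-bit values collapse into each 8-bit level, while keeping the data as floats preserves all of them. Whether that difference affects the final metrics is an empirical question.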

Laymanpython commented 6 months ago

And when I set os.environ['CUDA_VISIBLE_DEVICES'] = "1,2,3,4" in train.py, I found that "cuda:0" is still used. Should I set it in the terminal instead?

piperwolters commented 6 months ago

Here are screenshots of my PROBA-V data directory. It looks similar to yours, but I took a subset of the imgsets in train/NIR/ and moved them to train/NIR/val/. Looks like you changed the code to work with your setup already. I just ran probav_esrgan.yml and probav_highresnet.yml (from the experiments branch) in my environment and both are working. I'm not sure what is causing the error you are getting.

[Two screenshots of the PROBA-V data directory, dated 2024-01-23]

piperwolters commented 6 months ago

I was worried about the precision loss from reading with opencv too. I ran parallel experiments, one using opencv and the other using skimage. The ultimate performance was very similar. I originally switched to opencv because of some memory leakage issues that I had with skimage, but maybe the risk of accuracy loss is not worth it.

And I did not report results of MuS2 because it is just a test set. I think I will try evaluating on MuS2 with models pretrained on different datasets (S2NAIP, WorldStrat, ...) but I don't know if that will tell us anything interesting.

piperwolters commented 6 months ago

Ah, yeah. For the GPU selection, I would just export CUDA_VISIBLE_DEVICES=1,2,3,4 in the terminal.
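A note on the "cuda:0" observation: CUDA_VISIBLE_DEVICES has to be in place before CUDA is initialized in the process, and the visible GPUs are then renumbered from zero, so "cuda:0" inside the process refers to the first GPU in the list. A minimal sketch of that renumbering follows; the helper name logical_to_physical is made up for illustration and is not part of this repository.

```python
import os

# Set this before the first `import torch` (or any other CUDA initialization);
# setting it later generally has no effect on the running process.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2,3,4"


def logical_to_physical(visible_devices: str) -> dict:
    """Map in-process device names to the physical GPU ids listed in the variable.

    Hypothetical helper, assuming a comma-separated list of integer ids.
    """
    ids = [int(part) for part in visible_devices.split(",") if part.strip()]
    return {f"cuda:{logical}": physical for logical, physical in enumerate(ids)}


mapping = logical_to_physical(os.environ["CUDA_VISIBLE_DEVICES"])
for name, gpu in mapping.items():
    print(f"{name} -> physical GPU {gpu}")
```

So seeing "cuda:0" in logs does not by itself mean physical GPU 0 is in use; checking nvidia-smi while training runs is one way to confirm which cards are busy.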

Laymanpython commented 6 months ago

Thanks for your reply! Perhaps the errors I hit were caused by my environment. With Python 3.8, all packages install successfully, but with Python 3.9 they don't. Thanks again. I will post the solution in this issue as soon as I find one, haha~

Laymanpython commented 6 months ago

Could you please tell me the version numbers in your environment? And why is the cPSNR you report in the paper so low? When I train on PROBA-V, the validation cPSNR reaches almost 44. [attached screenshot]

Laymanpython commented 6 months ago

Hi, sorry to bother you so many times, haha. I have solved it! The HighResNet class inherits from the SRCNN class, so there are some blocks that don't take part in HighResNet's training, and their gradients will be None. Inheriting directly from nn.Module, with every block defined in HighResNet's own file, fixes it. Perhaps my meaning is not clear; my English is poor. This link describes the fix: https://blog.csdn.net/YuhsiHu/article/details/134185740
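For anyone who wants to check their own model for this situation, here is a minimal diagnostic sketch. The Base and Child classes are hypothetical stand-ins, not the repository's SRCNN and HighResNet, and the zero-element parameter is only a guess at what the shape [1, 0, 3, 3] in the error message might correspond to. It lists parameters that receive no gradient after one backward pass, and parameters that contain no elements at all.

```python
import torch
from torch import nn


class Base(nn.Module):
    """Hypothetical parent class that defines a block its child never calls."""

    def __init__(self):
        super().__init__()
        self.resize = nn.Conv2d(3, 3, kernel_size=3, padding=1)


class Child(Base):
    """Hypothetical child: inherits `resize` but only uses `body` in forward()."""

    def __init__(self):
        super().__init__()
        # Assumption for illustration: a weight with zero input channels.
        self.mask_weight = nn.Parameter(torch.empty(1, 0, 3, 3))
        self.body = nn.Conv2d(3, 3, kernel_size=3, padding=1)

    def forward(self, x):
        return self.body(x)


def params_without_grad(model: nn.Module, example_input: torch.Tensor) -> list:
    """Run one forward/backward pass and name the parameters left without a gradient."""
    model.zero_grad(set_to_none=True)
    model(example_input).sum().backward()
    return [n for n, p in model.named_parameters() if p.requires_grad and p.grad is None]


def empty_params(model: nn.Module) -> list:
    """Name the parameters that contain zero elements."""
    return [n for n, p in model.named_parameters() if p.numel() == 0]


model = Child()
unused = params_without_grad(model, torch.randn(1, 3, 8, 8))
empty = empty_params(model)
print("no gradient:", unused)
print("zero elements:", empty)
```

Running the same two checks on the real model on a single GPU should show which blocks are candidates for removal before wrapping it for multi-GPU training.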

piperwolters commented 6 months ago

On the cPSNR question: in the paper, I used a crop border of 0, but realized I should use a crop border of 4 to match previous works. I believe all cPSNR values should be higher with the larger crop border.
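As a rough illustration of why the crop border can change the score, here is a minimal sketch using plain PSNR on a toy image. It is not the cPSNR implementation from the paper or the repository (the real metric involves additional steps that are not reproduced here), and the toy data is constructed so that errors are larger near the edges.

```python
import math


def psnr(sr, hr, crop_border=0, max_value=1.0):
    """Plain PSNR over 2D lists, ignoring `crop_border` pixels on every side."""
    b = crop_border
    height, width = len(hr), len(hr[0])
    squared_errors = [
        (sr[y][x] - hr[y][x]) ** 2
        for y in range(b, height - b)
        for x in range(b, width - b)
    ]
    mse = sum(squared_errors) / len(squared_errors)
    return float("inf") if mse == 0 else 10 * math.log10(max_value ** 2 / mse)


# Toy 12x12 example: error 0.01 in the centre, error 0.1 within 4 pixels of the edge.
size, border = 12, 4
hr = [[0.5] * size for _ in range(size)]
sr = [
    [
        0.5 + (0.01 if border <= y < size - border and border <= x < size - border else 0.1)
        for x in range(size)
    ]
    for y in range(size)
]

psnr_full = psnr(sr, hr, crop_border=0)
psnr_cropped = psnr(sr, hr, crop_border=4)
print(psnr_full, psnr_cropped)
```

In this constructed case the cropped score is higher because the worst pixels are excluded. If errors were spread evenly over the image, cropping would make little difference.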

piperwolters commented 6 months ago

Oh I wonder if it is because HighResNet inherits the self.resize() module from SRCNN, but does not use it...

Laymanpython commented 6 months ago

emmm. I think it was caused by mask_encoder and some other blocks:

    self.mask_encoder = nn.Sequential(
        OneHot(num_classes=12),
        DoubleConv2d(in_channels=self.mask_channels, out_channels=1, kernel_size=3),
        nn.Sigmoid(),
    )

    # Fusion
    self.doubleconv2d = DoubleConv2d(
        in_channels=hidden_channels * revisits,  # revisits as channels
        out_channels=hidden_channels,
        kernel_size=kernel_size,
        use_batchnorm=self.use_batchnorm,
    )
    self.residualblocks = nn.Sequential(
        *(
            ResidualBlock(
                in_channels=hidden_channels,
                kernel_size=kernel_size,
                use_batchnorm=self.use_batchnorm,
            )
            for _ in range(residual_layers)
        )
    )

So I rewrote HighResNet to inherit from nn.Module:

    """
    Adapted from: https://github.com/worldstrat/worldstrat/blob/main/src/modules.py
    Authors: Ivan Oršolić, Julien Cornebise, Ulf Mertens, Freddie Kalaitzis
    """
    from basicsr.utils.registry import ARCH_REGISTRY
    from .srcnn_arch import SRCNN
    from .arch_util import *
    from kornia.geometry.transform import Resize

    @ARCH_REGISTRY.register()
    class HighResNet(nn.Module):
        """
        High-resolution CNN.
        Inherits as many elements from SRCNN as possible for as fair a comparison:

piperwolters commented 6 months ago

Ah, okay. I am trying to figure out why I can't replicate your error, but thank you for sharing your code - I will try it and make sure it produces similar results.

Laymanpython commented 6 months ago

The Python version I used is 3.8, and I ran the code yesterday: all results are similar. Thank you for open-sourcing your work, it is excellent!~

piperwolters commented 6 months ago

Awesome, thank you for digging into the code! I'll push a fix to the error you mention, with your change.