facebookresearch / fastMRI

A large-scale dataset of both raw MRI measurements and clinical MRI images.
https://fastmri.org
MIT License

RuntimeError: CUDA out of memory while training VarNet #137

Closed adithyaOvGu closed 3 years ago

adithyaOvGu commented 3 years ago

Hello,

Since mid-2020 I have been using the fastMRI project, modifying the subsampling scripts to accommodate a custom undersampling pattern and to compare its reconstructions against variable-density (VarDen) and equispaced undersampled data. Training and testing the UNet model on the remote GPU server has not been a problem; everything works well (see the image below).

[Figures: ground-truth image and UNet reconstruction]
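For context, here is a minimal sketch of the kind of custom mask function I plug into the subsampling code. The structure (a fully sampled center band plus random outer lines) mirrors fastMRI's masking convention, but the function name and parameter values here are illustrative, not my actual in-house pattern:

```python
import numpy as np

def custom_mask(num_cols: int, center_fraction: float = 0.08,
                acceleration: int = 4, seed: int = 0) -> np.ndarray:
    """Illustrative 1D k-space sampling mask over phase-encode columns:
    a fully sampled low-frequency center plus uniformly random outer lines."""
    rng = np.random.RandomState(seed)
    num_low_freqs = int(round(num_cols * center_fraction))
    # Choose the outer-line probability so the overall rate is ~1/acceleration.
    prob = (num_cols / acceleration - num_low_freqs) / (num_cols - num_low_freqs)
    mask = rng.uniform(size=num_cols) < prob
    pad = (num_cols - num_low_freqs + 1) // 2
    mask[pad:pad + num_low_freqs] = True  # keep the center of k-space
    return mask
```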

During training of the VarNet model, I encounter the following error after 7-9 iterations of the first epoch:

```
RuntimeError: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 0; 10.76 GiB total capacity; 9.73 GiB already allocated; 11.76 MiB free; 9.89 GiB reserved in total by PyTorch)
```

I checked whether anyone had raised a similar issue; the closest I could find was https://github.com/facebookresearch/fastMRI/issues/44#issuecomment-649439714, and the partial solution suggested in https://github.com/facebookresearch/fastMRI/issues/44#issuecomment-649546413 was to decrease the size of the model, e.g. --num-cascades 4.

I followed the suggestion and the model now trains without any errors, but the results look bad (see the image below), presumably because of the reduced model size?

[Figure: VarNet reconstruction with the reduced number of cascades]

I am training the model on 150 volumes of the multi-coil brain dataset for 50 epochs. How should I tackle this problem? Any suggestions or solutions to overcome this issue would be much appreciated.

Ever since I pulled the project in mid-2020, I have been working with the same versions of the Python libraries suggested in the requirements.txt file.

Environment: Python 3, torch 1.5.0, pytorch-lightning 0.7.6, torchvision 0.6.0

Desktop (remote server): OS: Manjaro Linux 64-bit / Linux 5.10.2-2-MANJARO; Graphics: GeForce RTX 2080 Ti, 10 GB

mmuckley commented 3 years ago

This is interesting. You would expect worse performance when decreasing cascades, but not this bad. Usually when I see a result this bad it is because of data preprocessing or not setting model.eval(). Could either of these be the issue in this case?

adithyaOvGu commented 3 years ago

> This is interesting. You would expect worse performance when decreasing cascades, but not this bad. Usually when I see a result this bad it is because of data preprocessing or not setting model.eval(). Could either of these be the issue in this case?

Well, since I am using the fastMRI datasets, I am following the default data preprocessing steps. As for model.eval(), I do not see it mentioned anywhere in the scripts. Where could I add it in the following varnet.py run function?

```python
def run(args):
    cudnn.benchmark = True
    cudnn.enabled = True
    if args.mode == 'train':
        trainer = create_trainer(args)
        model = VariationalNetworkModel(args)
        trainer.fit(model)
    else:  # args.mode == 'test' or args.mode == 'challenge'
        assert args.checkpoint is not None
        model = VariationalNetworkModel.load_from_checkpoint(str(args.checkpoint))
        model.hparams = args
        model.hparams.sample_rate = 1.
        trainer = create_trainer(args)
        trainer.test(model)
```

Also, does the CUDA out-of-memory error occur because the model is too computationally heavy to run on one 10 GB GPU, or could it be caused by memory leaks? Could this be resolved by updating PyTorch and pytorch-lightning to v1.0+?
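For what it's worth, here is the kind of minimal snippet I could use to tell a steady leak from a one-off spike (standard torch.cuda calls; memory_reserved assumes torch >= 1.4; nothing fastMRI-specific):

```python
import torch

def log_gpu_memory(step: int) -> None:
    """Print allocated vs. reserved CUDA memory. A steady climb across
    steps suggests a leak; a sharp early spike suggests activation peaks."""
    alloc_mib = torch.cuda.memory_allocated() / 2**20
    reserved_mib = torch.cuda.memory_reserved() / 2**20
    print(f"step {step}: allocated {alloc_mib:.0f} MiB, "
          f"reserved {reserved_mib:.0f} MiB")
```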

mmuckley commented 3 years ago

> Well, since I am using the fastMRI datasets, I am following the default data preprocessing steps. As for model.eval(), I do not see it mentioned anywhere in the scripts. Where could I add it in the following varnet.py run function?

If you are using PyTorch Lightning then this is automatically done for you.
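In plain PyTorch you would call it yourself around inference; Lightning's validation and test loops effectively do the equivalent of the following (argument names illustrative):

```python
model.eval()           # disable dropout, use batchnorm running statistics
with torch.no_grad():  # skip autograd bookkeeping during inference
    output = model(masked_kspace, mask)
```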

> Also, does the CUDA out-of-memory error occur because the model is too computationally heavy to run on one 10 GB GPU, or could it be caused by memory leaks? Could this be resolved by updating PyTorch and pytorch-lightning to v1.0+?

No, the model is just really memory-greedy due to the activations. We use 32 GB GPUs for training the leaderboard model. I thought I went through and tried to reduce dangling references and such, but there might still be ways to improve the code. Outside of that, decreasing memory usage would require changing the model.
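One example of such a model change is gradient checkpointing, which recomputes activations during the backward pass instead of storing them. Here is a minimal sketch using torch.utils.checkpoint; the Cascade class is a hypothetical stand-in, not our actual cascade module:

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint


class Cascade(nn.Module):
    """Hypothetical stand-in for one unrolled VarNet cascade."""
    def __init__(self, chans: int = 32):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(chans, chans, 3, padding=1), nn.ReLU(),
            nn.Conv2d(chans, chans, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.layers(x)


class CheckpointedUnroll(nn.Module):
    """Runs each cascade under checkpointing, so activations inside a
    cascade are recomputed during backward rather than kept in memory."""
    def __init__(self, num_cascades: int = 12):
        super().__init__()
        self.cascades = nn.ModuleList([Cascade() for _ in range(num_cascades)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for cascade in self.cascades:
            x = checkpoint(cascade, x)
        return x
```

The trade-off is roughly one extra forward pass of compute per cascade in exchange for a much flatter activation-memory profile.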

adithyaOvGu commented 3 years ago

> Well, since I am using the fastMRI datasets, I am following the default data preprocessing steps. As for model.eval(), I do not see it mentioned anywhere in the scripts. Where could I add it in the following varnet.py run function?
>
> If you are using PyTorch Lightning then this is automatically done for you.
>
> Also, does the CUDA out-of-memory error occur because the model is too computationally heavy to run on one 10 GB GPU, or could it be caused by memory leaks? Could this be resolved by updating PyTorch and pytorch-lightning to v1.0+?
>
> No, the model is just really memory-greedy due to the activations. We use 32 GB GPUs for training the leaderboard model. I thought I went through and tried to reduce dangling references and such, but there might still be ways to improve the code. Outside of that, decreasing memory usage would require changing the model.

So, with the current state of the model, are my options limited to training on multiple GPUs / GPUs with more memory, or enabling 16-bit precision in the Trainer class, so that I do not have to reduce the number of cascades?

mmuckley commented 3 years ago

It depends on your goals.

I think if you want to reproduce the results of the leaderboard you need to train the model with 32-bit floats on a 32 GB GPU. If you don't have a 32 GB GPU, there might be some model parallelism things you have to do. Once you start talking about 16-bit precision, it's a different model and you can change everything. Also keep in mind that if you want to reproduce results, you can download one of our pretrained models.
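For reference, the 16-bit switch itself is just a Trainer flag; a minimal sketch follows (with the caveat that on pytorch-lightning 0.7.6, precision=16 goes through NVIDIA Apex, while native AMP only arrived around Lightning 1.0):

```python
import pytorch_lightning as pl

# VariationalNetworkModel and args as in the varnet.py snippet above.
model = VariationalNetworkModel(args)

trainer = pl.Trainer(
    gpus=1,
    precision=16,   # train with 16-bit floats instead of 32-bit
    max_epochs=50,
)
trainer.fit(model)
```

But again, once you flip that switch you are no longer training the leaderboard model.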

adithyaOvGu commented 3 years ago

> It depends on your goals.
>
> I think if you want to reproduce the results of the leaderboard you need to train the model with 32-bit floats on a 32 GB GPU. If you don't have a 32 GB GPU, there might be some model parallelism things you have to do. Once you start talking about 16-bit precision, it's a different model and you can change everything. Also keep in mind that if you want to reproduce results, you can download one of our pretrained models.

Thanks for the quick response, Mr. Muckley.

My goal is to train the models on data undersampled with our custom (in-house) pattern as well as with the VarDen and equispaced masks, and to compare the quality of the reconstructed images using the standard metrics. For that I have to train the models on the custom mask, so I am not sure that pre-trained models would help :/

I will try to get access to servers/machines with a 32 GB GPU. If that is not possible, could I train the model on multiple GPUs by setting the Lightning backend to distributed_backend='dp' or 'ddp' in the Trainer()?

Also, what do you mean by "you can change everything" if precision=16 is set in Trainer()? What changes does the model need?

mmuckley commented 3 years ago

@adithyaOvGu, unfortunately that won't work. dp and ddp are for data parallelism, not model parallelism. You might want to read more here.
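To make the distinction concrete: dp/ddp replicate the whole model on every GPU and split the batch across them, whereas model parallelism splits the layers of one model across GPUs. A minimal, illustrative sketch in plain PyTorch (assumes two visible GPUs; not fastMRI code):

```python
import torch
from torch import nn


class TwoGPUModel(nn.Module):
    """Illustrative model parallelism: the first half of the network
    lives on cuda:0 and the second half on cuda:1, so a single forward
    pass spreads its activations across both GPUs."""
    def __init__(self):
        super().__init__()
        self.front = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
        ).to('cuda:0')
        self.back = nn.Conv2d(32, 1, 3, padding=1).to('cuda:1')

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.front(x.to('cuda:0'))
        return self.back(x.to('cuda:1'))  # hop activations between devices
```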

adithyaOvGu commented 3 years ago

> @adithyaOvGu, unfortunately that won't work. dp and ddp are for data parallelism, not model parallelism. You might want to read more here.

Thanks a lot for the support and the suggestions. I will take a look at the link and learn how to do model parallelism.

So, just to confirm once more: I can still work with the older torch 1.5.0, pytorch-lightning 0.7.6, and torchvision 0.6.0 versions, right? I just have to train the model on a machine with a 32 GB GPU or use model parallelism.

mmuckley commented 3 years ago

For the exact supported versions, please see requirements.txt. You can probably use other versions, but we make no guarantees.

adithyaOvGu commented 3 years ago

> For the exact supported versions, please see requirements.txt. You can probably use other versions, but we make no guarantees.

Thanks a lot for the suggestions and support, Mr. Muckley. I am now working on adding DDP with model parallelism to the script, and I will post the outcome here. Should we consider this issue closed?