facebookresearch / fastMRI

A large-scale dataset of both raw MRI measurements and clinical MRI images.
https://fastmri.org
MIT License

ValueError: when running VarNet #44

Closed: wizofe closed this issue 4 years ago

wizofe commented 4 years ago

I get the following ValueError when I attempt to run VarNet. Any idea why? I am using the NYU multi-coil knee dataset, but only a limited subset (10 training h5py files). My environment has pytorch-lightning 0.6.0 and torch 1.3.1 with torchvision 0.4.2.

This is what I am using to train:

python models/varnet/varnet.py --resolution 320 --mode train --challenge multicoil --exp var_net --mask-type random --data-path /media/iva19/multicoil_train/

and that's the error:

INFO:root:gpu available: True, used: True
INFO:root:VISIBLE GPUS: 0
Traceback (most recent call last):
  File "models/varnet/varnet.py", line 374, in <module>
    main()
  File "models/varnet/varnet.py", line 371, in main
    run(args)
  File "models/varnet/varnet.py", line 342, in run
    trainer.fit(model)
  File "/home/iva19/usr/local/miniconda3/envs/fastMRI/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 687, in fit
    mp.spawn(self.ddp_train, nprocs=self.num_gpus, args=(model,))
  File "/home/iva19/usr/local/miniconda3/envs/fastMRI/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 171, in spawn
    while not spawn_context.join():
  File "/home/iva19/usr/local/miniconda3/envs/fastMRI/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 118, in join
    raise Exception(msg)
Exception: 

-- Process 0 terminated with the following error:
Traceback (most recent call last):
  File "/home/iva19/usr/local/miniconda3/envs/fastMRI/lib/python3.6/site-packages/torch/multiprocessing/spawn.py", line 19, in _wrap
    fn(i, *args)
  File "/home/iva19/usr/local/miniconda3/envs/fastMRI/lib/python3.6/site-packages/pytorch_lightning/trainer/distrib_data_parallel.py", line 331, in ddp_train
    self.run_pretrain_routine(model)
  File "/home/iva19/usr/local/miniconda3/envs/fastMRI/lib/python3.6/site-packages/pytorch_lightning/trainer/trainer.py", line 757, in run_pretrain_routine
    self.logger.log_hyperparams(ref_model.hparams)
  File "/home/iva19/usr/local/miniconda3/envs/fastMRI/lib/python3.6/site-packages/pytorch_lightning/logging/base.py", line 14, in wrapped_fn
    fn(self, *args, **kwargs)
  File "/home/iva19/usr/local/miniconda3/envs/fastMRI/lib/python3.6/site-packages/pytorch_lightning/logging/tensorboard.py", line 88, in log_hyperparams
    self.experiment.add_hparams(hparam_dict=params, metric_dict={})
  File "/home/iva19/usr/local/miniconda3/envs/fastMRI/lib/python3.6/site-packages/torch/utils/tensorboard/writer.py", line 292, in add_hparams
    exp, ssi, sei = hparams(hparam_dict, metric_dict)
  File "/home/iva19/usr/local/miniconda3/envs/fastMRI/lib/python3.6/site-packages/torch/utils/tensorboard/summary.py", line 156, in hparams
    raise ValueError('value should be one of int, float, str, bool, or torch.Tensor')
ValueError: value should be one of int, float, str, bool, or torch.Tensor
tianweiy commented 4 years ago

I ran into the same problem. I think the current implementation of varnet has multiple bugs in how it interfaces with PyTorch Lightning. I basically removed all the PyTorch Lightning code and got it to run. I hope the authors can fix this at some point.
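
For anyone trying the same workaround, here is a minimal sketch of a plain PyTorch training loop in place of the Lightning Trainer. The model and data below are toy stand-ins (an nn.Linear and random tensors), not the repo's VarNet or its dataset; the point is only the loop structure.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Toy stand-ins for the repo's VarNet model and its k-space dataset.
model = nn.Linear(320, 320).to(device)
loader = DataLoader(TensorDataset(torch.randn(8, 320), torch.randn(8, 320)), batch_size=2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.L1Loss()

model.train()
for epoch in range(2):
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()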

mmuckley commented 4 years ago

Hello, I just ran this same script successfully on my system:

python varnet.py --resolution 320 --mode train --challenge multicoil --exp var_net --mask-type random --data-path /my/data/path

The only difference was that I copied varnet.py into the main directory. I used pytorch_lightning 0.7.6 and torch 1.5.

Typically when I see this error, it's because the hparams variable contains something that is not an int, float, str, bool, or torch.Tensor. Did you alter the hyperparameters anywhere besides the command line?
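
Not repo code, but to illustrate the check: TensorBoard's add_hparams only accepts int, float, str, bool, or torch.Tensor values, so a quick way to find the offender is to list every hparams entry of any other type before the trainer logs them. A rough sketch, assuming hparams is an argparse.Namespace as produced by the training script:

import torch
from argparse import Namespace

def find_bad_hparams(hparams):
    # Return the entries whose values TensorBoard's add_hparams would reject.
    allowed = (int, float, str, bool, torch.Tensor)
    return {k: v for k, v in vars(hparams).items() if not isinstance(v, allowed)}

# For example, a pathlib.Path or a list value triggers exactly this ValueError.
print(find_bad_hparams(Namespace(lr=0.001, mode="train", data_path=["/media/iva19"])))
# -> {'data_path': ['/media/iva19']}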

tianweiy commented 4 years ago

Thanks. I will try this soon. But pytorch_lightning 0.7.6 is different from what the requirements file specifies, I guess.

tianweiy commented 4 years ago

Upgrading to 0.7.6 fixed the problem.
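
For anyone else following along, that is just the usual pip pin:

pip install pytorch-lightning==0.7.6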

mmuckley commented 4 years ago

Great. I'm going to be going through the repository soon and trying to clean up a few things, including requirements.txt.

mmuckley commented 4 years ago

@wizofe Let me know if you have any updates on your issue.

wizofe commented 4 years ago

Hi Matthew, thanks for your quick reply here. First things first, your suggested pytorch_lightning version solved that problem (FYI, upgrading to the latest version breaks compatibility; you may want to have a look at this if you are cleaning up).

Now I am getting a different error, though: CUDA is out of memory, even though I do have enough memory available. It seems that this is an open issue on PyTorch.

The exact error is:

RuntimeError: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 0; 7.93 GiB total capacity; 7.18 GiB already allocated; 27.56 MiB free; 7.36 GiB reserved in total by PyTorch)

although nvidia-smi gives the following (I've tried using both GPUs together as well as each one on its own, getting the exact same error):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Quadro M4000        On   | 00000000:03:00.0 Off |                  N/A |
| 49%   46C    P8    11W / 120W |     93MiB /  8118MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  TITAN V             On   | 00000000:04:00.0 Off |                  N/A |
| 32%   46C    P8    28W / 250W |      0MiB / 12066MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
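
One general way to pin the run to a single card (standard CUDA/PyTorch behavior, nothing specific to this script) is to mask the devices before launching, e.g. to expose only device 1:

CUDA_VISIBLE_DEVICES=1 python models/varnet/varnet.py --resolution 320 --mode train --challenge multicoil --exp var_net --mask-type random --data-path /media/iva19/multicoil_train/

Note that the index follows CUDA's own device enumeration, which does not necessarily match nvidia-smi's ordering unless CUDA_DEVICE_ORDER=PCI_BUS_ID is also set.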

I will investigate further but thanks for solving that one!

ioannis

mmuckley commented 4 years ago

I saw that as well. This model is pretty heavy on memory, even on my 16 GB GPU. Perhaps they prototyped it on a 32 GB GPU.

I was able to get past this error by decreasing the size of the model (e.g., --num-cascades 4).
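
For reference, that would make the earlier invocation something like:

python varnet.py --resolution 320 --mode train --challenge multicoil --exp var_net --mask-type random --num-cascades 4 --data-path /my/data/path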

wizofe commented 4 years ago

Thank you @mmuckley for your quick response and prompt help 🙏🏼