Closed. wizofe closed this issue 4 years ago.
I also encountered the same problem. I think the current implementation of VarNet has multiple bugs in how it works with PyTorch Lightning's interface. I basically removed all of the PyTorch Lightning code and got it to run. I hope the authors can fix this some time.
Hello, I just ran this same script successfully on my system:
python varnet.py --resolution 320 --mode train --challenge multicoil --exp var_net --mask-type random --data-path /my/data/path
The only difference was that I copied varnet.py into the main directory. I used pytorch_lightning 0.7.6 and torch 1.5.
Typically when I see this error it's because the hparams variable has something in it that is not an int, float, str, bool, or torch.Tensor. Did you alter the hyperparameters anywhere besides the command line?
Thanks, I will try this soon. But pytorch_lightning 0.7.6 is different from the version in the requirements file, I guess.
Upgrading to 0.7.6 fixed the problem.
Great. I'm going to be going through the repository soon and trying to clean up a few things, including requirements.txt.
@wizofe Let me know if you have any updates on your issue.
Hi Matthew, thanks for your quick reply here. First things first, your suggested pytorch_lightning version solved that problem (FYI, upgrading to the latest version breaks compatibility; you may want to have a look at this if you are cleaning up).
Now I am getting a different error, though: CUDA is out of memory, even though I do have enough memory available. It seems this is an open issue on PyTorch.
The exact error is:
RuntimeError: CUDA out of memory. Tried to allocate 28.00 MiB (GPU 0; 7.93 GiB total capacity; 7.18 GiB already allocated; 27.56 MiB free; 7.36 GiB reserved in total by PyTorch)
although nvidia-smi gives the following (and I've tried to use both and either of those GPUs, getting the exact same error):
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00 Driver Version: 440.64.00 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro M4000 On | 00000000:03:00.0 Off | N/A |
| 49% 46C P8 11W / 120W | 93MiB / 8118MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 TITAN V On | 00000000:04:00.0 Off | N/A |
| 32% 46C P8 28W / 250W | 0MiB / 12066MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
I will investigate further but thanks for solving that one!
ioannis
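As a side note, a quick way to cross-check what PyTorch itself sees on each device, independent of nvidia-smi, is sketched below. These are standard torch.cuda calls, not anything specific to this repo:

```python
import torch

# Print, for every visible GPU, its name and total/allocated/reserved memory.
# Note: torch.cuda.memory_reserved requires torch >= 1.4
# (older releases expose it as torch.cuda.memory_cached).
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    total_gb = props.total_memory / 1e9
    alloc_gb = torch.cuda.memory_allocated(i) / 1e9
    reserved_gb = torch.cuda.memory_reserved(i) / 1e9
    print(f"GPU {i}: {props.name} | total {total_gb:.1f} GB | "
          f"allocated {alloc_gb:.2f} GB | reserved {reserved_gb:.2f} GB")
```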
I saw that as well. This model is pretty heavy on memory, even on my 16 GB GPU. Perhaps they prototyped it on a 32 GB GPU.
I was able to get past this error by decreasing the size of the model, e.g. --num-cascades 4.
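For reference, combining that flag with the command earlier in this thread would look something like this (same flags as before, only --num-cascades added, with the data path as a placeholder):
python varnet.py --resolution 320 --mode train --challenge multicoil --exp var_net --mask-type random --num-cascades 4 --data-path /my/data/path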
Thank you @mmuckley for your quick response and prompt help🙏🏼
I get the following ValueError when I attempt to run the VarNet. Any idea why? I am using the NYU multi-coil knee dataset, but only a limited subset (10 training h5py files). I have in my environment pytorch-lightning 0.6.0 and torch 1.3.1 with torchvision 0.4.2. This is what I am using to train:
python models/varnet/varnet.py --resolution 320 --mode train --challenge multicoil --exp var_net --mask-type random --data-path /media/iva19/multicoil_train/
and that's the error: