First off, thanks for making this, it looks great!

I downloaded the repo and I'm trying to run the examples to test things out before moving on. Unfortunately I'm running into a problem: almost immediately after training starts, CUDA runs out of memory. I'm running on a GTX 1050 with 4GB of RAM (about 3GB available to use for training), similar to the 980 you mentioned you were running on? I was just wondering if you had any ideas about what could be causing this issue. Full error message below.
python main.py --network_type rnn --dataset ptb --controller_optim adam --controller_lr 0.00035 --shared_optim sgd --shared_lr 20.0 --entropy_coeff 0.0001
2018-02-16 22:22:54,351:INFO::[*] Make directories : logs/ptb_2018-02-16_22-22-54
2018-02-16 22:22:59,204:INFO::# of parameters: 146,014,000
2018-02-16 22:22:59,315:INFO::[*] MODEL dir: logs/ptb_2018-02-16_22-22-54
2018-02-16 22:22:59,316:INFO::[*] PARAM path: logs/ptb_2018-02-16_22-22-54/params.json
train_shared: 0%| | 0/14524 [00:00<?, ?it/s]
/home/mjhutchinson/Documents/Machine Learning/ENAS-pytorch/models/controller.py:96: UserWarning: Implicit dimension choice for softmax has been deprecated. Change the call to include dim=X as an argument.
probs = F.softmax(logits)
/home/mjhutchinson/Documents/Machine Learning/ENAS-pytorch/models/controller.py:97: UserWarning: Implicit dimension choice for log_softmax has been deprecated. Change the call to include dim=X as an argument.
log_prob = F.log_softmax(logits)
THCudaCheck FAIL file=/opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/generic/THCStorage.cu line=58 error=2 : out of memory
Traceback (most recent call last):
File "main.py", line 45, in <module>
main(args)
File "main.py", line 34, in main
trainer.train()
File "/home/mjhutchinson/Documents/Machine Learning/ENAS-pytorch/trainer.py", line 87, in train
self.train_shared()
File "/home/mjhutchinson/Documents/Machine Learning/ENAS-pytorch/trainer.py", line 143, in train_shared
loss.backward()
File "/home/mjhutchinson/.conda/envs/pytorch/lib/python3.6/site-packages/torch/autograd/variable.py", line 167, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File "/home/mjhutchinson/.conda/envs/pytorch/lib/python3.6/site-packages/torch/autograd/__init__.py", line 99, in backward
variables, grad_variables, retain_graph)
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1518244421288/work/torch/lib/THC/generic/THCStorage.cu:58
If there's any other info that would be helpful, please let me know!
I am using a GTX 980 Ti, which has 6GB of memory. You can reduce --shared_embed and --shared_hid, which are the major factors controlling the required memory size.
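For example, a run with both sizes reduced might look like this (the values 200 and 400 below are just an illustration, not tuned settings; pick whatever fits your ~3GB budget):

python main.py --network_type rnn --dataset ptb --controller_optim adam --controller_lr 0.00035 --shared_optim sgd --shared_lr 20.0 --entropy_coeff 0.0001 --shared_embed 200 --shared_hid 400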
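Unrelated to the OOM, the softmax deprecation warnings in your log can be silenced by passing an explicit dim, as the warning itself suggests. A minimal sketch of the change (assuming the controller's logits are 2-D and you want to normalize over the last dimension):

import torch
import torch.nn.functional as F

logits = torch.randn(1, 4)                # stand-in for the controller's logits
probs = F.softmax(logits, dim=-1)         # explicit dim silences the warning
log_prob = F.log_softmax(logits, dim=-1)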