Hi,
First of all, thanks for your nice work.
I was trying to run your Go engine on my server, which has about 120 GiB of memory.
Everything went fine until I tried to train with the provided dataset.
The output is as follows:
[root@localhost darkforestGo]# ./train.sh
{
nstep = 3,
optim = "supervised",
loss = "policy",
progress = false,
nthread = 4,
model_name = "model-12-parallel-384-n-output-bn",
data_augmentation = true,
actor = "policy",
nGPU = 1,
sampling = "replay",
intermediate_step = 50,
userank = true,
alpha = 0.05,
num_forward_models = 2048,
batchsize = 256,
epoch_size_test = 128000,
feature_type = "extended",
epoch_size = 128000,
datasource = "kgs"
}
fm_init: function: 0x4076e7c8
fm_gen: function: 0x410f4a58
fm_postprocess: nil
rl.Dataset.__init(): forward_model_init is set, run it
rl.Dataset.__init(): forward_model_init is set, run it
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-4547/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
In 9 module of nn.Sequential:
/root/torch/install/share/lua/5.1/nn/THNN.lua:110: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-4547/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
[C]: in function 'v'
/root/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'BatchNormalization_updateOutput'
/root/torch/install/share/lua/5.1/nn/BatchNormalization.lua:124: in function </root/torch/install/share/lua/5.1/nn/BatchNormalization.lua:113>
[C]: in function 'xpcall'
/root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </root/torch/install/share/lua/5.1/nn/Sequential.lua:41>
[C]: in function 'xpcall'
/root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./train/rl_framework/infra/bundle.lua:161: in function 'forward'
./train/rl_framework/infra/agent.lua:46: in function 'optimize'
./train/rl_framework/infra/engine.lua:114: in function 'train'
./train/rl_framework/infra/framework.lua:304: in function 'run_rl'
train.lua:155: in main chunk
[C]: in function 'dofile'
/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x004064f0
WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/root/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./train/rl_framework/infra/bundle.lua:161: in function 'forward'
./train/rl_framework/infra/agent.lua:46: in function 'optimize'
./train/rl_framework/infra/engine.lua:114: in function 'train'
./train/rl_framework/infra/framework.lua:304: in function 'run_rl'
train.lua:155: in main chunk
[C]: in function 'dofile'
/root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x004064f0
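For context, here is a back-of-envelope estimate of the forward-activation memory this configuration implies. The batch size and channel count are from the option table above; everything else (19x19 boards, float32 storage, one 384-channel output map kept per layer of the 12-layer model) is my own assumption, so treat this as a rough sketch only:

-- Back-of-envelope activation estimate; batchsize and channel count are
-- taken from the printed options, everything else is assumed.
local batchsize = 256            -- from the option table above
local channels  = 384            -- from model-12-parallel-384-n-output-bn
local layers    = 12             -- ditto
local board     = 19 * 19        -- assuming standard 19x19 Go boards
local bytes     = batchsize * channels * layers * board * 4  -- float32
print(string.format("~%.2f GiB of forward activations", bytes / 2^30))
-- Backward buffers and BatchNormalization statistics roughly double this,
-- so a GPU with only a few GiB free could plausibly fail at batchsize 256.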
I ran the "free" command before training. It turned out like this:
[root@localhost darkforestGo]# free
total used free shared buff/cache available
Mem: 115383448 1317128 112506528 10744 1559792 113786336
Swap: 67108860 0 67108860
It seems that I'm facing an "out of memory" issue.
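That said, the THCudaCheck line comes from cutorch, so if I understand correctly it is the GPU's own memory that ran out, while "free" above only reports host RAM. Device memory can be queried from the th REPL with cutorch, for example:

-- Minimal check of device memory from the Torch REPL (th).
require 'cutorch'
local dev = cutorch.getDevice()
local freeBytes, totalBytes = cutorch.getMemoryUsage(dev)
print(string.format("GPU %d: %.2f GiB free of %.2f GiB total",
                    dev, freeBytes / 2^30, totalBytes / 2^30))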
May I ask how much memory I need to train?
Or is there anything wrong elsewhere?
Thanks in advance!