I tried training on the KGS data with train.sh, after installing the most recent version of Torch and CUDA 8.0.
Training stops almost immediately with a CUDA out-of-memory error, as shown below:
{
nstep = 3,
optim = "supervised",
loss = "policy",
progress = false,
nthread = 4,
model_name = "model-12-parallel-384-n-output-bn",
data_augmentation = true,
actor = "policy",
nGPU = 1,
sampling = "replay",
intermediate_step = 50,
userank = true,
alpha = 0.05,
num_forward_models = 2048,
batchsize = 256,
epoch_size_test = 128000,
feature_type = "extended",
epoch_size = 128000,
datasource = "kgs"
}
fm_init: function: 0x40af2138
fm_gen: function: 0x41d64210
fm_postprocess: nil
rl.Dataset.init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.init(): #forward model = 2048, batchsize = 256
rl.Dataset.init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.init(): #forward model = 2048, batchsize = 256
rl.Dataset.init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.init(): #forward model = 2048, batchsize = 256
rl.Dataset.init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.init(): #forward model = 2048, batchsize = 256
rl.Dataset.init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.init(): #forward model = 2048, batchsize = 256
rl.Dataset.init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.init(): #forward model = 2048, batchsize = 256
rl.Dataset.init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.init(): #forward model = 2048, batchsize = 256
rl.Dataset.init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.init(): #forward model = 2048, batchsize = 256
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-6693/cutorch/lib/THC/generic/THCStorage.cu line=65 error=2 : out of memory
/home/lin/torch/install/bin/luajit: /home/lin/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
In 5 module of nn.Sequential:
/home/lin/torch/install/share/lua/5.1/cudnn/Pointwise.lua:15: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-6693/cutorch/lib/THC/generic/THCStorage.cu:65
stack traceback:
[C]: in function 'resizeAs'
/home/lin/torch/install/share/lua/5.1/cudnn/Pointwise.lua:15: in function 'createIODescriptors'
/home/lin/torch/install/share/lua/5.1/cudnn/Pointwise.lua:41: in function </home/lin/torch/install/share/lua/5.1/cudnn/Pointwise.lua:40>
[C]: in function 'xpcall'
/home/lin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/lin/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </home/lin/torch/install/share/lua/5.1/nn/Sequential.lua:41>
[C]: in function 'xpcall'
/home/lin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/lin/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./train/rl_framework/infra/bundle.lua:161: in function 'forward'
./train/rl_framework/infra/agent.lua:46: in function 'optimize'
./train/rl_framework/infra/engine.lua:114: in function 'train'
./train/rl_framework/infra/framework.lua:304: in function 'run_rl'
train.lua:155: in main chunk
[C]: in function 'dofile'
.../lin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405d50
WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/lin/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/home/lin/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./train/rl_framework/infra/bundle.lua:161: in function 'forward'
./train/rl_framework/infra/agent.lua:46: in function 'optimize'
./train/rl_framework/infra/engine.lua:114: in function 'train'
./train/rl_framework/infra/framework.lua:304: in function 'run_rl'
train.lua:155: in main chunk
[C]: in function 'dofile'
.../lin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405d50