facebookresearch / darkforestGo

DarkForest, the Facebook Go engine.

crash in training #30

Closed by FengliLin 7 years ago

FengliLin commented 7 years ago

I tried training on the KGS data with train.sh. I installed the most recent version of Torch and CUDA 8.0. Training ends almost immediately with the error below:

{
  nstep = 3,
  optim = "supervised",
  loss = "policy",
  progress = false,
  nthread = 4,
  model_name = "model-12-parallel-384-n-output-bn",
  data_augmentation = true,
  actor = "policy",
  nGPU = 1,
  sampling = "replay",
  intermediate_step = 50,
  userank = true,
  alpha = 0.05,
  num_forward_models = 2048,
  batchsize = 256,
  epoch_size_test = 128000,
  feature_type = "extended",
  epoch_size = 128000,
  datasource = "kgs"
}
fm_init: function: 0x40af2138
fm_gen: function: 0x41d64210
fm_postprocess: nil
rl.Dataset.init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.init(): #forward model = 2048, batchsize = 256
rl.Dataset.init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.init(): #forward model = 2048, batchsize = 256
rl.Dataset.init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.init(): #forward model = 2048, batchsize = 256
rl.Dataset.init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.init(): #forward model = 2048, batchsize = 256
rl.Dataset.init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.init(): #forward model = 2048, batchsize = 256
rl.Dataset.init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.init(): #forward model = 2048, batchsize = 256
rl.Dataset.init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.init(): #forward model = 2048, batchsize = 256
rl.Dataset.init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.init(): #forward model = 2048, batchsize = 256
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-6693/cutorch/lib/THC/generic/THCStorage.cu line=65 error=2 : out of memory
/home/lin/torch/install/bin/luajit: /home/lin/torch/install/share/lua/5.1/nn/Container.lua:67:
In 1 module of nn.Sequential:
In 5 module of nn.Sequential:
/home/lin/torch/install/share/lua/5.1/cudnn/Pointwise.lua:15: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-6693/cutorch/lib/THC/generic/THCStorage.cu:65
stack traceback:
[C]: in function 'resizeAs'
/home/lin/torch/install/share/lua/5.1/cudnn/Pointwise.lua:15: in function 'createIODescriptors'
/home/lin/torch/install/share/lua/5.1/cudnn/Pointwise.lua:41: in function </home/lin/torch/install/share/lua/5.1/cudnn/Pointwise.lua:40>
[C]: in function 'xpcall'
/home/lin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/lin/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </home/lin/torch/install/share/lua/5.1/nn/Sequential.lua:41>
[C]: in function 'xpcall'
/home/lin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
/home/lin/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./train/rl_framework/infra/bundle.lua:161: in function 'forward'
./train/rl_framework/infra/agent.lua:46: in function 'optimize'
./train/rl_framework/infra/engine.lua:114: in function 'train'
./train/rl_framework/infra/framework.lua:304: in function 'run_rl'
train.lua:155: in main chunk
[C]: in function 'dofile'
.../lin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405d50

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
[C]: in function 'error'
/home/lin/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
/home/lin/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
./train/rl_framework/infra/bundle.lua:161: in function 'forward'
./train/rl_framework/infra/agent.lua:46: in function 'optimize'
./train/rl_framework/infra/engine.lua:114: in function 'train'
./train/rl_framework/infra/framework.lua:304: in function 'run_rl'
train.lua:155: in main chunk
[C]: in function 'dofile'
.../lin/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:145: in main chunk
[C]: at 0x00405d50

yuandong-tian commented 7 years ago
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-6693/cutorch/lib/THC/generic/THCStorage.cu line=65 error=2 : out of memory

Your GPU is running out of memory. Try reducing the number of threads (nthread) and the batch size (batchsize).
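
As a rough illustration of why a smaller batchsize helps, here is a minimal Python sketch of how the size of a single layer's output grows with the batch size. The numbers are assumptions, not taken from the darkforestGo code: 384 channels is only a guess from the model name "model-12-parallel-384-n-output-bn", the 19x19 board and float32 storage are assumed, and `activation_bytes` and `to_mib` are hypothetical helper names.

```python
# Rough activation-memory estimate. All layer shapes are illustrative
# assumptions, not values read from the darkforestGo model definition.

BYTES_PER_FLOAT32 = 4


def activation_bytes(batchsize, channels=384, board_size=19):
    """Bytes for one float32 tensor of shape (batchsize, channels, board, board)."""
    return batchsize * channels * board_size * board_size * BYTES_PER_FLOAT32


def to_mib(num_bytes):
    """Convert a byte count to mebibytes."""
    return num_bytes / 2**20


if __name__ == "__main__":
    # Halving the batch size halves the memory needed for each layer output.
    for batchsize in (256, 128, 64):
        per_layer = activation_bytes(batchsize)
        print(f"batchsize={batchsize}: ~{to_mib(per_layer):.1f} MiB per layer output")
```

Under these assumptions a single layer output at batchsize = 256 takes about 135 MiB. Real usage is considerably higher, because training keeps the outputs and gradients of every layer plus library workspace, so this only shows the linear scaling, not the actual footprint.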

FengliLin commented 7 years ago

Solved. Thx a lot!