facebookresearch / darkforestGo

DarkForest, the Facebook Go engine.

How much memory do I need to train #37

Open HongweiQin opened 7 years ago

HongweiQin commented 7 years ago

Hi,

First of all, thanks for your nice work.

I was trying to run your Go engine on my server, which has about 120 GiB of memory.

It went fine until I tried to train with the provided dataset.

The output was as follows:

[root@localhost darkforestGo]# ./train.sh
{
  nstep = 3,
  optim = "supervised",
  loss = "policy",
  progress = false,
  nthread = 4,
  model_name = "model-12-parallel-384-n-output-bn",
  data_augmentation = true,
  actor = "policy",
  nGPU = 1,
  sampling = "replay",
  intermediate_step = 50,
  userank = true,
  alpha = 0.05,
  num_forward_models = 2048,
  batchsize = 256,
  epoch_size_test = 128000,
  feature_type = "extended",
  epoch_size = 128000,
  datasource = "kgs"
}   
fm_init: function: 0x4076e7c8   
fm_gen: function: 0x410f4a58    
fm_postprocess: nil 
rl.Dataset.__init(): forward_model_init is set, run it
rl.Dataset.__init(): forward_model_init is set, run it
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
| IndexedDataset: loaded ./dataset with 144748 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
rl.Dataset.__init(): forward_model_init is set, run it
| IndexedDataset: loaded ./dataset with 26814 examples
rl.Dataset.__init(): #forward model = 2048, batchsize = 256
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-4547/cutorch/lib/THC/generic/THCStorage.cu line=66 error=2 : out of memory
/root/torch/install/bin/luajit: /root/torch/install/share/lua/5.1/nn/Container.lua:67: 
In 1 module of nn.Sequential:
In 9 module of nn.Sequential:
/root/torch/install/share/lua/5.1/nn/THNN.lua:110: cuda runtime error (2) : out of memory at /tmp/luarocks_cutorch-scm-1-4547/cutorch/lib/THC/generic/THCStorage.cu:66
stack traceback:
    [C]: in function 'v'
    /root/torch/install/share/lua/5.1/nn/THNN.lua:110: in function 'BatchNormalization_updateOutput'
    /root/torch/install/share/lua/5.1/nn/BatchNormalization.lua:124: in function </root/torch/install/share/lua/5.1/nn/BatchNormalization.lua:113>
    [C]: in function 'xpcall'
    /root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function </root/torch/install/share/lua/5.1/nn/Sequential.lua:41>
    [C]: in function 'xpcall'
    /root/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    /root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./train/rl_framework/infra/bundle.lua:161: in function 'forward'
    ./train/rl_framework/infra/agent.lua:46: in function 'optimize'
    ./train/rl_framework/infra/engine.lua:114: in function 'train'
    ./train/rl_framework/infra/framework.lua:304: in function 'run_rl'
    train.lua:155: in main chunk
    [C]: in function 'dofile'
    /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x004064f0

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
    [C]: in function 'error'
    /root/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
    /root/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'forward'
    ./train/rl_framework/infra/bundle.lua:161: in function 'forward'
    ./train/rl_framework/infra/agent.lua:46: in function 'optimize'
    ./train/rl_framework/infra/engine.lua:114: in function 'train'
    ./train/rl_framework/infra/framework.lua:304: in function 'run_rl'
    train.lua:155: in main chunk
    [C]: in function 'dofile'
    /root/torch/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
    [C]: at 0x004064f0

I ran the "free" command before training. The output was:

[root@localhost darkforestGo]# free
              total        used        free      shared  buff/cache   available
Mem:      115383448     1317128   112506528       10744     1559792   113786336
Swap:      67108860           0    67108860
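Note that `free` only reports host RAM. The trace above reports `cuda runtime error (2) : out of memory` from cutorch, which means the GPU's device memory is exhausted, not the machine's. As a rough back-of-envelope sketch (the ~25 input feature planes here are a hypothetical figure for illustration, not taken from the repo), the input batch itself is tiny; the bulk of GPU memory goes to the network's activations and gradients:

```python
# Rough estimate of the *input* tensor size for one training batch.
# The board is 19x19; the number of feature planes for the "extended"
# feature_type is assumed to be ~25 here (hypothetical; check the repo).

def input_batch_bytes(batchsize, planes, board=19, dtype_bytes=4):
    """Bytes needed for one float32 input batch of Go feature planes."""
    return batchsize * planes * board * board * dtype_bytes

b = input_batch_bytes(256, 25)
print(f"{b} bytes ~= {b / 2**20:.1f} MiB")
```

Even at `batchsize = 256` the inputs are only a few MiB, so it is the intermediate activations of the 12-layer, 384-filter model that overflow a small GPU, regardless of how much host RAM is free.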

It seems that I'm facing an "out of memory" issue (a CUDA runtime error, per the trace above).

May I ask how much memory I need to train?

Or, is there anything wrong elsewhere?

Thanks in advance

HongweiQin commented 7 years ago

I tried modifying train.sh by changing the nthread parameter from 4 to 1, but it didn't work.
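That is consistent with the error: `nthread` controls CPU-side data-loading threads, while GPU memory scales mainly with `batchsize`, since activations are held per sample for every layer. A hedged sketch of that scaling, assuming activation memory is roughly linear in batch size (the layer and filter counts follow the printed config; the absolute number is only indicative):

```python
# Forward-activation memory is roughly linear in batchsize:
# batch x filters x 19 x 19 floats per conv layer, summed over layers.
# 12 layers and 384 filters follow the printed config above; the
# linear scaling is the point, not the exact total.

def activation_bytes(batchsize, layers=12, filters=384, board=19, dtype_bytes=4):
    """Approximate bytes of float32 forward activations for one batch."""
    return batchsize * layers * filters * board * board * dtype_bytes

full = activation_bytes(256)
half = activation_bytes(128)
print(f"batchsize 256: {full / 2**30:.2f} GiB of forward activations")
print(f"batchsize 128: {half / 2**30:.2f} GiB (exactly half)")
```

Gradients roughly double this figure, so a batch of 256 could plausibly overflow a GPU with only a few GiB of device memory. Halving `batchsize` in train.sh is therefore a likelier fix than reducing `nthread`.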