facebookresearch / fairseq-lua

Facebook AI Research Sequence-to-Sequence Toolkit

train the example model error: Segmentation fault #101

Open Lingogo opened 7 years ago

Lingogo commented 7 years ago

Hi, when I train the de-en model with the command from the GitHub README, I get the following error:

| [en] Dictionary: 24738 types
| [de] Dictionary: 35474 types
| IndexedDataset: loaded data-bin/iwslt14.tokenized.de-en with 160215 examples
| IndexedDataset: loaded data-bin/iwslt14.tokenized.de-en with 7282 examples
| IndexedDataset: loaded data-bin/iwslt14.tokenized.de-en with 6750 examples
| IndexedDataset: loaded data-bin/iwslt14.tokenized.de-en with 7282 examples
| IndexedDataset: loaded data-bin/iwslt14.tokenized.de-en with 6750 examples
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-9601/cutorch/lib/THC/generic/THCTensorMath.cu line=26 error=77 : an illegal memory access was encountered
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-9601/cutorch/lib/THC/generic/THCStorage.cu line=66 error=77 : an illegal memory access was encountered
/home/yulinlin/torch/install/bin/luajit: ...yulinlin/torch/install/share/lua/5.1/threads/threads.lua:183: [thread 6 callback] /home/yulinlin/torch/install/share/lua/5.1/nn/Container.lua:67: 
In 3 module of nn.Sequential:
/home/yulinlin/torch/install/share/lua/5.1/nn/Dropout.lua:26: Creating MTGP constants failed. at /tmp/luarocks_cutorch-scm-1-9601/cutorch/lib/THC/THCTensorRandom.cu:33
stack traceback:
    [C]: in function 'bernoulli'
    /home/yulinlin/torch/install/share/lua/5.1/nn/Dropout.lua:26: in function </home/yulinlin/torch/install/share/lua/5.1/nn/Dropout.lua:17>
    [C]: in function 'xpcall'
    /home/yulinlin/torch/install/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
    ...e/yulinlin/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'func'
    ...yulinlin/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
    ...yulinlin/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'func'
    ...yulinlin/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
    ...yulinlin/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
    ...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:370: in function <...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:347>
    [C]: in function 'xpcall'
    ...yulinlin/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
    ...e/yulinlin/torch/install/share/lua/5.1/threads/queue.lua:65: in function <...e/yulinlin/torch/install/share/lua/5.1/threads/queue.lua:41>
    [C]: in function 'pcall'
    ...e/yulinlin/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
    [string "  local Queue = require 'threads.queue'..."]:13: in main chunk

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
    [C]: in function 'error'
    /home/yulinlin/torch/install/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
    ...e/yulinlin/torch/install/share/lua/5.1/nn/Sequential.lua:44: in function 'func'
    ...yulinlin/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
    ...yulinlin/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'func'
    ...yulinlin/torch/install/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
    ...yulinlin/torch/install/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
    ...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:370: in function <...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:347>
    [C]: in function 'xpcall'
    ...yulinlin/torch/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
    ...e/yulinlin/torch/install/share/lua/5.1/threads/queue.lua:65: in function <...e/yulinlin/torch/install/share/lua/5.1/threads/queue.lua:41>
    [C]: in function 'pcall'
    ...e/yulinlin/torch/install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
    [string "  local Queue = require 'threads.queue'..."]:13: in main chunk
stack traceback:
    [C]: in function 'error'
    ...yulinlin/torch/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
    ...yulinlin/torch/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
    ...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:385: in function 'doTrain'
    ...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:189: in function 'train'
    ...in/torch/install/share/lua/5.1/fairseq/scripts/train.lua:410: in main chunk
    [C]: in function 'require'
    ...rch/install/lib/luarocks/rocks/fairseq/scm-1/bin/fairseq:17: in main chunk
    [C]: at 0x00406670
Segmentation fault

Does anyone know what might cause this?

jgehring commented 7 years ago

The backtrace points to an error in the nn.Dropout module. I can only guess, but are you perhaps running out of GPU memory? Does your GPU work well for other use cases?
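
For reference, the log above contains two `THCudaCheck FAIL` lines (error 77, "an illegal memory access was encountered") before the Dropout error is raised. A minimal Python sketch for pulling those lines out of a saved log is shown here. The function name `find_cuda_failures` and the `SAMPLE` variable are hypothetical, and the regular expression assumes the exact line format seen in this log.

```python
import re

# Matches lines in the format seen in this log, e.g.
# THCudaCheck FAIL file=<path> line=26 error=77 : an illegal memory access was encountered
THC_FAIL = re.compile(
    r"THCudaCheck FAIL file=(?P<file>\S+) line=(?P<line>\d+) "
    r"error=(?P<code>\d+) : (?P<message>.*)"
)


def find_cuda_failures(log_text):
    """Return one dict per THCudaCheck failure, in the order they appear."""
    failures = []
    for raw in log_text.splitlines():
        match = THC_FAIL.search(raw)
        if match:
            failures.append({
                # Keep only the file name, not the temporary build path.
                "file": match.group("file").rsplit("/", 1)[-1],
                "line": int(match.group("line")),
                "code": int(match.group("code")),
                "message": match.group("message").strip(),
            })
    return failures


# Two failure lines copied from the log in this issue, plus the final line.
SAMPLE = """\
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-9601/cutorch/lib/THC/generic/THCTensorMath.cu line=26 error=77 : an illegal memory access was encountered
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-9601/cutorch/lib/THC/generic/THCStorage.cu line=66 error=77 : an illegal memory access was encountered
Segmentation fault
"""

if __name__ == "__main__":
    for failure in find_cuda_failures(SAMPLE):
        print(failure["file"], failure["line"], failure["code"], failure["message"])
```

Listing the failures in order makes it easier to see which CUDA call failed first. Here that is in `THCTensorMath.cu`, ahead of the `nn.Dropout` error in the Lua traceback.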

Lingogo commented 7 years ago

The computer has enough GPU memory to train the model, but I think the error may still be caused by the GPU environment, because everything works when I switch to another computer. I will look into the error. Thanks a lot, @jgehring.