facebookresearch / fairseq-lua

Facebook AI Research Sequence-to-Sequence Toolkit

Training crashing due to assertion errors #81

Closed by eenin 7 years ago

eenin commented 7 years ago

Hi, I am trying to train a fully convolutional model on an en-fr pair, but it keeps crashing about 30 minutes into training. Here is the stack trace of the assertion errors I am getting:

| epoch 000 | 0001000 updates | words/s    3269| trainloss     8.94 | train ppl   490.14
| epoch 000 | 0002000 updates | words/s    3375| trainloss     6.66 | train ppl   101.14
| epoch 000 | 0003000 updates | words/s    3341| trainloss     5.79 | train ppl    55.33
| epoch 000 | 0004000 updates | words/s    3347| trainloss     5.22 | train ppl    37.19
| epoch 000 | 0005000 updates | words/s    3404| trainloss     4.85 | train ppl    28.83
| epoch 000 | 0006000 updates | words/s    3331| trainloss     4.56 | train ppl    23.60
| epoch 000 | 0007000 updates | words/s    3301| trainloss     4.34 | train ppl    20.20
| epoch 000 | 0008000 updates | words/s    3343| trainloss     4.15 | train ppl    17.70
| epoch 000 | 0009000 updates | words/s    3383| trainloss     4.00 | train ppl    15.99
| epoch 000 | 0010000 updates | words/s    3408| trainloss     3.87 | train ppl    14.64
| epoch 000 | 0011000 updates | words/s    3372| trainloss     3.79 | train ppl    13.80
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [32,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [33,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [34,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [35,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [36,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [37,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [38,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [39,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [40,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [41,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [42,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [43,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [44,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [45,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [46,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [47,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [48,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [49,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [50,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [51,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [52,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [53,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [54,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [55,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [56,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [57,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [58,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [59,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [60,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [61,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [62,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/THCTensorIndex.cu:321: void indexSelectLargeIndex(TensorInfo<T, IndexType>, TensorInfo<T, IndexType>, TensorInfo<long, IndexType>, int, int, IndexType, IndexType, long) [with T = float, IndexType = unsigned int, DstDi
m = 2, SrcDim = 2, IdxDim = -2]: block: [97,0,0], thread: [63,0,0] Assertion `srcIndex < srcSelectDimSize` failed.
THCudaCheck FAIL file=/tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/generic/THCTensorCopy.c line=18 error=59 : device-side assert triggered
/usr/local/bin/luajit: /usr/local/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] /usr/local/share/lua/5.1/nn/Container.lua:67: 
In 1 module of nn.Sequential:
/usr/local/share/lua/5.1/nn/Bottle.lua:22: cuda runtime error (59) : device-side assert triggered at /tmp/luarocks_cutorch-scm-1-5891/cutorch/lib/THC/generic/THCTensorCopy.c:18
stack traceback:
        [C]: in function 'copy'
        /usr/local/share/lua/5.1/nn/Bottle.lua:22: in function </usr/local/share/lua/5.1/nn/Bottle.lua:14>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/nn/Container.lua:63: in function 'rethrowErrors'
        /usr/local/share/lua/5.1/nn/Sequential.lua:44: in function 'func'
        /usr/local/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
        /usr/local/share/lua/5.1/nngraph/gmodule.lua:380: in function 'func'
        /usr/local/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
        /usr/local/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
        ...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:356: in function <...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:333>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/threads/threads.lua:234: in function 'callback'
        /usr/local/share/lua/5.1/threads/queue.lua:65: in function </usr/local/share/lua/5.1/threads/queue.lua:41>
        [C]: in function 'pcall'
        /usr/local/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
        [string "  local Queue = require 'threads.queue'..."]:13: in main chunk

WARNING: If you see a stack trace below, it doesn't point to the place where this error occurred. Please use only the one above.
stack traceback:
        [C]: in function 'error'
        /usr/local/share/lua/5.1/nn/Container.lua:67: in function 'rethrowErrors'
        /usr/local/share/lua/5.1/nn/Sequential.lua:44: in function 'func'
        /usr/local/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
        /usr/local/share/lua/5.1/nngraph/gmodule.lua:380: in function 'func'
        /usr/local/share/lua/5.1/nngraph/gmodule.lua:345: in function 'neteval'
        /usr/local/share/lua/5.1/nngraph/gmodule.lua:380: in function 'forward'
        ...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:356: in function <...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:333>
        [C]: in function 'xpcall'
        /usr/local/share/lua/5.1/threads/threads.lua:234: in function 'callback'
        /usr/local/share/lua/5.1/threads/queue.lua:65: in function </usr/local/share/lua/5.1/threads/queue.lua:41>
        [C]: in function 'pcall'
        /usr/local/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
        [string "  local Queue = require 'threads.queue'..."]:13: in main chunk
stack traceback:
        [C]: in function 'error'
        /usr/local/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
        /usr/local/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
        ...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:371: in function 'doTrain'
        ...hare/lua/5.1/fairseq/torchnet/ResumableDPOptimEngine.lua:189: in function 'train'
        /usr/local/share/lua/5.1/fairseq/scripts/train.lua:404: in main chunk
        [C]: in function 'require'
        /usr/local/lib/luarocks/rocks/fairseq/scm-1/bin/fairseq:17: in main chunk
        [C]: at 0x004057a0

I trained another language pair successfully, so I cannot understand why this one is not working.

michaelauli commented 7 years ago

Did you check that you have no sentences longer than 1024 words? Please see https://github.com/facebookresearch/fairseq/issues/57
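
For anyone wanting to run that check, here is a minimal sketch, assuming one sentence per line and whitespace-separated words. The function name `find_long_sentences` is hypothetical and not part of fairseq; the default of 1024 is simply the limit mentioned above. The failed assertion (`srcIndex < srcSelectDimSize` in `indexSelectLargeIndex`) reports an out-of-range index in a lookup, which would be consistent with an over-long sentence, though the trace alone does not prove it.

```python
from typing import Iterable, List, Tuple

MAX_WORDS = 1024  # limit mentioned in the comment above


def find_long_sentences(
    lines: Iterable[str], max_words: int = MAX_WORDS
) -> List[Tuple[int, int]]:
    """Return (1-based line number, word count) for each line that has
    more than max_words whitespace-separated words."""
    offenders = []
    for line_no, line in enumerate(lines, start=1):
        n_words = len(line.split())  # split() ignores the trailing newline
        if n_words > max_words:
            offenders.append((line_no, n_words))
    return offenders
```

It can be called on an open file handle for each side of the corpus, e.g. `find_long_sentences(open("train.en", encoding="utf-8"))`, where `train.en` is a placeholder file name. An empty result for both the source and the target file means no sentence exceeds the limit under this word-counting assumption.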

eenin commented 7 years ago

Yeah, that was the problem. Thanks!

happygirl123456 commented 7 years ago

I have hit the same issue, but I don't know how to limit the sentence length in the training files. Could you give me some help? Thanks! @michaelauli

michaelauli commented 7 years ago

This script removes sentence pairs whose source or target side is longer than a specified number of words: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/training/clean-corpus-n.perl
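
If Perl is not at hand, the same idea can be sketched in Python. This is not a port of the Moses script, which may apply further checks that are not reproduced here; it only drops pairs in which either side falls outside a word-count range. The names `filter_pairs` and `clean_corpus`, the file paths passed to them, and the whitespace-based word count are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, Tuple


def filter_pairs(
    src_lines: Iterable[str],
    tgt_lines: Iterable[str],
    min_words: int = 1,
    max_words: int = 1024,
) -> Iterator[Tuple[str, str]]:
    """Yield only the (source, target) pairs in which both sides have
    between min_words and max_words whitespace-separated words."""
    # zip() stops at the shorter input, so both files are assumed to be
    # line-aligned and of equal length.
    for src, tgt in zip(src_lines, tgt_lines):
        src_len = len(src.split())
        tgt_len = len(tgt.split())
        if min_words <= src_len <= max_words and min_words <= tgt_len <= max_words:
            yield src, tgt


def clean_corpus(
    src_path: str,
    tgt_path: str,
    out_src_path: str,
    out_tgt_path: str,
    max_words: int = 1024,
) -> int:
    """Write the filtered pairs to two new files and return how many were kept."""
    kept = 0
    with open(src_path, encoding="utf-8") as src_in, \
         open(tgt_path, encoding="utf-8") as tgt_in, \
         open(out_src_path, "w", encoding="utf-8") as src_out, \
         open(out_tgt_path, "w", encoding="utf-8") as tgt_out:
        for src, tgt in filter_pairs(src_in, tgt_in, max_words=max_words):
            # Normalise line endings so each kept sentence ends with one newline.
            src_out.write(src.rstrip("\n") + "\n")
            tgt_out.write(tgt.rstrip("\n") + "\n")
            kept += 1
    return kept
```

Filtering both sides together matters: removing a long line from only one file would shift every later sentence out of alignment with its translation. The cleaned files then need to go through preprocessing again before training.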