dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

Embedding input contains data out of bound when sparse_grad=True #938

Open mohammedkhalilia opened 5 years ago

mohammedkhalilia commented 5 years ago

Description

When setting sparse_grad=True in mxnet.gluon.nn.Embedding() I get an error.

Error Message

The error is: Check failed: is_valid: Embedding input contains data out of bound

Full traceback is below:

Traceback (most recent call last):
  File "/home/ubuntu/workspace/src/models/train.py", line 73, in <module>
    main()
  File "/home/ubuntu/workspace/src/models/train.py", line 63, in main
    model.train(train_dataloader, val_dataloader, test_dataloader, ctx)
  File "BaseModel.py", line 53, in train
    epoch_loss = self.epoch()
  File "/home/ubuntu/workspace/src/models/BaseModel.py", line 120, in epoch
    return epoch_loss.asscalar()
  File "/env/mx_1.5_gnlp_0.8/local/lib/python3.5/site-packages/mxnet/ndarray/ndarray.py", line 2014, in asscalar
    return self.asnumpy()[0]
  File "/env/mx_1.5_gnlp_0.8/local/lib/python3.5/site-packages/mxnet/ndarray/ndarray.py", line 1996, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/env/mx_1.5_gnlp_0.8/local/lib/python3.5/site-packages/mxnet/base.py", line 253, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [19:41:00] src/operator/tensor/indexing_op.cu:284: Check failed: is_valid: Embedding input contains data out of bound

[bt] (0) /env/mx_1.5_gnlp_0.8/local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x4b04cb) [0x7fbdc45f84cb]
[bt] (1) /env/mx_1.5_gnlp_0.8/local/lib/python3.5/site-packages/mxnet/libmxnet.so(void mxnet::op::SparseEmbeddingDeterministicKernelLaunch<int, float, long>(mxnet::OpContext const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::OpReqType, mxnet::NDArray const&)+0x246) [0x7fbdc8a613d6]
[bt] (2) /env/mx_1.5_gnlp_0.8/local/lib/python3.5/site-packages/mxnet/libmxnet.so(mxnet::op::SparseEmbeddingOpBackwardDeterministicRspImpl(mxnet::OpContext const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::OpReqType, mxnet::NDArray const&)+0x1b4b) [0x7fbdc8ab434b]
[bt] (3) /env/mx_1.5_gnlp_0.8/local/lib/python3.5/site-packages/mxnet/libmxnet.so(void mxnet::op::SparseEmbeddingOpBackwardRspImpl(bool, mxnet::OpContext const&, mxnet::TBlob const&, mxnet::TBlob const&, mxnet::OpReqType, mxnet::NDArray const&)+0x2f4) [0x7fbdc8ab5084]
[bt] (4) /env/mx_1.5_gnlp_0.8/local/lib/python3.5/site-packages/mxnet/libmxnet.so(void mxnet::op::EmbeddingOpBackwardEx(nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator > const&, std::vector<mxnet::OpReqType, std::allocator > const&, std::vector<mxnet::NDArray, std::allocator > const&)+0x6dc) [0x7fbdc8abaa1c]
[bt] (5) /env/mx_1.5_gnlp_0.8/local/lib/python3.5/site-packages/mxnet/libmxnet.so(std::_Function_handler<void (mxnet::RunContext), mxnet::imperative::PushFComputeEx(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::NDArray, std::allocator > const&, std::vector<mxnet::OpReqType, std::allocator > const&, std::vector<mxnet::NDArray, std::allocator > const&)> const&, nnvm::Op const, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var, std::allocator<mxnet::engine::Var> > const&, std::vector<mxnet::engine::Var, std::allocator<mxnet::engine::Var> > const&, std::vector<mxnet::Resource, std::allocator > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray> > const&, std::vector<mxnet::NDArray, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::OpReqType, std::allocator > const&)::{lambda(mxnet::RunContext)#1}>::_M_invoke(std::_Any_data const&, mxnet::RunContext)+0x9f) [0x7fbdc67a2f2f]
[bt] (6) /env/mx_1.5_gnlp_0.8/local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x25b5459) [0x7fbdc66fd459]
[bt] (7) /env/mx_1.5_gnlp_0.8/local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x25c1ce1) [0x7fbdc6709ce1]
[bt] (8) /env/mx_1.5_gnlp_0.8/local/lib/python3.5/site-packages/mxnet/libmxnet.so(+0x25c51f0) [0x7fbdc670d1f0]

To Reproduce


Steps to reproduce

  1. Set sparse_grad=True in the embedding as follows:
    char_embedding = mxnet.gluon.nn.Embedding(
                    config.char_size,
                    config.char_emb_dim,
                    sparse_grad=True,
                    prefix='char_embed_')
  2. The model is initialized using mxnet.init.Xavier()
  3. The embeddings are initialized from glove.6B.50d.txt
  4. hybridize(active=False) is called on the model

I can provide a simple end-to-end script if needed.

Environment

----------Python Info----------
Version      : 3.5.2
Compiler     : GCC 5.4.0 20160609
Build        : ('default', 'Nov 12 2018 13:43:14')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 19.2.2
Directory    : /env/mx_1.5_gnlp_0.8/local/lib/python3.5/site-packages/pip
----------MXNet Info-----------
Version      : 1.5.0
Directory    : /env/mx_1.5_gnlp_0.8/local/lib/python3.5/site-packages/mxnet
Num GPUs     : 8
Commit Hash   : 75a9e187d00a8b7ebc71412a02ed0e3ae489d91f
----------System Info----------
Platform     : Linux-4.4.0-1090-aws-x86_64-with-Ubuntu-16.04-xenial
system       : Linux
node         : ip-172-31-30-122
release      : 4.4.0-1090-aws
version      : #101-Ubuntu SMP Fri Aug 2 15:21:01 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                64
On-line CPU(s) list:   0-63
Thread(s) per core:    2
Core(s) per socket:    16
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               1202.109
CPU max MHz:           3000.0000
CPU min MHz:           1200.0000
BogoMIPS:              4600.15
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-15,32-47
NUMA node1 CPU(s):     16-31,48-63
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq monitor est ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx xsaveopt ida

----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0025 sec, LOAD: 0.6521 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0004 sec, LOAD: 0.4032 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0004 sec, LOAD: 0.0453 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0004 sec, LOAD: 0.0216 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0003 sec, LOAD: 0.0177 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0005 sec, LOAD: 0.0575 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0013 sec, LOAD: 0.1462 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.0004 sec, LOAD: 0.1874 sec.
leezu commented 5 years ago

Hi @mohammedkhalilia, please provide an end-to-end example. My guess is that one of your input elements is >= config.char_size. Can you double-check that all inputs are < config.char_size?
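
One way to run the check suggested above is sketched here. It is a minimal illustration using NumPy only, not code from the reporter's project: the helper name find_out_of_bound_indices, the example batch, and the value of char_size are all hypothetical. It reports every index that falls outside the valid range [0, input_dim), which also covers negative values.

```python
import numpy as np


def find_out_of_bound_indices(indices, input_dim):
    """Return the sorted unique values in `indices` outside [0, input_dim)."""
    arr = np.asarray(indices)
    # An index is invalid if it is negative or not smaller than input_dim.
    mask = (arr < 0) | (arr >= input_dim)
    return np.unique(arr[mask])


# Hypothetical vocabulary size and batch of character indices.
char_size = 5
batch = np.array([[0, 1, 4], [2, 5, 7]])

bad = find_out_of_bound_indices(batch, char_size)
if bad.size > 0:
    print("Out-of-bound indices:", bad.tolist())
```

With an MXNet NDArray batch, the values could first be copied to NumPy with asnumpy() (the same call that appears in the traceback) before being passed to the helper. Running the check over every batch of the training, validation, and test data should show whether any input reaches config.char_size or above.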