apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

LSTM w CTCLoss error with float16 #15196

Open charlieyou opened 5 years ago

charlieyou commented 5 years ago

Description

An LSTM followed by CTCLoss fails when the layer and its inputs are cast to float16.

Environment info (Required)

----------Python Info----------
Version      : 3.6.5
Compiler     : GCC 7.2.0
Build        : ('default', 'Apr 29 2018 16:14:56')
Arch         : ('64bit', '')
------------Pip Info-----------
Version      : 10.0.1
Directory    : /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/pip
----------MXNet Info-----------
Version      : 1.5.0
Directory    : /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet
Commit Hash   : 134a3e8cd36ee66426deedd3c8add6888378c043
----------System Info----------
Platform     : Linux-4.14.114-82.97.amzn1.x86_64-x86_64-with-glibc2.9
system       : Linux
node         : ip-10-10-82-87
release      : 4.14.114-82.97.amzn1.x86_64
version      : #1 SMP Sun Apr 28 07:27:43 UTC 2019
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping:              1
CPU MHz:               2701.438
BogoMIPS:              4600.07
Hypervisor vendor:     Xen
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-3
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0017 sec, LOAD: 0.6884 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.1339 sec, LOAD: 0.3958 sec.
Timing for Gluon Tutorial(cn): https://zh.gluon.ai, DNS: 0.1478 sec, LOAD: 0.4110 sec.
Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0270 sec, LOAD: 0.5201 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0032 sec, LOAD: 0.1016 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0016 sec, LOAD: 0.0433 sec.

Package used (Python/R/Scala/Julia): Python

Error Message:


---------------------------------------------------------------------------
MXNetError                                Traceback (most recent call last)
<ipython-input-69-6bfc070aed48> in <module>()
     13 
     14 loss.backward()
---> 15 l = mx.nd.mean(loss).asnumpy()

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/ndarray/ndarray.py in asnumpy(self)
   1994             self.handle,
   1995             data.ctypes.data_as(ctypes.c_void_p),
-> 1996             ctypes.c_size_t(data.size)))
   1997         return data
   1998 

~/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/base.py in check_call(ret)
    251     """
    252     if ret != 0:
--> 253         raise MXNetError(py_str(_LIB.MXGetLastError()))
    254 
    255 

MXNetError: [23:39:12] include/mxnet/././tensor_blob.h:236: Check failed: mshadow::DataType<DType>::kFlag == type_flag_: TBlob.get_with_shape: data type do not match specified type.Expected: 2 v.s. given 0
Stack trace:
  [bt] (0) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x4ac1eb) [0x7ff57de371eb]
  [bt] (1) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x30c8972) [0x7ff580a53972]
  [bt] (2) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x31dc115) [0x7ff580b67115]
  [bt] (3) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(mxnet::imperative::PushFCompute(std::function<void (nnvm::NodeAttrs const&, mxnet::OpContext const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&, std::vector<mxnet::TBlob, std::allocator<mxnet::TBlob> > const&)> const&, nnvm::Op const*, nnvm::NodeAttrs const&, mxnet::Context const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::engine::Var*, std::allocator<mxnet::engine::Var*> > const&, std::vector<mxnet::Resource, std::allocator<mxnet::Resource> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<mxnet::NDArray*, std::allocator<mxnet::NDArray*> > const&, std::vector<unsigned int, std::allocator<unsigned int> > const&, std::vector<mxnet::OpReqType, std::allocator<mxnet::OpReqType> > const&)::{lambda(mxnet::RunContext)#1}::operator()(mxnet::RunContext) const+0x307) [0x7ff57ffd9f47]
  [bt] (4) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x259adf4) [0x7ff57ff25df4]
  [bt] (5) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25a8789) [0x7ff57ff33789]
  [bt] (6) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25abbf0) [0x7ff57ff36bf0]
  [bt] (7) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25abe86) [0x7ff57ff36e86]
  [bt] (8) /home/ec2-user/anaconda3/envs/mxnet_p36/lib/python3.6/site-packages/mxnet/libmxnet.so(+0x25a6f94) [0x7ff57ff31f94]
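
Note on reading the check failure: the two numbers in `Expected: 2 v.s. given 0` are mshadow type flags. The sketch below decodes them, assuming the usual mshadow `TypeFlag` numbering (0 = float32, 1 = float64, 2 = float16, and so on). The helper name `decode_type_mismatch` is made up for illustration and is not part of MXNet.

```python
import re

# mshadow TypeFlag numbering as used by MXNet 1.x (assumed here, not queried
# from the library): each integer identifies one element type.
MSHADOW_TYPE_FLAGS = {
    0: "float32",
    1: "float64",
    2: "float16",
    3: "uint8",
    4: "int32",
    5: "int8",
    6: "int64",
}


def decode_type_mismatch(message):
    """Return the (expected, given) dtype names from a TBlob check failure.

    Hypothetical helper: it only parses the text of the error message.
    Returns None when the message does not contain the expected pattern.
    """
    match = re.search(r"Expected: (\d+) v\.s\. given (\d+)", message)
    if match is None:
        return None
    expected_flag, given_flag = (int(g) for g in match.groups())
    return (
        MSHADOW_TYPE_FLAGS.get(expected_flag, "unknown"),
        MSHADOW_TYPE_FLAGS.get(given_flag, "unknown"),
    )


error_text = (
    "TBlob.get_with_shape: data type do not match specified type."
    "Expected: 2 v.s. given 0"
)
print(decode_type_mismatch(error_text))
```

Under that numbering the message pairs float16 with float32. That is consistent with some operator in the LSTM/CTCLoss path accessing a float16 array as float32, though the stack trace alone does not show which operator it is.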

Minimum reproducible example


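import mxnet as mx
from mxnet.gluon.rnn import LSTM

# Random float16 inputs and labels, moved to the GPU.
fake_data = mx.nd.random.uniform(shape=(1, 32, 32), dtype="float16").as_in_context(mx.gpu(0))
fake_label = mx.nd.random.uniform(shape=(1, 32), dtype="float16").as_in_context(mx.gpu(0))

# Single-layer LSTM with 32 hidden units and float16 parameters.
lstm_layer = LSTM(32, dtype='float16')
lstm_layer.initialize(ctx=mx.gpu(0))

# CTCLoss with its default settings.
ctc_loss = mx.gluon.loss.CTCLoss()

# Forward pass recorded by autograd so that backward() can run.
with mx.autograd.record():
    x = lstm_layer(fake_data)
    loss = ctc_loss(x, fake_label)

loss.backward()
# The MXNetError in the traceback above is raised here, at asnumpy().
l = mx.nd.mean(loss).asnumpy()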

mxnet-label-bot commented 5 years ago

Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: Bug