apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Several crashes found in MXNet versions 1.5.x, 1.4.x, and 1.3.x #21019

Open rubbberrabbit opened 2 years ago

rubbberrabbit commented 2 years ago

Description

Hello, we are trying to use Keras as the front end for MXNet and have found several MXNet crashes. We are not sure whether these models trigger a real bug, so we collected the execution stack information at the time of each crash. Most of the frames are in libmxnet.so, which is hard to compile in debug mode. Below are the MXNet versions and the stack information for our models.

In addition, the crash-triggering models and the replay script are available at https://drive.google.com/drive/folders/1he3I-1PKGI01t09E2FAin0_mUnmu2-oz?usp=sharing

Error Message

Some of the stack traces are shown below:

image-20220504220358522
image-20220504220358524
image-20220504222003296

To Reproduce

Steps to reproduce

  1. Download the scripts and models from the Google Drive link above
  2. Change the path in /scripts/bugs_replay.conf to the path of the model folder
  3. Run mxnet_test.py in the corresponding environment

What have you tried to solve it?

To analyze the stack information in libmxnet.so, we tried to compile MXNet with DEBUG=1 in config.mk, but the build failed with a "relocation truncated to fit" error in several different environments. We assume this is because the debug build adds too much extra code.
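
While a debug build of libmxnet.so is unavailable, one partial workaround is to capture at least the Python-level stack at the moment of the crash. The sketch below is not part of the provided replay scripts; it uses the standard-library `faulthandler` mechanism (enabled with `-X faulthandler`) and runs each test in a child interpreter so that a segfault does not take down the driver. The helper name `run_isolated` and the placeholder snippet are hypothetical. Note that `faulthandler` reports Python frames only, so it shows which Python call entered the native library, not the C++ frames inside it.

```python
import subprocess
import sys


def run_isolated(snippet, timeout=600):
    """Run a Python snippet in a child interpreter with faulthandler enabled.

    Returns (returncode, stdout, stderr). If the child dies from a fatal
    signal such as SIGSEGV, faulthandler writes the Python traceback to
    stderr before the process exits.
    """
    proc = subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", snippet],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # decode output as text (works on Python 3.6)
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Hypothetical placeholder: replace with the code that loads and runs
    # one of the crash-triggering models.
    snippet = "print('replace with the model replay code')"
    rc, out, err = run_isolated(snippet)
    print("return code:", rc)
    print("stdout:", out)
    print("stderr:", err)
```

On POSIX systems, a negative return code means the child was terminated by a signal, and the traceback in stderr can be attached to the report alongside the native stack information.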

Environment

Environment Information

| MXNet | Keras-MXNet | CUDA | Python |
|-------|-------------|------|--------|
| 1.5.1 | 2.2.4.2     | 10.1 | 3.6.12 |
| 1.4.1 | 2.2.4.2     | 10.0 | 3.6.12 |
| 1.3.1 | 2.2.4.2     | 9.0  | 3.6.12 |

github-actions[bot] commented 2 years ago

Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue. Please make sure to include all the relevant context, and one of the @apache/mxnet-committers will be here shortly. If you are interested in contributing to our project, let us know! Also, be sure to check out our guide on contributing to MXNet and our development guides wiki.