Open nagyt2 opened 3 years ago
Description
After our Python program has been running for several hours, it reaches a point where it simply hangs. I was able to attach to the process with both strace and gdb. strace showed that the process is waiting on a futex:
The backtrace in gdb then clearly shows that the issue comes from MXNDArraySyncCopyToCPU (please see lines 4-7):
Error Message
There is no error message; the application just hangs.
To Reproduce
The issue occurs sporadically, sometimes after hours and sometimes after days, so it is hard to reproduce.
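Because the hang is so rare, it may help to leave a watchdog armed around the blocking call so that the Python-level stack is recorded the next time it stalls. The following is only a minimal sketch using the standard-library `faulthandler` module; the helper name `call_with_watchdog`, its parameters, and the timeout are assumptions made for illustration and are not part of MXNet. In the Python API, `MXNDArraySyncCopyToCPU` is reached through calls such as `NDArray.asnumpy()`, so that is the kind of call one would wrap.

```python
import faulthandler
import sys


def call_with_watchdog(fn, timeout_s, *args, log_file=None, **kwargs):
    """Run fn(*args, **kwargs) with a traceback watchdog armed.

    If the call has not returned after timeout_s seconds, the Python
    traceback of every thread is written to log_file (sys.stderr by
    default) and written again every timeout_s seconds after that.
    The call itself is NOT interrupted; this only records where the
    Python side was stuck, to complement strace and gdb output.
    """
    target = log_file if log_file is not None else sys.stderr
    faulthandler.dump_traceback_later(timeout_s, repeat=True, file=target)
    try:
        return fn(*args, **kwargs)
    finally:
        # Disarm the watchdog as soon as the call returns or raises.
        faulthandler.cancel_dump_traceback_later()
```

A hypothetical use would be `result = call_with_watchdog(output.asnumpy, 120)`, where `output` stands for whatever NDArray the application copies back to the host. The file passed as `log_file` must be a real file with a file descriptor and has to stay open for as long as the watchdog is armed.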
Environment
The runtime environment is based on the NVIDIA Jetson Nano. The processor is ARMv8 and the board has 4 GB of RAM. We use MXNet 1.6 with CUDA 10.0.
Please note that the Python version reported by diagnose.py differs from the one we use to run our application. (We use pyenv, not the Python package shipped with the distribution.) We run the application with Python 3.7.7.
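The mismatch is also the likely reason the report says "No MXNet installed": diagnose.py was run by the system interpreter, which cannot see packages installed under pyenv. A quick way to confirm which interpreter is active and whether it can find MXNet is sketched below. It uses only the standard library; the function name `describe_interpreter` is an assumption made for illustration.

```python
import importlib.util
import platform
import sys


def describe_interpreter(package="mxnet"):
    """Report the running interpreter and whether it can locate `package`.

    find_spec() only searches for the top-level package; it does not
    import it, so this is safe to run even if importing would fail.
    """
    spec = importlib.util.find_spec(package)
    return {
        "executable": sys.executable,
        "python_version": platform.python_version(),
        "package": package,
        "package_found": spec is not None,
        "package_location": getattr(spec, "origin", None),
    }


for key, value in describe_interpreter().items():
    print(f"{key}: {value}")
```

Running this once with the pyenv interpreter and once with the system `python3` should show two different `executable` paths, and `package_found` should be true only for the interpreter that actually runs the application.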
Environment Information
```
----------Python Info----------
Version : 3.6.9
Compiler : GCC 8.4.0
Build : ('default', 'Jan 26 2021 15:33:00')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 9.0.1
Directory : /usr/lib/python3/dist-packages/pip
----------MXNet Info-----------
No MXNet installed.
----------System Info----------
Platform : Linux-4.9.140-tegra-aarch64-with-Ubuntu-18.04-bionic
system : Linux
node : sc1
release : 4.9.140-tegra
version : #1 SMP PREEMPT Mon Dec 9 22:47:42 PST 2019
----------Hardware Info----------
machine : aarch64
processor : aarch64
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Vendor ID: ARM
Model: 1
Model name: Cortex-A57
Stepping: r1p1
CPU max MHz: 1479.0000
CPU min MHz: 102.0000
BogoMIPS: 38.40
L1d cache: 32K
L1i cache: 48K
L2 cache: 2048K
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0020 sec, LOAD: 0.5257 sec.
Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0012 sec, LOAD: 0.2078 sec.
Error open Gluon Tutorial(cn): https://zh.gluon.ai,
```
If I can provide any more information to help hunt this bug down, please do not hesitate to let me know!