Open r3stl355 opened 4 years ago
@r3stl355 thanks for reporting the issue. Could you set the environment variables DMLC_LOG_STACK_TRACE_DEPTH=150 and MXNET_ENGINE_TYPE=NaiveEngine, run the same program again, and share the stack trace?
Thank you @szha, MXNET_ENGINE_TYPE=NaiveEngine did the trick: no more errors. I'm now going to read up on it to understand what it does.
@r3stl355 it forces execution to be synchronous. If that resolves the issue, it means there's a race condition that needs to be fixed.
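For anyone who wants to repeat this check, here is a minimal sketch of running the failing test with both variables applied. The variable names and values and the test target come from this thread; the helper names build_debug_env and run_test are hypothetical, and the sketch assumes nosetests is on PATH and that it is run from the root of an MXNet source checkout. Launching a child process is one way to make sure the variables are already in place when MXNet is loaded.

```python
import os
import subprocess

# Variables suggested earlier in this thread.
DEBUG_VARS = {
    "DMLC_LOG_STACK_TRACE_DEPTH": "150",   # request a deeper stack trace
    "MXNET_ENGINE_TYPE": "NaiveEngine",    # run operations synchronously
}

# Test target taken from the reproduction command in this issue.
TEST_TARGET = "tests/python/unittest/test_operator.py:test_index_copy"


def build_debug_env(base_env=None):
    """Return a copy of the environment with the debug variables applied.

    Hypothetical helper: the caller's mapping is left unmodified.
    """
    env = dict(os.environ if base_env is None else base_env)
    env.update(DEBUG_VARS)
    return env


def run_test(target=TEST_TARGET):
    """Run the test in a child process and return its exit code.

    Assumes nosetests is installed and the working directory is the
    root of an MXNet source checkout.
    """
    result = subprocess.run(["nosetests", target], env=build_debug_env())
    return result.returncode
```

Calling run_test() once with the variables and once without (plain nosetests) gives a quick comparison: if the bus error only disappears under NaiveEngine, that points to the race condition described above.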
Description
This only happens on my Mac on v1.x. The same test on v2.0 runs and passes, and I can see that it also succeeds on the centos-cpu CI. Other posts I found online about this type of error suggest a possible segfault as the underlying problem.
To reproduce
nosetests tests/python/unittest/test_operator.py:test_index_copy
Environment
OS: macOS Catalina 10.15.6
clang --version: Apple clang version 12.0.0 (clang-1200.0.31.1), Target: x86_64-apple-darwin19.6.0, Thread model: posix
What have you tried to solve it?
The bus error is thrown at this line: https://github.com/apache/incubator-mxnet/blob/8dbed966e35b979d8f770b0d5b0ec9f707b3a2f1/tests/python/unittest/test_operator.py#L5711
Commenting out https://github.com/apache/incubator-mxnet/blob/8dbed966e35b979d8f770b0d5b0ec9f707b3a2f1/tests/python/unittest/test_operator.py#L5709 prevents the error, but then the test fails at line 5711 when comparing gradients. The same happens at line 5717 and the subsequent assertions.