apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

[v1.x] [MAC] test_operator.test_index_copy causes `bus error`, possible segfault #19101

Open r3stl355 opened 4 years ago

r3stl355 commented 4 years ago

Description

This only happens on my Mac on v1.x. The same test runs and passes on v2.0, and it also succeeds on the centos-cpu CI. Other posts I found online about this type of error suggest a segfault as a possible underlying cause.

To reproduce

nosetests tests/python/unittest/test_operator.py:test_index_copy

Environment

OS: macOS Catalina 10.15.6

clang --version: Apple clang version 12.0.0 (clang-1200.0.31.1) Target: x86_64-apple-darwin19.6.0 Thread model: posix

What have you tried to solve it?

The bus error is thrown at this line: https://github.com/apache/incubator-mxnet/blob/8dbed966e35b979d8f770b0d5b0ec9f707b3a2f1/tests/python/unittest/test_operator.py#L5711

Commenting out https://github.com/apache/incubator-mxnet/blob/8dbed966e35b979d8f770b0d5b0ec9f707b3a2f1/tests/python/unittest/test_operator.py#L5709 prevents the bus error, but the test then fails at line 5711 when comparing gradients. The same applies to line 5717 and the subsequent assertions.

szha commented 4 years ago

@r3stl355 thanks for reporting the issue. Could you set environment variables DMLC_LOG_STACK_TRACE_DEPTH=150 MXNET_ENGINE_TYPE=NaiveEngine and run the same program again and share the stacktrace?
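
For anyone following along, here is a minimal sketch of one way to apply those two variables and rerun the test from Python. The variable names and the nosetests command come from this thread. The helper names `build_debug_env` and `run_repro` are hypothetical, and the sketch assumes it is run from the root of an MXNet v1.x checkout with `nosetests` on the PATH.

```python
import os
import subprocess

# Environment variables suggested in this thread.
DEBUG_ENV = {
    "DMLC_LOG_STACK_TRACE_DEPTH": "150",   # request a deeper stack trace in the log
    "MXNET_ENGINE_TYPE": "NaiveEngine",    # select the NaiveEngine execution engine
}

# Reproduction command from the issue description.
REPRO_CMD = ["nosetests", "tests/python/unittest/test_operator.py:test_index_copy"]


def build_debug_env(base_env=None):
    """Return a copy of the environment with the debugging variables added."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(DEBUG_ENV)
    return env


def run_repro():
    """Run the failing test in a child process (hypothetical helper).

    Assumes the current directory is the root of the MXNet checkout.
    Returns the exit code of the nosetests process.
    """
    return subprocess.run(REPRO_CMD, env=build_debug_env()).returncode
```

Calling `run_repro()` starts the test in a fresh process, so the variables are in place before MXNet is imported.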

r3stl355 commented 4 years ago

Thank you @szha, MXNET_ENGINE_TYPE=NaiveEngine did the trick, no more errors. I'm now going to read up on it to understand what it does.

szha commented 4 years ago

@r3stl355 it forces the execution to be synchronous, and if it resolves the issue it means there's a race condition that needs to be resolved.
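
To illustrate the reasoning above, here is a toy model of why synchronous execution can hide a race. It is not MXNet's engine or its API: `Op`, `run_naive`, `run_async`, and `make_ops` are hypothetical names, and the "asynchronous" scheduler is a deliberately simple stand-in that is free to reorder operations whose declared reads and writes do not conflict. The assumed bug is an operation that reads `x` without declaring that read.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Op:
    fn: Callable[[dict], None]
    reads: Tuple[str, ...] = ()    # variables the op declares it reads
    writes: Tuple[str, ...] = ()   # variables the op declares it writes


def conflicts(earlier, later):
    """True if the declarations forbid running `later` before `earlier`."""
    return bool(
        set(earlier.writes) & (set(later.reads) | set(later.writes))
        or set(earlier.reads) & set(later.writes)
    )


def run_naive(ops, state):
    """Synchronous execution: run every op in the order it was pushed."""
    for op in ops:
        op.fn(state)
    return state


def run_async(ops, state):
    """Toy scheduler: run any op with no declared conflict against an
    earlier pending op. It prefers the most recently pushed ready op so
    that the reordering is deterministic in this example."""
    pending = list(ops)
    while pending:
        for i in range(len(pending) - 1, -1, -1):
            if not any(conflicts(pending[j], pending[i]) for j in range(i)):
                pending.pop(i).fn(state)
                break
    return state


def make_ops(declare_read):
    def write_x(s):
        s["x"] = 42

    def read_x(s):
        s["y"] = s["x"] + 1

    return [
        Op(write_x, writes=("x",)),
        # The assumed bug: the read of "x" is left out of the declaration.
        Op(read_x, reads=("x",) if declare_read else (), writes=("y",)),
    ]


fresh = lambda: {"x": 0, "y": None}
print(run_naive(make_ops(False), fresh())["y"])  # 43: in-order execution hides the bug
print(run_async(make_ops(True), fresh())["y"])   # 43: the declared read keeps the order
print(run_async(make_ops(False), fresh())["y"])  # 1: the reader ran before the writer
```

In this toy model the in-order run gives the right answer even with the missing declaration, which mirrors the observation in this thread: the error disappearing under NaiveEngine points to an ordering problem and does not fix it.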