apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

test_numpy_op.py::test_np_empty_like hangs #18144

Open szha opened 4 years ago

szha commented 4 years ago

Description

test_numpy_op.py::test_np_empty_like hangs on unix-gpu

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18025/59/pipeline/425

leezu commented 4 years ago

In the linked CI run, test_numpy_op.py::test_np_empty_like is not run, so it can't be responsible for the hang. There must be other triggers besides test_np_empty_like.

Related https://github.com/apache/incubator-mxnet/issues/18090

haojin2 commented 4 years ago

Wow, this issue is VERY INTERESTING. In the first link given in the issue description I'm not seeing test_np_empty_like being run at all, and the last test run before the final timeout process kill was test_np_bincount. Also, as @leezu pointed out in the comment above, even removing test_np_empty_like does not solve the issue. So, to conclude, so far I'm not seeing any solid evidence that test_np_empty_like is the root cause of the hang. To be clear, I'm not saying we shouldn't re-implement empty_like with a native implementation in the future. I simply want to suggest that maybe you guys are attacking the wrong target at the moment.

leezu commented 4 years ago

@haojin2 you can check #18090 for the evidence. The problem with the above commit is that it disables only empty_like, not the other numpy operators that rely on CustomOp. With those disabled as well in https://github.com/apache/incubator-mxnet/pull/18151, CI has passed without a hang 2 times in a row so far. You're right that this doesn't fix the root cause. The objective here is to restore CI stability.
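
For illustration only, a minimal sketch of what disabling every CustomOp-dependent test at once could look like. This is not the actual change in the linked PR: the helper `skip_custom_op`, the reason string, and the second test name are hypothetical, and it uses the standard library's `unittest.skip` rather than whatever mechanism the MXNet test suite uses.

```python
import unittest

# Illustrative reason string; #18090 is the issue referenced in this thread.
CUSTOM_OP_HANG = "relies on CustomOp; disabled to avoid CI hang, see #18090"


def skip_custom_op(test_fn):
    """Hypothetical helper: mark a test as skipped because it uses CustomOp."""
    return unittest.skip(CUSTOM_OP_HANG)(test_fn)


@skip_custom_op
def test_np_empty_like():
    # Body is never reached while the skip decorator is applied.
    raise AssertionError("should never execute while skipped")


@skip_custom_op
def test_np_other_custom_op():  # hypothetical placeholder name
    raise AssertionError("should never execute while skipped")
```

Routing all affected tests through one decorator means they can be re-enabled together by removing it once the root cause is fixed. Calling a decorated function raises `unittest.SkipTest`, which test runners are expected to report as a skip rather than a failure.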

haojin2 commented 4 years ago

@leezu I understand the goal, but my point is that we should avoid providing unrelated info in the issue's description (the hang in the first provided link is not related at all), shouldn't we? It would have been better if the link to #18090 had been provided in the first place to avoid such confusion, don't you agree?

leezu commented 4 years ago

I agree. #18090 should have been linked, but it may have been left out unintentionally.