Open szha opened 4 years ago
In the linked CI run, test_numpy_op.py::test_np_empty_like is not run and thus can't be responsible for the hang. Thus there must be more triggers besides test_np_empty_like.
Related https://github.com/apache/incubator-mxnet/issues/18090
Wow, this issue is VERY INTERESTING. In the first link given in the issue description, I'm not even seeing test_np_empty_like being run at all, and the last test run before the final timeout process kill was test_np_bincount. Also, as @leezu pointed out in the above comment, even removing test_np_empty_like does not solve the issue. So to conclude, so far I'm not seeing any solid evidence supporting test_np_empty_like as the root cause of the hang.
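For anyone who wants to repeat this check on a CI log, here is a rough sketch. It assumes the log contains pytest verbose-style lines (`path::test_name`, optionally followed by an outcome); the helper name `summarize_log` and the sample log are made up for illustration and are not taken from the actual Jenkins output. It lists which tests started and reports the last one that never got an outcome.

```python
import re

# Assumed log format: pytest verbose output, one "path::test_name" id per line,
# optionally followed by an outcome such as PASSED or FAILED.
TEST_ID = re.compile(
    r"^(\S+\.py::\S+)(?:\s+(PASSED|FAILED|SKIPPED|ERROR|XFAIL|XPASS))?"
)


def summarize_log(log_text):
    """Return (ids of tests that started, id of the trailing test with no outcome).

    The second element is None when the last test seen in the log finished.
    """
    started = []
    unfinished = None
    for line in log_text.splitlines():
        match = TEST_ID.match(line.strip())
        if not match:
            continue  # not a test line (e.g. CI runner messages)
        test_id, outcome = match.groups()
        started.append(test_id)
        # Only a test line without an outcome is a hang candidate.
        unfinished = test_id if outcome is None else None
    return started, unfinished


# Hypothetical log excerpt, invented for this example.
SAMPLE_LOG = """\
test_numpy_op.py::test_np_sum PASSED
test_numpy_op.py::test_np_bincount
Sending interrupt signal to process
"""

started, unfinished = summarize_log(SAMPLE_LOG)
ran_empty_like = any(t.endswith("::test_np_empty_like") for t in started)
```

With a log shaped like the sample, `ran_empty_like` tells you whether test_np_empty_like appears at all, and `unfinished` names the test that was in flight when the job was killed. Real logs from parallel runs may interleave lines differently, so treat the regex as a starting point.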
To be clear, I'm not saying that we shouldn't re-implement empty_like with a native implementation in the future; I simply want to suggest that maybe you guys are attacking the wrong target at this moment.
@haojin2 you can check #18090 for the evidence. In the above commit, the problem is that only empty_like is disabled but not the other numpy operators relying on CustomOp. With those disabled as well in https://github.com/apache/incubator-mxnet/pull/18151, CI has passed without a hang 2 times in a row so far. You're right that this doesn't fix the root cause. The objective here is to restore CI stability.
@leezu I understand the goal, but my point is that we should avoid providing unrelated info in the issue's description (the hang in the first provided link is not related at all), shouldn't we? It'd be better if a link to #18090 had been provided in the first place to avoid such confusion, don't you agree?
I agree. #18090 should have been linked but may have been missed unintentionally.
Description
test_numpy_op.py::test_np_empty_like hangs on unix-gpu
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-18025/59/pipeline/425