Open ChaiBapchya opened 4 years ago
Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: CI, Build
@mxnet-label-bot add [CI]
test_random.py:test_shuffle is taking a long time to run. I've seen CPU runtimes between 10 and 50 minutes for that test alone. I've developed a fix and piggy-backed it onto a pending PR of mine: https://github.com/apache/incubator-mxnet/pull/15882.
Another PR, #15541: the Python 3 CPU job ran for 4 hours before terminating! http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15541/8/pipeline/279
This is interesting; we need to figure out whether the increased computation is causing the problem. @zixuanweeei could you help take a look at the CI?
@pengzhao-intel I've seen CPU runtimes of more than 10 minutes in each of three runs of test_random.py:test_shuffle. There has already been a lot of discussion of the shuffle operator, e.g. PR #10048, PR #15882 and issue #10277. I will look through those first.
Thanks @zixuanweeei
Could we collect and sort the runtimes of all test cases on the CPU side (CPU, CPU+MKL, CPU+MKLDNN)? After that, we can see how a new PR, like @ChaiBapchya's large tensor PR, changes the runtime.
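One way to get such a ranking is sketched below. It assumes each CI job writes a JUnit-style XML report with a `time` attribute on every `<testcase>` element; whether the MXNet jobs emit reports in exactly this shape is an assumption. The helper name `slowest_tests` is hypothetical, and the sample report uses made-up test names and timings purely for illustration.

```python
import xml.etree.ElementTree as ET


def slowest_tests(xml_text, top=10):
    """Return (test id, seconds) pairs sorted from slowest to fastest."""
    root = ET.fromstring(xml_text)
    timings = []
    # iter() finds <testcase> elements at any depth, so both a bare
    # <testsuite> root and a <testsuites> wrapper are handled.
    for case in root.iter("testcase"):
        test_id = "{}.{}".format(case.get("classname", ""), case.get("name", ""))
        timings.append((test_id, float(case.get("time", 0.0))))
    timings.sort(key=lambda item: item[1], reverse=True)
    return timings[:top]


# Made-up sample report (hypothetical names and timings, not CI measurements).
SAMPLE = """
<testsuite name="sample">
  <testcase classname="suite" name="test_fast" time="0.2"/>
  <testcase classname="suite" name="test_slow" time="12.5"/>
  <testcase classname="suite" name="test_medium" time="3.0"/>
</testsuite>
"""

if __name__ == "__main__":
    for test_id, seconds in slowest_tests(SAMPLE):
        print("{:8.1f}s  {}".format(seconds, test_id))
```

Running this once per build flavour (CPU, CPU+MKL, CPU+MKLDNN) and comparing the lists before and after a PR would show which tests moved.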
Sure. @pengzhao-intel
BTW, I disabled the MKLDNN subgraph backend to see whether it affects the efficiency of the shuffle operator. The results show that the shuffle operator takes the same time with and without the MKLDNN subgraph backend.
The fixes from PR #15882 and PR #15922 (both carry the same fix to test_shuffle) have reduced the cost from more than 10 minutes to no more than 2 minutes; it took ~41s in a local test. The fix doesn't seem to alter the functionality of the test. It just drops the needless equality assertions.
From the last comment by @ChaiBapchya, we also found that test_operator.test_convolution_independent_gradients takes too long. That test was run on a library compiled with MKL-DNN, so it will take even longer in the CPU context when MXNet is compiled without MKL-DNN. Should PR #15922 work for test_shuffle, we would then reduce the cost of test_operator.test_convolution_independent_gradients.
4 hr timeout on the python3 mkldnn-mkl-cpu test. Why is this test still active? It causes a lot of issues with getting PRs through the pipeline. http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-16342/2/pipeline/266
4 hr timeout again! MKL CPU http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-16336/6/pipeline/263
I am leaning towards disabling this test until the timeout issue for MKLDNN is fixed! @aaronmarkham
Python 3 MKL CPU: >3 hr timeout
Shell script runs for 3h http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15794/2/pipeline/281/
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15794/1/pipeline/283
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15794/6/pipeline
But what's the cause?
PR #15794 doesn't make any change to C API.