apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

[CI] unix cpu validation Timeout #15880

Open ChaiBapchya opened 4 years ago

ChaiBapchya commented 4 years ago

The Python 3 MKL CPU stage times out after more than 3 hours.

Shell script runs for 3h http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15794/2/pipeline/281/

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15794/1/pipeline/283

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15794/6/pipeline But what's the cause?

PR #15794 doesn't make any change to the C API.

mxnet-label-bot commented 4 years ago

Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: CI, Build

ChaiBapchya commented 4 years ago

@mxnet-label-bot add [CI]

ChaiBapchya commented 4 years ago

Another PR #15785 http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15785/8/pipeline

Python3 MKLDNN MKL CPU http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15785/6/pipeline/284

ChaiBapchya commented 4 years ago

Another PR #15881 http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15881/7/pipeline

ChaiBapchya commented 4 years ago

Another PR #15769 http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15769/6/pipeline

DickJC123 commented 4 years ago

test_random.py:test_shuffle is taking a long time to run. I've seen cpu runtimes between 10 and 50 minutes for that test alone. I've developed a fix and piggy-backed it onto a pending PR of mine: https://github.com/apache/incubator-mxnet/pull/15882.

ChaiBapchya commented 4 years ago

Another PR, #15541: Python 3 CPU runs for 4 hours before terminating! http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15541/8/pipeline/279

pengzhao-intel commented 4 years ago

This is interesting. We need to figure out whether the increased computation causes the problem. @zixuanweeei, could you help take a look at the CI?

zixuanweeei commented 4 years ago

@pengzhao-intel I ran test_random.py:test_shuffle three times and saw CPU runtimes of more than 10 minutes. There has been a lot of discussion about the shuffle operator, for example PR #10048, PR #15882 and issue #10277. I will survey those first.
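A minimal sketch of how repeated timings like the three runs mentioned above could be collected. The helper `time_runs` and the stand-in workload are hypothetical names introduced for illustration; in practice the callable would invoke the test under investigation.

```python
import time
from statistics import mean


def time_runs(func, repeats=3):
    """Call func `repeats` times; return each call's wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def stand_in_workload():
    # Hypothetical placeholder for the real test body.
    sum(i * i for i in range(100_000))


durations = time_runs(stand_in_workload, repeats=3)
print("mean over %d runs: %.4fs" % (len(durations), mean(durations)))
```

Reporting every run rather than only the mean makes it easier to spot a single outlier on a shared CI machine.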

pengzhao-intel commented 4 years ago

Thanks @zixuanweeei

Could we collect and sort the runtimes of all test cases on the CPU side (CPU, CPU+MKL, CPU+MKLDNN)? After that, we can see how a new PR, such as @ChaiBapchya's large tensor PR, changes the runtime.
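One possible way to collect and sort per-test runtimes, sketched under the assumption that the test runner can emit a JUnit-style XML report whose `testcase` elements carry `classname`, `name` and `time` attributes. The function name `slowest_tests` is hypothetical, and the sample report holds made-up numbers for illustration only, not measurements from this CI.

```python
import xml.etree.ElementTree as ET


def slowest_tests(xml_text, top=10):
    """Return the `top` slowest (test name, seconds) pairs from a JUnit-style report."""
    root = ET.fromstring(xml_text)
    timings = []
    for case in root.iter("testcase"):
        name = "%s.%s" % (case.get("classname", ""), case.get("name", ""))
        timings.append((name, float(case.get("time", "0"))))
    # Sort by duration, slowest first.
    timings.sort(key=lambda item: item[1], reverse=True)
    return timings[:top]


# Illustrative report with invented durations (assumption, not CI data).
SAMPLE_REPORT = """<testsuite name="sample" tests="3">
  <testcase classname="test_operator" name="test_relu" time="0.8"/>
  <testcase classname="test_random" name="test_shuffle" time="612.4"/>
  <testcase classname="test_operator" name="test_convolution_independent_gradients" time="305.1"/>
</testsuite>"""

for name, seconds in slowest_tests(SAMPLE_REPORT, top=3):
    print("%8.1fs  %s" % (seconds, name))
```

Running the same script on the reports from each configuration (CPU, CPU+MKL, CPU+MKLDNN), before and after a PR, would give the side-by-side comparison asked for here.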

zixuanweeei commented 4 years ago

Sure. @pengzhao-intel

BTW, I disabled the MKLDNN subgraph backend to see whether it affects the efficiency of the shuffle operator. The results show that the shuffle operator takes the same time with and without the MKLDNN subgraph backend.

zixuanweeei commented 4 years ago

Some fixes from PR #15882 and PR #15922 (they contain the same fix for test_shuffle) have reduced the cost from more than 10 minutes to no more than 2 minutes. It took ~41s in a local test. The fix doesn't seem to alter the functionality of the test; it just drops the needless equality assertions.

ChaiBapchya commented 4 years ago

Another one #15736 http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15736/10/

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15736/11/pipeline/291

zixuanweeei commented 4 years ago

From the last comment by @ChaiBapchya, we also found that test_operator.test_convolution_independent_gradients takes too long. That test was run on a library compiled with MKL-DNN, so it will take even longer in the CPU context when MXNet is compiled without MKL-DNN. If PR #15922 works for test_shuffle, we will then reduce the cost of test_operator.test_convolution_independent_gradients.

aaronmarkham commented 4 years ago

4 hr timeout on the python3 mkldnn-mkl-cpu test. Why is this test still active? It causes a lot of issues with getting PRs through the pipeline. http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-16342/2/pipeline/266

ChaiBapchya commented 4 years ago

4 hr timeout again! MKL CPU http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-16336/6/pipeline/263

PR #16336 is a step towards getting conclusive evidence about the perennially slow unit tests. Hopefully we will get clarity once that PR is merged.

I am leaning towards disabling this test until the timeout issue for MKLDNN is fixed! @aaronmarkham

ChaiBapchya commented 4 years ago

Another one - http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-16625/1/pipeline/300