apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

CI tests timing out #16524

Open aaronmarkham opened 4 years ago

aaronmarkham commented 4 years ago

CI really needs some attention. I had three PRs yesterday, and all of them failed on one test or another due to timeouts. Tests need to be broken up or streamlined. It shouldn't take 4 hours for tests to run and then time out.

I'm flagging the 1.5 GB ImageNet model and related tests. I think these should be moved to nightly. https://github.com/apache/incubator-mxnet/blob/master/cpp-package/tests/ci_test.sh#L69-L70
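One way to move a heavy test to nightly without deleting it is to gate it behind an environment flag that only the nightly job sets. The following is a minimal Python sketch of that idea, not the actual MXNet CI setup: the variable name `RUN_NIGHTLY_TESTS`, the helper `nightly_enabled`, and the test class are all hypothetical, and the test body is a placeholder.

```python
import os
import unittest

# Hypothetical flag name; not an existing MXNet or Jenkins setting.
NIGHTLY_ENV_VAR = "RUN_NIGHTLY_TESTS"


def nightly_enabled(environ=None):
    """Return True when the (hypothetical) nightly flag is set to "1"."""
    env = os.environ if environ is None else environ
    return env.get(NIGHTLY_ENV_VAR) == "1"


class LargeModelTest(unittest.TestCase):
    """Stand-in for a test that downloads and runs a very large model."""

    def setUp(self):
        # Skip in per-PR runs; the nightly job would export the flag.
        if not nightly_enabled():
            self.skipTest("large-model test runs only in the nightly job")

    def test_inference(self):
        # Placeholder for the expensive download-and-inference check.
        self.assertTrue(True)
```

With this shape, a per-PR run reports the test as skipped instead of spending time on it, and the nightly job opts in by exporting the flag before invoking the test runner.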

GPU: CUDA10.1+cuDNN7 - 3-hour timeout: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-16500/15/pipeline/46

dist-kvstore tests GPU - 3-hour timeout: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-16512/1/pipeline

Python2: CPU - 4-hour timeout: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-16514/1/pipeline/260

mxnet-label-bot commented 4 years ago

Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended label(s): Test, CI

aaronmarkham commented 4 years ago

Also, this one had 3 timeouts: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-16496/3/pipeline/45

hgt312 commented 4 years ago

Another timeout: https://github.com/apache/incubator-mxnet/issues/16422

aaronmarkham commented 4 years ago

Another timeout on unix-gpu, kvstore: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-16512/4/pipeline/

roywei commented 4 years ago

Unix-gpu kvstore is fixed now. Next up is looking into the other timeouts.

aaronmarkham commented 4 years ago

Timeout on GPU: CMake TVM_OP OFF http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-16598/2/pipeline/54