apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, JavaScript and more
https://mxnet.apache.org
Apache License 2.0

CI timeout on unix-cpu Python2 test #16995

Open aaronmarkham opened 4 years ago

aaronmarkham commented 4 years ago

Description

The unix-cpu > Python2 stage hit the 4-hour timeout on a couple of recent runs, blocking PRs.

Occurrences

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-16980/6/pipeline/294
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-16986/2/pipeline/

What have you tried to solve it?

  1. Restarted the test.
apeforest commented 4 years ago

The latest PR passed in 1 hr 37 min: https://github.com/apache/incubator-mxnet/pull/16992 It does seem quite flaky.

larroy commented 4 years ago

I diagnosed some earlier timeouts as coming from the EFS rate limit; the shared EFS ccache is currently disabled in the master CI for that reason. I don't like shared state, although a shared EFS ccache might provide some value, which should be measured. The problem was that the cache had grown too large, and too many small files overload EFS IO. In this case, though, from what I see the test stage itself is taking 4 h, and the cache is disabled.
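If the shared cache is ever re-enabled, one way to keep the file count and footprint bounded is ccache's own configuration. A sketch; the 50G figure is an assumption for illustration, not a measured value:

```ini
# ~/.ccache/ccache.conf (or the file pointed to by CCACHE_CONFIGPATH)
# Cap the cache so eviction keeps the total size and file count
# manageable on a shared EFS mount.
max_size = 50G
# Compressed objects mean fewer bytes per file transferred over EFS.
compression = true
```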

I would suggest checking the test output http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-16986/runs/2/nodes/294/steps/685/log/?start=0, running it locally on the same type of instance, and comparing the test times to see the duration of each test; for example, sort the two columns in Excel and check whether some test is getting stuck and slowing down the full suite.
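That comparison can be scripted rather than done in Excel. A minimal sketch, assuming the nose-style `test_name ... ok (0.2147s)` result lines seen in these logs; the helper name and sample log are made up:

```python
import re

# Matches nose-style result lines such as:
#   test_numpy_op.test_np_unique ... ok (0.2147s)
RESULT_LINE = re.compile(r"^(\S+) \.\.\. ok \((\d+\.\d+)s\)", re.MULTILINE)

def slowest_tests(log_text, top=5):
    """Return (test, seconds) pairs from a run's log, slowest first."""
    timings = [(name, float(sec)) for name, sec in RESULT_LINE.findall(log_text)]
    return sorted(timings, key=lambda t: t[1], reverse=True)[:top]

sample = """\
test_numpy_op.test_np_unique ... ok (0.2147s)
test_operator.test_convolution ... ok (125.3000s)
test_gluon.test_dataloader ... ok (3.1000s)
"""
print(slowest_tests(sample)[0])  # slowest test in the sample
```

Running this over the logs of a 1:30 run and a timed-out run side by side would show whether one test dominates the difference.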

larroy commented 4 years ago

Compare with a normal run, which takes 1:30: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/master/1350/pipeline/294

szha commented 4 years ago

Let's use the flaky label just for tests.

leezu commented 4 years ago

Happened again on 2 PRs (16971, 17018)

leezu commented 4 years ago

Can we use https://aws.amazon.com/about-aws/whats-new/2018/07/amazon-efs-now-supports-provisioned-throughput/ as a short-term solution?

larroy commented 4 years ago

EFS is disabled now. I think the throughput was already set to the maximum, but there are too many small files and the cache was too big; maybe with a more reasonable cache size it would work. That also means EFS can't be the only cause of the timeouts. It could be an IO bottleneck in Jenkins or somewhere else. Could the VPC suffer from an IO limit?

larroy commented 4 years ago

@leezu could you link to the PRs? Is there something interesting in the logs?

larroy commented 4 years ago

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-16976/2/pipeline/294

marcoabreu commented 4 years ago

Reading this thread really hurts. Can we please have less guessing and more root-causing?

How would a unit test be related to ccache, considering it's not a compile task? Why should network or disk IO be a problem in a suite of tests? Even with limited IO, I've never heard of a case where a request gets terminated or deadlocks on hitting the limit; AWS always uses throttling.

MXNet is a highly parallel piece of software. If something in the tests were getting stuck intermittently, my first thought would be to actually look into the software under test, not to plainly blame the infrastructure. You wouldn't blame the processor when your program prints a wrong value, right? So please, for the sake of the entire project, can we stop the guessing game and stop turning off production systems as a measure to apparently resolve or root-cause an issue?

leezu commented 4 years ago

> @leezu could you link to the PRs? Is there something interesting in the logs?

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-16971/8/pipeline
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-17018/5/pipeline/294

I didn't check the logs in detail, but both were aborted with `Sending interrupt signal to process`.

cjolivier01 commented 4 years ago

Notice the last test is (or is very near to) test_numpy_op.test_np_unique, so what's the next test after that? That one could be hanging until the system kills it.

Run the numpy tests locally.

I don't see anything that would indicate an infrastructure problem, considering it's only one test run that fails and it fails in basically the same place every time. I am curious how that conclusion was reached.

szha commented 4 years ago

We should probably use parallel testing with the naive engine for operator tests, so that one failure doesn't spill over into unrelated tests.
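MXNet's engine can be selected with the documented `MXNET_ENGINE_TYPE` environment variable; with `NaiveEngine`, operators execute synchronously, so a hang or crash surfaces at the offending test rather than a later one. A minimal sketch of how a test runner might set it:

```python
import os

# NaiveEngine disables asynchronous execution, so a failure is
# attributed to the exact operator call that caused it. It must be
# set in the environment before the test process imports mxnet.
os.environ["MXNET_ENGINE_TYPE"] = "NaiveEngine"
print(os.environ["MXNET_ENGINE_TYPE"])
```

The trade-off is speed: the naive engine serializes operator execution, so the suite runs slower, which is why pairing it with parallel test workers is suggested.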

larroy commented 4 years ago

@marcoabreu you didn't read carefully: nobody wrote that it was related to ccache, nor that it was disabled because of this test; that's why it hurts. You might want to try to reproduce the test run with Docker instead of only enlightening us with your comments of infinite wisdom. If this doesn't happen locally with the same AMI, it's either related to the worker or to the communication with the master; would that be a valid assumption?

We have open positions if you are excited about maintaining this system. Feel free to send me your CV; we are hiring.

larroy commented 4 years ago

This is what I wrote above, for everyone's convenience:

> I would suggest checking the test output http://jenkins.mxnet-ci.amazon-ml.com/blue/rest/organizations/jenkins/pipelines/mxnet-validation/pipelines/unix-cpu/branches/PR-16986/runs/2/nodes/294/steps/685/log/?start=0, running it locally on the same type of instance, and comparing the test times to see the duration of each test; for example, sort the two columns in Excel and check whether some test is getting stuck and slowing down the full suite.

Also, my browsers are hanging while trying to load the links pasted here; thanks, Jenkins.

TaoLv commented 4 years ago

Timeout again: http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-17355/3/pipeline/294

At the same place:

```
test_numpy_op.test_np_unique ... ok (0.2147s)
Sending interrupt signal to process
```

larroy commented 4 years ago

Thanks, the team will look into it in the following days.