apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Scala GPU Build examples CI failure #15605

Open ChaiBapchya opened 5 years ago

ChaiBapchya commented 5 years ago

Scala unix GPU build error in an unrelated PR #15541

[INFO] ------------------------------------------------------------------------
[INFO] Building MXNet Scala Package - Examples INTERNAL
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO] 
[INFO] MXNet Scala Package - Parent ....................... SUCCESS [ 30.879 s]
[INFO] MXNet Scala Package - Initializer .................. SUCCESS [  9.859 s]
[INFO] MXNet Scala Package - Initializer Native ........... SUCCESS [  1.473 s]
[INFO] MXNet Scala Package - Macros ....................... SUCCESS [ 13.290 s]
[INFO] MXNet Scala Package - Native ....................... SUCCESS [  4.672 s]
[INFO] MXNet Scala Package - Core ......................... SUCCESS [01:54 min]
[INFO] MXNet Scala Package - Inference .................... SUCCESS [ 16.025 s]
[INFO] MXNet Scala Package - Examples ..................... FAILURE [  2.620 s]
[INFO] MXNet Scala Package - Spark ML ..................... SKIPPED
[INFO] Assembly Scala Package ............................. SKIPPED
[INFO] MXNet Scala Package - Full linux-x86_64-only ....... SKIPPED
[INFO] MXNet Scala Package - Full linux-x86_64-only ....... SKIPPED
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:16 min
[INFO] Finished at: 2019-07-18T05:52:51+00:00
[INFO] Final Memory: 47M/3263M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project mxnet-examples: Could not resolve dependencies for project org.apache.mxnet:mxnet-examples:jar:INTERNAL: Could not transfer artifact nu.pattern:opencv:jar:2.4.9-7 from/to central (https://repo.maven.apache.org/maven2): GET request of: nu/pattern/opencv/2.4.9-7/opencv-2.4.9-7.jar from central failed: Connection reset -> [Help 1]
[ERROR] 
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR] 
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
[ERROR] 

Pipeline - http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-gpu/detail/PR-15541/5/pipeline/

mxnet-label-bot commented 5 years ago

Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: Scala, Test, CI, Build

ChaiBapchya commented 5 years ago

@mxnet-label-bot add [Scala, Test, CI, Build]

zachgk commented 5 years ago

@ChaiBapchya The issue is that the network connection failed while downloading a Maven dependency. I don't know how we would handle this other than by rerunning the Jenkins job.

ChaiBapchya commented 5 years ago

Haven't looked at the code. But isn't there some way to catch the exception and troubleshoot, instead of asking contributors to retrigger CI?

zachgk commented 5 years ago

The problem doesn't lie with Maven, so I don't think there is anything Maven can do to address it directly. Usually we would add some number of retries for downloading. I just looked and there doesn't seem to be a way to make Maven retry downloading. I would be hesitant to retry the entire Maven test because this doesn't seem to be a frequent problem and it would immediately multiply the cost of the Scala tests.
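One way to get retries without rerunning the whole Scala test suite would be to retry only a cheap dependency pre-fetch step. The sketch below is not taken from the MXNet CI scripts: `run_with_retries` and `prefetch_dependencies` are hypothetical names, and the choice of `mvn dependency:go-offline` as the pre-fetch step is an assumption about what would suit this build.

```python
import subprocess
import time


def run_with_retries(run, attempts=3, base_delay=5.0, sleep=time.sleep):
    """Call run() until it returns exit code 0 or the attempts are used up.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... seconds between
    attempts and returns the exit code of the last attempt.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    code = 1
    for attempt in range(attempts):
        code = run()
        if code == 0:
            return 0
        if attempt < attempts - 1:
            # Back off before the next try so a brief outage can clear.
            sleep(base_delay * (2 ** attempt))
    return code


def prefetch_dependencies():
    # Hypothetical CI step (an assumption, not the existing MXNet setup):
    # resolve dependencies only, without compiling or running any tests.
    result = subprocess.run(["mvn", "--batch-mode", "dependency:go-offline"])
    return result.returncode
```

A CI script could call `run_with_retries(prefetch_dependencies)` once before the real build, so a reset connection costs a few seconds of retrying rather than a full rerun. Whether `dependency:go-offline` resolves everything this multi-module build needs would have to be checked.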

We could try to do something to improve the error message. Currently, it is:

Failed to execute goal on project mxnet-examples: Could not resolve dependencies for project org.apache.mxnet:mxnet-examples:jar:INTERNAL: Could not transfer artifact nu.pattern:opencv:jar:2.4.9-7 from/to central (https://repo.maven.apache.org/maven2): GET request of: nu/pattern/opencv/2.4.9-7/opencv-2.4.9-7.jar from central failed: Connection reset -> [Help 1]

I feel like this error message makes it clear enough that it was some kind of networking problem. Are you thinking of some kind of CI-specific error messaging?

ChaiBapchya commented 5 years ago

Another one - #15794 http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15794/3/pipeline/286

ChaiBapchya commented 5 years ago

@zachgk I just feel it is unreasonable to expect contributors/MXNet users to retrigger the PRs because something wasn't downloaded correctly.

Now, specific to this issue: the error says the GET request failed. It would be great if it also suggested a way to solve it (currently, retriggering CI; hopefully in the future it auto-corrects itself).

But going forward, we need to make CI robust against connection failures.
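The CI-specific message discussed here could be as small as a log scan that recognizes this failure and tells the contributor what to do. The following is a minimal sketch under stated assumptions: `transient_download_hint` is a hypothetical helper, and the two patterns are only the substrings visible in the Maven error quoted in this issue, so other network failures would need their own markers.

```python
import re

# Both patterns are copied from the Maven error quoted in this issue.
TRANSFER_FAILURE = re.compile(r"Could not transfer artifact (\S+)")
# Only the marker seen in this issue; extend the tuple for other network errors.
NETWORK_MARKERS = ("Connection reset",)


def transient_download_hint(log_text):
    """Return a hint if the log shows an artifact download that failed on the
    network, otherwise return None."""
    for line in log_text.splitlines():
        match = TRANSFER_FAILURE.search(line)
        if match and any(marker in line for marker in NETWORK_MARKERS):
            return (
                "Download of %s failed with a network error. "
                "This is likely unrelated to the PR; retrigger the CI job."
                % match.group(1)
            )
    return None
```

Run against the log above, this would name `nu.pattern:opencv:jar:2.4.9-7` as the artifact and point to retriggering, which is currently the only remedy mentioned in this thread.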

ChaiBapchya commented 5 years ago

Another one - #15881

http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-15881/8/pipeline/286

zachgk commented 5 years ago

@perdasilva Any idea why the CI might be having problems connecting to Maven?

ChaiBapchya commented 4 years ago

Another one for #16722 http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/mxnet-validation%2Funix-cpu/detail/PR-16722/5/pipeline