apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Instructions for interactive shell with the test environment on CI for reproducing and debugging tests #18723

Open szha opened 4 years ago

szha commented 4 years ago

Description

Currently, the instructions for getting an interactive shell in the CI test environment to reproduce and debug tests are not documented on cwiki or the website. The process involves:

  1. Observe the failed pipeline and examine the log to find the test image.

  2. Find the Jenkins file in ci/jenkins that corresponds to the failed pipeline. In this case it's Jenkinsfile_centos_gpu.

  3. Find the failed test step in the Jenkins file, and the build step that produces the binary it uses. In this case the test step is https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/ci/jenkins/Jenkinsfile_centos_gpu#L44 and its corresponding build step is https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/ci/jenkins/Jenkinsfile_centos_gpu#L39

  4. Go to ci/jenkins/Jenkins_steps.groovy and find the definition of that build step: https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/ci/jenkins/Jenkins_steps.groovy#L733-L745 Note down the docker name and the runtime function it invokes. In this case it's https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/ci/jenkins/Jenkins_steps.groovy#L739

  5. In the same file, find the test step: https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/ci/jenkins/Jenkins_steps.groovy#L1014-L1025 Note down the docker name and the runtime function it invokes. In this case it's https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/ci/jenkins/Jenkins_steps.groovy#L1020

  6. Trigger the build of the tested binary using the docker name and runtime function noted in step 4. In this case the command should be: ci/build.py --platform centos7_gpu_cu102 /work/runtime_functions.sh build_static_libmxnet cu102

  7. Wait for the build to complete. Afterwards, to run the complete test suite, run the test step command: ci/build.py --platform centos7_gpu_cu102 /work/runtime_functions.sh cd_unittest_ubuntu cu102

  8. To launch an interactive shell, assuming that you are in the root folder of the mxnet git checkout, use the correct docker image name and run: docker run -it --rm --gpus all -v $PWD:/work/mxnet mxnetci/build.centos7_gpu_cu102 /bin/bash

  9. Finally, inside the shell, run the commands you need from ci/docker/runtime_functions.sh. In this case they are https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/ci/docker/runtime_functions.sh#L889-L927
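
The commands from steps 6 to 8 can be collected into a small script. The following is a minimal sketch and not part of the MXNet repository: the helper names ci_cmd and shell_cmd are hypothetical, and the image name pattern mxnetci/build.<platform> is an assumption based on the single centos7_gpu_cu102 example above. It only prints the commands, so they can be reviewed before being run from the root of the mxnet checkout.

```shell
#!/usr/bin/env bash
# Sketch only: prints the commands from steps 6-8 instead of running them.
set -euo pipefail

# Values taken from the centos GPU example in this issue.
# Change them to match the docker name and variant noted in steps 4 and 5.
PLATFORM="centos7_gpu_cu102"
VARIANT="cu102"

# Hypothetical helper: print the ci/build.py invocation for one runtime
# function (used for both the build step and the test step).
ci_cmd() {
  echo "ci/build.py --platform ${PLATFORM} /work/runtime_functions.sh $1 ${VARIANT}"
}

# Hypothetical helper: print the docker invocation for an interactive shell.
# Assumes the image is named mxnetci/build.<platform>, as in step 8.
shell_cmd() {
  echo "docker run -it --rm --gpus all -v \$PWD:/work/mxnet mxnetci/build.${PLATFORM} /bin/bash"
}

ci_cmd build_static_libmxnet   # step 6: build the tested binary
ci_cmd cd_unittest_ubuntu      # step 7: run the complete test suite
shell_cmd                      # step 8: open an interactive shell
```

For a different pipeline, only PLATFORM, VARIANT, and the two runtime function names should need to change, provided the image naming assumption holds for that platform.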

szha commented 4 years ago

This process is quite complex and took me quite some time to reproduce, so we should probably work on simplifying it. It's almost impossible for a newcomer to reproduce these steps just by following the code, and the process is not yet documented anywhere, so we need to document it on either cwiki or the mxnet site.