Open szha opened 4 years ago
This process is quite complex and took me considerable time to reproduce, so we should work on simplifying it. A newcomer is unlikely to be able to reproduce these steps just by following the code, and the process is not documented anywhere, so it should be documented on either cwiki or the MXNet site.
Description
Currently, the instructions for getting an interactive shell in the CI test environment, for reproducing and debugging tests, are not documented on cwiki or the website. The process involves:
Observe the failed pipeline and examine the log to find the failing test.
Find the Jenkinsfile in ci/jenkins that corresponds to the failed pipeline. In this case it's Jenkinsfile_centos_gpu.
Find the failed test step in the Jenkinsfile, and the build step that produces the binary it uses. In this case the test step is https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/ci/jenkins/Jenkinsfile_centos_gpu#L44 and its corresponding build step is https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/ci/jenkins/Jenkinsfile_centos_gpu#L39
Go to ci/jenkins/Jenkins_steps.groovy and find the definition of that build step: https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/ci/jenkins/Jenkins_steps.groovy#L733-L745 Note down the docker platform and the runtime function it invokes. In this case it's https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/ci/jenkins/Jenkins_steps.groovy#L739
In the same file, find the definition of the test step: https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/ci/jenkins/Jenkins_steps.groovy#L1014-L1025 Note down the docker platform and the runtime function it invokes. In this case it's https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/ci/jenkins/Jenkins_steps.groovy#L1020
Run the build command, using the docker platform and runtime function noted from the build step, to build the binary under test. In this case the command is:
ci/build.py --platform centos7_gpu_cu102 /work/runtime_functions.sh build_static_libmxnet cu102
Wait for the build to complete. Then, to run the complete test suite, run the test step command:
ci/build.py --platform centos7_gpu_cu102 /work/runtime_functions.sh cd_unittest_ubuntu cu102
To launch an interactive shell instead, assuming you are in the root folder of the mxnet git checkout, substitute the correct docker image name and run:
docker run -it --rm --gpus all -v $PWD:/work/mxnet mxnetci/build.centos7_gpu_cu102 /bin/bash
Finally, inside the shell, run the commands you need from ci/docker/runtime_functions.sh. In this case they are at https://github.com/apache/incubator-mxnet/blob/e2366e9102e6862416bf998af52baaa5e9c0a31b/ci/docker/runtime_functions.sh#L889-L927
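The three commands above can be gathered into one small helper script. This is only a sketch, not something that exists in the repository: the function names (run, ci_step, ci_shell) and the PLATFORM, BUILD_FN, TEST_FN, VARIANT and DRY_RUN variables are hypothetical, and deriving the image name as mxnetci/build.<platform> is an assumption generalized from the single centos7_gpu_cu102 example in this issue. By default it only prints the commands, so they can be checked before anything is built.

```shell
#!/bin/sh
# Sketch: reproduce a CI build and test run locally, then open a shell.
# Defaults are the values from this issue; override them via the environment.
set -eu

PLATFORM="${PLATFORM:-centos7_gpu_cu102}"    # docker platform noted from Jenkins_steps.groovy
BUILD_FN="${BUILD_FN:-build_static_libmxnet}" # runtime function of the build step
TEST_FN="${TEST_FN:-cd_unittest_ubuntu}"      # runtime function of the test step
VARIANT="${VARIANT:-cu102}"                   # argument passed to both functions
DRY_RUN="${DRY_RUN:-1}"                       # hypothetical switch: 1 prints, 0 executes

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Invoke one function from runtime_functions.sh through ci/build.py.
ci_step() {
  run ci/build.py --platform "$PLATFORM" /work/runtime_functions.sh "$1" "$VARIANT"
}

# Open an interactive shell in the CI image with the checkout mounted.
# Assumption: the image is named mxnetci/build.<platform>, as in the example above.
ci_shell() {
  run docker run -it --rm --gpus all -v "$PWD:/work/mxnet" "mxnetci/build.$PLATFORM" /bin/bash
}

ci_step "$BUILD_FN"   # build the binary under test
ci_step "$TEST_FN"    # run the complete test suite
ci_shell              # drop into the container for debugging
```

Run it from the root of the mxnet checkout. With DRY_RUN=0 the commands are executed in order, so the interactive shell only opens after the build and test steps have finished; to go straight to the shell, comment out the two ci_step calls.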