ROCm / tensorflow-upstream

TensorFlow ROCm port
https://tensorflow.org
Apache License 2.0

Enhancing TensorFlow CI Tests #2698

Open ScXfjiang opened 1 week ago

ScXfjiang commented 1 week ago

TensorFlow’s CI tests sometimes fail for no apparent reason, requiring a rerun. Each round takes approximately 4 hours. Are there any improvements we can make to optimize this process?

i-chaochen commented 1 week ago

The main reason is that a few CI nodes run too many unit tests at once, and the ROCm driver drops.

I think we could use the same approach as in the XLA CI and split TF's unit tests across two parallel pipelines.

Right now TF's CI (pycpp) triggers the job via a bazelrc file.

We need to think of a good way to split all the unit tests across bazelrc files (maybe create another one?), so that each individual CI pipeline can read its own and run its jobs in parallel.
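
A minimal sketch of how that split could look, assuming the test targets can be listed with `bazel query` and divided with GNU `split`; the file names, query scope, and `--config=rocm` invocation here are illustrative, not the actual TF CI configuration:

```bash
# Hypothetical sketch: enumerate the unit-test targets once, split the list
# in half, and let each CI pipeline run its own half.
bazel query 'tests(//tensorflow/...)' > all_tests.txt
split -n l/2 all_tests.txt shard_   # GNU split: produces shard_aa and shard_ab

# Pipeline 1 runs the first half:
bazel test --config=rocm -- $(cat shard_aa)

# Pipeline 2 runs the second half:
bazel test --config=rocm -- $(cat shard_ab)
```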

jayfurmanek commented 1 week ago

> We need to think of a good way to split all the unit tests across bazelrc files (maybe create another one?), so that each individual CI pipeline can read its own and run its jobs in parallel.

I'm not sure that will help. We may be able to throttle the test rate. At the moment the jobs use `--run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute`

and the env vars `TF_TESTS_PER_GPU`, `N_TEST_JOBS`, and `TF_GPU_COUNT` are used there. The idea is to run one test per GPU. It checks rocm-smi to get the GPU count. There have been changes in the rocm-smi output lately, but I have fixed the scripts to handle them.

jayfurmanek commented 1 week ago

Maybe we throttle it to use fewer GPUs?
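
A minimal sketch of what such throttling could look like, assuming the `parallel_gpu_execute` wrapper honors these environment variables when they are set explicitly; the specific values and the target pattern are illustrative, not the current CI settings:

```bash
# Hypothetical throttle: cap how many GPUs and concurrent tests the wrapper uses,
# so fewer tests hit the ROCm driver at once.
export TF_GPU_COUNT=4        # assumed cap: use only 4 of the node's GPUs
export TF_TESTS_PER_GPU=1    # one test per GPU at a time
export N_TEST_JOBS=$((TF_GPU_COUNT * TF_TESTS_PER_GPU))

bazel test --config=rocm \
  --run_under=//tensorflow/tools/ci_build/gpu_build:parallel_gpu_execute \
  --local_test_jobs="${N_TEST_JOBS}" \
  -- //tensorflow/python/...
```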

i-chaochen commented 1 week ago

There is another issue: our TensorFlow CI takes a long time to finish all the unit tests. We think we could split the unit tests in half across 2 parallel CI jobs to reduce the CI time.