ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

Investigating CI Instability #1147

Open atamazov opened 3 years ago

atamazov commented 3 years ago

This is an umbrella ticket intended to collect and analyze information related to the CI instability that we currently observe.

A list of tickets that may affect the CI stability

Findings from CI runs launched in a row

atamazov commented 3 years ago

A list of tickets that may affect the CI stability

atamazov commented 3 years ago

Statistics of several CI runs launched in a row

Runs 287-289 are from the `develop` branch; the others are from `ci-instability-investigation`.

| CI# | Re-run of | Result | Failing stage | Node | Failing test | Error message | Reboot after |
| -- | -- | -- | -- | -- | -- | -- | -- |
| 287 | c636bf2 | :black_circle: | Fp32 Hip Debug COMGR | ixt-rack-55 | test_tensor_trans | MAF * | ? |
| 288 | 287 | :red_circle: | Fp16 Hip MLIR gfx908 | v340i-2 * | (build) | RCF * | F |
| 289 | 288 | :green_circle: | | | | | |
| 1 | c636bf2 | :red_circle: | Fp16 Hip All Install gfx908 | v340i-1 | (build) | RCF | F |
| 2 | 1 | Terminated | | | | | |
| 3 | 8b2f260 | :green_circle: | | | | | |
| 4 | 8b2f260 | :green_circle: | | | | | |
| 5 | 8b2f260 | :green_circle: | | | | | |
| 6 | 8b2f260 | :green_circle: | | | | | |
| 7 | 8b2f260 | :green_circle: | | | | | |
| :warning: | :warning: | :warning: | JENKINS UPGRADE | :warning: | :warning: | :warning: | :warning: |
| 8 | 8b2f260 | :red_circle: | | | (build) | download failure (boost) | |
| 9 | 8b2f260 | :green_circle: | | | | | |
| 10 | 8b2f260 | :green_circle: | | | | | |
| 11 | 8b2f260 | :green_circle: | | | | | |

- Abbreviations:
  - v340i-2 = rocm-frameworks-v340i-2.amd.com
  - MAF = Memory access fault
  - RCF = Cannot contact... hudson.remoting.ChannelClosingException... Remote call failed
  - ? = Unknown

develop branch

| CI# | commit | Res | Failing stage | GPU | BE | Node | Failing test | Error message |
| -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 311 | 068bb6b1 | :green_circle: | - | | | - | - | - |
| 312 | 502df2b | :red_circle: | Fp32 Debug /opt/rocm | MI100 | Hip | trex-vg-20 | (build) | Fetch |
| 313 | R 312 | :red_circle: | Fp32 Install All | Vega20 | OCL | prj47-rack-91 | test_soft_max | FAILED: 0.0442317 |
| 314 | R 313 | :red_circle: | Bf16 Install All | gfx90a | Hip | dell113-r01-u31-32 | (build) | GPU detection |
| 315 | R 314 | :black_circle: | - | - | | - | - | Aborted by user |
| 316 | R 313 | :green_circle: | - | - | | - | - | - |
| 317 | 2509c87 | :red_circle: | Fp16 /opt/rocm | gfx90a | Hip | dell113-r01-u05-06 | (build) | GPU detection |
| 318 | 1a52c62 | :black_circle: | Fp32 /opt/rocm | Vega10 | Hip | prj47-rack-19 | test_? | Timeout 5 hours |
| 319 | 3b1b3ac | :red_circle: | Fp32 Debug /opt/rocm | gfx90a | Hip | dell113-r01-u31-32 | (build) | GPU detection |
| 320 | 2d7b2dd | :red_circle: | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv_group | FRDI * |
| 321..322 | | :green_circle: | - | - | | - | - | - |
| 323 | 030a369 | :black_circle: | Fp32 MLIR | Vega10 | Hip | prj47-rack-19 | test_? | Timeout 5 hours |
| 324 | R 323 | :green_circle: | - | - | | - | - | - |
| 325 | 1a1df9f | :black_circle: | Fp32 Debug | Vega10 | Hip | prj47-rack-09 | clinfo | Timeout 5 minutes |
| 326 | R 325 | :red_circle: | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv_for_implicit_gemm | FRDI * |
| 327 | R 326 | :red_circle: | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv_group | FRDI * |
| 328..329 | | :green_circle: | - | - | | - | - | - |
| 330 | bc464d5 | :red_circle: | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv3d | FRDI |
| 331..334 | | :green_circle: | - | - | | - | - | - |
| 335 | f21cdc1 | :red_circle: | Bf16 Install | gfx908 | Hip | pytorch-vg20-1 | (REBOOT SLAVES!) | connect timed out |
| 336 | 886fc21 | :red_circle: | Fp16 MLIR | gfx908 | Hip | MI100-5 | (REBOOT SLAVES!) | api.github.com |
| 337..338 | | :green_circle: | - | - | | - | - | - |
| 339 | 568c2e4 | :red_circle: | - | - | | hpe-rack-16 | docker build | (boost download) |
| 340 | R 339 | :red_circle: | - | - | | hpe-rack-16 | docker build | (boost download) |
| 341 | R 340 | :black_circle: | - | - | | - | - | (aborted by user) |
| 342 | R 341 | :red_circle: | Fp32 Debug + Codecov | Vega20 | OCL | prj47-rack-91 | test_sqlite_perfdb | runtime error: index 624 out of bounds... |
| 343 | R 342 | :red_circle: | Bf16 /opt/rocm | gfx90a | Hip | dell113-r01-u31-32 | build | GPU detection |
| 344..347 | | :green_circle: | - | - | | - | - | - |
| 348 | 56215d6 | :red_circle: | Int8 All | Vega20 /64 | Hip | ixt-rack-54 | test_tensor_vec | Iteration: 24 \ Mismatch at 4736223: ! != ) |
| 349 | R 348 | :red_circle: | Fp16 All Install | gfx908 | Hip | MI100-5 | test_regression_half_mi100 | (error as expected) |
| 350 | 12e52ed | :red_circle: | Fp32 All | gfx908 | Hip | v340l-3 | test_conv_group | (none - INTERNAL ERROR?) |
| 351..360 | - | :green_circle: | - | - | | - | - | - |
| 361 | e0ded03 | :red_circle: | Fp32 | gfx90a | OpenCL | hpe-rack-16 | build | GPU detection |
| 362 | 8498875 | :red_circle: | Fp32 | gfx90a | OpenCL | hpe-rack-15 | build | GPU detection |
| 363 | 8498875 | :black_circle: | - | - | - | - | - | (aborted by Jun) |
| 364 | e0ded03 | :black_circle: | - | - | - | - | - | (aborted by Jun) |
| 365 | R 363 | :green_circle: | - | - | | - | - | - |
| 366 | R 364 | :green_circle: | - | - | | - | - | - |
| 367 | f091329 | :red_circle: | Fp32 | gfx90a | Hip | dell113-r01-u31-32 | build | GPU detection |
| 368 | R 367 | :green_circle: | - | - | | - | - | - |
| 369..373 | - | :black_circle: | - | - | | - | - | (aborted by Artem) |
| 374 | R 373 | :green_circle: | - | - | | - | - | - |
| 375 | d4f48bd | :red_circle: | Fp32 | gfx908 | OCL | pytorch-vg20-1 | mlir testing | (tests broken) |
| 376 | 0a095af | :red_circle: | Fp32 | gfx908 | OCL | ixt-sjc2-11 | (REBOOT SLAVES!) | n/a |
| 377 | R 376 | :green_circle: | - | - | | - | - | - |
| 378 | 5cb2e54 | :red_circle: | Fp32 All | gfx1030 | OCL | ixt-sjc2-16 | docker build | permission denied... Docker daemon socket |
| 379 | R 378 | :green_circle: | - | - | | - | - | - |

junliume commented 3 years ago

@atamazov I have temporarily taken the following two gfx908 nodes offline, because RCF issues occur on them too often: rocm-frameworks-v340i-1.amd.com and rocm-frameworks-v340i-2.amd.com

atamazov commented 3 years ago

@junliume https://github.com/ROCmSoftwarePlatform/frameworks-internal/issues/285 created

atamazov commented 3 years ago

@junliume Please do NOT change the ci-instability-investigation branch. The idea is to run the same good commit several times in a row. ~I will revert the changes.~

atamazov commented 3 years ago

@junliume Aha, I see. You would like to exclude Full gfx1030 tests. Okay. Let's keep this.

atamazov commented 2 years ago

Testing on Vega10 (prj47-rack-19) often ends with a 5-hour timeout (see runs 318 and 323 at https://github.com/ROCmSoftwarePlatform/MIOpen/issues/1147#issuecomment-916188676). prj47-rack-09 also ran into a timeout (5 minutes) with clinfo (run 325).

atamazov commented 2 years ago

The ticket has been reorganized; all TODO items are now collected in the topmost comment.

junliume commented 2 years ago

Just to report a few more data points:

- Node: MI100-4, rocm-framework-v340i-2.amd.com
- Error: Failed to run image 'miopen'. Error: docker: Error response from daemon: Unable to find group render.
- Cause: I guess the `render` group is missing from the base OS groups?

@okakarpa: I marked these two nodes offline in Jenkins. This should be an easy fix to get them back online. Thank you! I could not find the login info for these nodes, or else I would fix it directly :)

atamazov commented 2 years ago

https://github.com/ROCmSoftwarePlatform/MIOpen/issues/1147#issuecomment-935323707 moved to https://github.com/ROCmSoftwarePlatform/frameworks-internal/issues/294

atamazov commented 2 years ago

"Statistics of several CI runs launched in a row" updated to cover builds from 352 to 379. There are no new systematically reproducible problems.

atamazov commented 2 years ago

| CI# | commit | Res | Failing stage | GPU | BE | Node | Failing test | Error message |
| -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 380 | | :green_circle: | - | - | | - | - | - |
| 381 | | :red_circle: | Fp32 All | gfx90a | Hip | dell113-r01-u27-28 | test_soft_max | FAILED: 0.0283859 |
| 382 | | :green_circle: | - | - | | - | - | - |
| 383..386 | | :red_circle: | - | - | - | - | docker build | pcre download issue |
| 387 | | :black_circle: | - | - | | - | - | (aborted by Artem) |
| 388 | | :green_circle: | - | - | | - | - | - |
| 389 | | :red_circle: | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_gpu_reference_kernel | MAF |
| 390 | R 389 | :green_circle: | - | - | | - | - | - |
| 391 | | :red_circle: | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_gpu_reference_kernel | MAF |
| 392 | R 391 | :black_circle: | Fp32 All Install | gfx1030 | Hip | ixt-sjc2-17 | test_? | Timeout 5 hours |
| 393..394 | | :green_circle: | - | - | | - | - | - |
| 395..396 | | :red_circle: | - | - | | - | - | git fetch... Permission denied |
| 397 | R 396 | :green_circle: | - | - | | - | - | - |
| 398..400 | | :green_circle: | - | - | | - | - | - |
| 401 | | :red_circle: | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-17 | test_gpu_reference_kernel | MAF |
| 402 | R 401 | :green_circle: | - | - | | - | - | - |
| 403 | | :red_circle: | Fp32 All Install | Vega20 | OCL | prj47-rack-91 | test_soft_max | FAILED: 0.01854 |
| 404 | R 403 | :red_circle: | Fp32 All Xnack+ | gfx90a | Hip | hpe-rack-16 | test_soft_max | FAILED: 0.00613016 |
| 405 | R 404 | :green_circle: | - | - | | - | - | - |
| 406 | | :red_circle: | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_gpu_reference_kernel | MAF |
| 407 | R 406 | :green_circle: | - | - | | - | - | - |
| 408 | | :red_circle: | Fp32 | gfx90a | OCL | hpe-rack-15 | MIOpenDriver build | RCF |
| 409 | R 408 | :red_circle: | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_gpu_reference_kernel | MAF |
| 410 | | :black_circle: | - | - | - | - | - | (aborted by Jun) |
| 411 | R 409 | :red_circle: | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_find_db | MAF |
| 412 | R 411 | :red_circle: | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_gpu_reference_kernel | MAF |
| 413 | R 412 | :green_circle: | - | - | | - | - | - |
| 414 | R 410 | :cyclone: | - | - | | - | - | - |

atamazov commented 2 years ago

14 successful runs out of 34 (41%)
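
The figure can be reproduced from the table of runs 380-413 above. The following is only a sketch under stated assumptions: ranges such as "383..386" are expanded to individual runs, aborted and timed-out runs count as non-successful, and run 414 (still in progress) is excluded. The set names and the `success_rate` helper are illustrative and not part of any CI tooling.

```python
# Results for develop runs 380-413, transcribed from the table above.
# Assumption: ranges like "393..394" are expanded to individual run numbers.
GREEN = {380, 382, 388, 390, 393, 394, 397, 398, 399, 400, 402, 405, 407, 413}
ABORTED_OR_TIMED_OUT = {387, 392, 410}  # :black_circle: rows
ALL_RUNS = set(range(380, 414))         # 380..413 inclusive; 414 was still running

# Everything that is neither green nor aborted/timed out is a red run.
FAILED = ALL_RUNS - GREEN - ABORTED_OR_TIMED_OUT


def success_rate(green, all_runs):
    """Fraction of runs in all_runs that finished green."""
    return len(green & all_runs) / len(all_runs)


rate = success_rate(GREEN, ALL_RUNS)
print(f"{len(GREEN)} successful runs out of {len(ALL_RUNS)} ({rate:.0%})")
```

Counting aborted runs in the denominator is a choice; excluding them would give a somewhat higher rate.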

junliume commented 2 years ago

@okakarpa ixt-sjc2-11 (navi21) Nov 18, 2021 4:17:29 PM Disconnected: Memory access fault on develop run #409, #411, #412
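
To see which nodes stand out, the failures in the table of runs 380-413 above can be grouped by node. This is a minimal sketch: the `FAILURES` list is transcribed by hand from that table (only rows that name a node, with error labels shortened), and the function names and the `threshold` value are hypothetical choices for illustration.

```python
from collections import Counter

# (run, node, error) for failed or timed-out runs 380-413 that name a node,
# transcribed from the table above.
FAILURES = [
    (381, "dell113-r01-u27-28", "FAILED"),
    (389, "ixt-sjc2-11", "MAF"),
    (391, "ixt-sjc2-11", "MAF"),
    (392, "ixt-sjc2-17", "Timeout"),
    (401, "ixt-sjc2-17", "MAF"),
    (403, "prj47-rack-91", "FAILED"),
    (404, "hpe-rack-16", "FAILED"),
    (406, "ixt-sjc2-11", "MAF"),
    (408, "hpe-rack-15", "RCF"),
    (409, "ixt-sjc2-11", "MAF"),
    (411, "ixt-sjc2-11", "MAF"),
    (412, "ixt-sjc2-11", "MAF"),
]


def failures_by_node(failures):
    """Count how many failed runs landed on each node."""
    return Counter(node for _run, node, _error in failures)


def suspects(failures, threshold=3):
    """Nodes with at least `threshold` failures (threshold is an arbitrary cutoff)."""
    counts = failures_by_node(failures)
    return [node for node, n in counts.most_common() if n >= threshold]


for node, n in failures_by_node(FAILURES).most_common():
    print(f"{node}: {n}")
print("suspect nodes:", suspects(FAILURES))
```

With this data, ixt-sjc2-11 accounts for six of the twelve node-attributed failures, all of them memory access faults.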

atamazov commented 2 years ago

| CI# | commit | Res | Failing stage | GPU | BE | Node | Failing test | Error message |
| -- | -- | -- | -- | -- | -- | -- | -- | -- |
| 415..417 | | :green_circle: | - | - | | - | - | - |
| 418 | | :red_circle: | Fp32 | Vega20 /64 | Hip | ixt-rack-54 | test_gpu_nchw_nhwc_transpose | Core dumped |
| 419 | R 418 | :green_circle: | - | - | | - | - | - |
| 420 | | :green_circle: | - | - | | - | - | - |
| 421 | | :red_circle: | - | - | | ixt-hq-35 | docker build | Connection timed out (sourceforge) |
| 422 | R 421 | :black_circle: | - | - | | - | - | (aborted by Artem) |
| 423..424 | | :green_circle: | - | - | | - | - | - |
| 425 | | :red_circle: | - | - | | - | - | git checkout error after force-push |
| 426 | | :running_man: | - | - | | - | - | - |

All faults seem either random, or expected, or already resolved.