Open atamazov opened 3 years ago
develop branch

CI# | commit | Res | Failing stage | GPU | BE | Node | Failing test | Error message |
---|---|---|---|---|---|---|---|---|
311 | 068bb6b1 | :green_circle: | - | - | - | - | ||
312 | 502df2b | :red_circle: | Fp32 Debug /opt/rocm | MI100 | Hip | trex-vg-20 | (build) | Fetch |
313 | R 312 | :red_circle: | Fp32 Install All | Vega20 | OCL | prj47-rack-91 | test_soft_max | FAILED: 0.0442317 |
314 | R 313 | :red_circle: | Bf16 Install All | gfx90a | Hip | dell113-r01-u31-32 | (build) | GPU detection |
315 | R 314 | :black_circle: | - | - | - | - | Aborted by user | |
316 | R 313 | :green_circle: | - | - | - | - | - | |
317 | 2509c87 | :red_circle: | Fp16 /opt/rocm | gfx90a | Hip | dell113-r01-u05-06 | (build) | GPU detection |
318 | 1a52c62 | :black_circle: | Fp32 /opt/rocm | Vega10 | Hip | prj47-rack-19 | test_? | Timeout 5 hours |
319 | 3b1b3ac | :red_circle: | Fp32 Debug /opt/rocm | gfx90a | Hip | dell113-r01-u31-32 | (build) | GPU detection |
320 | 2d7b2dd | :red_circle: | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv_group | FRDI * |
321.. ..322 | :green_circle: | - | - | - | - | - | ||
323 | 030a369 | :black_circle: | Fp32 MLIR | Vega10 | Hip | prj47-rack-19 | test_? | Timeout 5 hours |
324 | R 323 | :green_circle: | - | - | - | - | - | |
325 | 1a1df9f | :black_circle: | Fp32 Debug | Vega10 | Hip | prj47-rack-09 | clinfo | Timeout 5 minutes |
326 | R 325 | :red_circle: | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv_for_implicit_gemm | FRDI * |
327 | R 326 | :red_circle: | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv_group | FRDI * |
328.. ..329 | :green_circle: | - | - | - | - | - | ||
330 | bc464d5 | :red_circle: | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv3d | FRDI |
331.. ..334 | :green_circle: | - | - | - | - | - | ||
335 | f21cdc1 | :red_circle: | Bf16 Install | gfx908 | Hip | pytorch-vg20-1 | (REBOOT SLAVES!) | connect timed out |
336 | 886fc21 | :red_circle: | Fp16 MLIR | gfx908 | Hip | MI100-5 | (REBOOT SLAVES!) | api.github.com |
337.. ..338 | :green_circle: | - | - | - | - | - | ||
339 | 568c2e4 | :red_circle: | - | - | - | hpe-rack-16 | docker build | (boost download) |
340 | R 339 | :red_circle: | - | - | - | hpe-rack-16 | docker build | (boost download) |
341 | R 340 | :black_circle: | - | - | - | - | (aborted by user) | |
342 | R 341 | :red_circle: | Fp32 Debug + Codecov | Vega20 | OCL | prj47-rack-91 | test_sqlite_perfdb | runtime error: index 624 out of bounds... |
343 | R 342 | :red_circle: | Bf16 /opt/rocm | gfx90a | Hip | dell113-r01-u31-32 | build | GPU detection |
344.. ..347 | :green_circle: | - | - | - | - | - | ||
348 | 56215d6 | :red_circle: | Int8 All | Vega20 /64 | Hip | ixt-rack-54 | test_tensor_vec | Iteration: 24 \ Mismatch at 4736223: ! != ) |
349 | R 348 | :red_circle: | Fp16 All Install | gfx908 | Hip | MI100-5 | test_regression_half_mi100 | (error as expected) |
350 | 12e52ed | :red_circle: | Fp32 All | gfx908 | Hip | v340l-3 | test_conv_group | (none - INTERNAL ERROR?) |
351.. ..360 | - | :green_circle: | - | - | - | - | - | |
361 | e0ded03 | :red_circle: | Fp32 | gfx90a | OpenCL | hpe-rack-16 | build | GPU detection |
362 | 8498875 | :red_circle: | Fp32 | gfx90a | OpenCL | hpe-rack-15 | build | GPU detection |
363 | 8498875 | :black_circle: | - | - | - | - | - | (aborted by Jun) |
364 | e0ded03 | :black_circle: | - | - | - | - | - | (aborted by Jun) |
365 | R 363 | :green_circle: | - | - | - | - | - | |
366 | R 364 | :green_circle: | - | - | - | - | - | |
367 | f091329 | :red_circle: | Fp32 | gfx90a | Hip | dell113-r01-u31-32 | build | GPU detection |
368 | R 367 | :green_circle: | - | - | - | - | - | |
369.. ..373 | - | :black_circle: | - | - | - | - | (aborted by Artem) | |
374 | R 373 | :green_circle: | - | - | - | - | - | |
375 | d4f48bd | :red_circle: | Fp32 | gfx908 | OCL | pytorch-vg20-1 | mlir testing | (tests broken) |
376 | 0a095af | :red_circle: | Fp32 | gfx908 | OCL | ixt-sjc2-11 | (REBOOT SLAVES!) | n/a |
377 | R 376 | :green_circle: | - | - | - | - | - | |
378 | 5cb2e54 | :red_circle: | Fp32 All | gfx1030 | OCL | ixt-sjc2-16 | docker build | permission denied... Docker daemon socket |
379 | R 378 | :green_circle: | - | - | - | - | - |
@atamazov I have temporarily brought down the following two gfx908 nodes, because the RCF issues are happening on them too often: rocm-frameworks-v340i-1.amd.com and rocm-frameworks-v340i-2.amd.com
@junliume https://github.com/ROCmSoftwarePlatform/frameworks-internal/issues/285 created
@junliume Please do NOT change the ci-instability-investigation branch. The idea is to run the same good commit several times in a row. ~I will revert the changes.~
@junliume Aha, I see. You would like to exclude Full gfx1030 tests. Okay. Let's keep this.
Testing on Vega10 (prj47-rack-19) often ends with a 5-hour timeout (see runs 318 and 323 at https://github.com/ROCmSoftwarePlatform/MIOpen/issues/1147#issuecomment-916188676). prj47-rack-09 also ran into a timeout (5 minutes) with clinfo (run 325).
The ticket has been reorganized; all TODO items are now collected in the topmost comment.
Just to report a few more data points:
Node: MI100-4 rocm-framework-v340i-2.amd.com Error: Failed to run image 'miopen'. Error: docker: Error response from daemon: Unable to find group render. Cause: I guess the render group is missing from the base OS?
@okakarpa: I marked these two nodes offline in Jenkins. It should be an easy fix to get them back online. Thank you! I could not find the login info for these nodes, or else I would have done it directly :)
"Statistics of several CI runs launched in a row" updated to cover builds from 352 to 379. There are no new systematically reproducible problems.
CI# | commit | Res | Failing stage | GPU | BE | Node | Failing test | Error message |
---|---|---|---|---|---|---|---|---|
380 | :green_circle: | - | - | - | - | - | ||
381 | :red_circle: | Fp32 All | gfx90a | Hip | dell113-r01-u27-28 | test_soft_max | FAILED: 0.0283859 | |
382 | :green_circle: | - | - | - | - | - | ||
383.. ..386 | :red_circle: | - | - | - | - | docker build | pcre download issue | |
387 | :black_circle: | - | - | - | - | (aborted by Artem) | ||
388 | :green_circle: | - | - | - | - | - | ||
389 | :red_circle: | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_gpu_reference_kernel | MAF | |
390 | R 389 | :green_circle: | - | - | - | - | - | |
391 | :red_circle: | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_gpu_reference_kernel | MAF | |
392 | R 391 | :black_circle: | Fp32 All Install | gfx1030 | Hip | ixt-sjc2-17 | test_? | Timeout 5 hours |
393.. ..394 | :green_circle: | - | - | - | - | - | ||
395.. ..396 | :red_circle: | - | - | - | - | git fetch... Permission denied | ||
397 | R 396 | :green_circle: | - | - | - | - | - | |
398.. ..400 | :green_circle: | - | - | - | - | - | ||
401 | :red_circle: | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-17 | test_gpu_reference_kernel | MAF | |
402 | R 401 | :green_circle: | - | - | - | - | - | |
403 | :red_circle: | Fp32 All Install | Vega20 | OCL | prj47-rack-91 | test_soft_max | FAILED: 0.01854 | |
404 | R 403 | :red_circle: | Fp32 All Xnack+ | gfx90a | Hip | hpe-rack-16 | test_soft_max | FAILED: 0.00613016 |
405 | R 404 | :green_circle: | - | - | - | - | - | |
406 | :red_circle: | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_gpu_reference_kernel | MAF | |
407 | R 406 | :green_circle: | - | - | - | - | - | |
408 | :red_circle: | Fp32 | gfx90a | OCL | hpe-rack-15 | MIOpenDriver build | RCF | |
409 | R 408 | :red_circle: | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_gpu_reference_kernel | MAF |
410 | :black_circle: | - | - | - | - | - | (aborted by Jun) | |
411 | R 409 | :red_circle: | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_find_db | MAF |
412 | R 411 | :red_circle: | Fp32 Debug | gfx1030 | OCL | ixt-sjc2-11 | test_gpu_reference_kernel | MAF |
413 | R 412 | :green_circle: | - | - | - | - | - | |
414 | R 410 | :cyclone: | - | - | - | - | - |
@okakarpa ixt-sjc2-11 (navi21) Nov 18, 2021 4:17:29 PM Disconnected: Memory access fault on develop runs #409, #411, #412
CI# | commit | Res | Failing stage | GPU | BE | Node | Failing test | Error message |
---|---|---|---|---|---|---|---|---|
415.. ..417 | :green_circle: | - | - | - | - | - | ||
418 | :red_circle: | Fp32 | Vega20 /64 | Hip | ixt-rack-54 | test_gpu_nchw_nhwc_transpose | Core dumped | |
419 | R 418 | :green_circle: | - | - | - | - | - | |
420 | :green_circle: | - | - | - | - | - | ||
421 | :red_circle: | - | - | ixt-hq-35 | docker build | Connection timed out (sourceforge) | ||
422 | R 421 | :black_circle: | - | - | - | - | (aborted by Artem) | |
423.. ..424 | :green_circle: | - | - | - | - | - | ||
425 | :red_circle: | - | - | - | - | git checkout error after force-push | ||
426 | :running_man: | - | - | - | - | - |
All faults seem either random, or expected, or already resolved.
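
To back a claim like this with numbers, the table rows can be tallied per node and per error message. Below is a minimal sketch, assuming rows are pasted in the nine-column layout of the table header; the sample rows are copied from the first table above, and the field names and helpers (`parse_rows`, `failures_by`) are hypothetical, not part of any MIOpen CI tooling.

```python
from collections import Counter

# Hypothetical field names, following the table header order:
# CI# | commit | Res | Failing stage | GPU | BE | Node | Failing test | Error message
FIELDS = ("ci", "commit", "result", "stage", "gpu", "backend", "node", "test", "error")

# Sample rows copied from the first table in this ticket.
ROWS = """
313 | R 312 | :red_circle: | Fp32 Install All | Vega20 | OCL | prj47-rack-91 | test_soft_max | FAILED: 0.0442317 |
316 | R 313 | :green_circle: | - | - | - | - | - | |
317 | 2509c87 | :red_circle: | Fp16 /opt/rocm | gfx90a | Hip | dell113-r01-u05-06 | (build) | GPU detection |
319 | 3b1b3ac | :red_circle: | Fp32 Debug /opt/rocm | gfx90a | Hip | dell113-r01-u31-32 | (build) | GPU detection |
320 | 2d7b2dd | :red_circle: | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv_group | FRDI * |
326 | R 325 | :red_circle: | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv_for_implicit_gemm | FRDI * |
327 | R 326 | :red_circle: | Fp32 All | Vega20 /64 | Hip | ixt-rack-54 | test_conv_group | FRDI * |
"""


def parse_rows(text):
    """Split pipe-separated rows into dicts keyed by FIELDS."""
    records = []
    for line in text.strip().splitlines():
        cells = [cell.strip() for cell in line.split("|")]
        # Pad short rows and drop the empty cell produced by a trailing pipe.
        cells = (cells + [""] * len(FIELDS))[:len(FIELDS)]
        records.append(dict(zip(FIELDS, cells)))
    return records


def failures_by(records, key):
    """Count failed (red) runs grouped by the given field."""
    return Counter(r[key] for r in records if r["result"] == ":red_circle:")


records = parse_rows(ROWS)
by_node = failures_by(records, "node")
by_error = failures_by(records, "error")

for node, count in by_node.most_common():
    print(f"{node}: {count}")
for error, count in by_error.most_common():
    print(f"{error}: {count}")
```

A node or error message that dominates the counts is a candidate for a systematic problem; evenly spread single failures look more like random faults. Rows with collapsed or missing cells (the range rows, for example) would need to be normalized by hand before they are fed in.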
This is an umbrella ticket intended to collect and analyze information related to the CI instability that we currently observe.
A list of tickets that may affect the CI stability
Findings from CI runs launched in a row