ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

ctest race condition when used with the --verbose option #414

Open jerryyin opened 4 years ago

jerryyin commented 4 years ago

The symptom looks like the failure in this CI log. Both Fiji GCC Debug and Clang Debug failed, with 5 tests failing in total:

    9 - test_cbna_inference (Failed)
    31 - test_main (Failed) 
    33 - test_na_inference (Failed)
    34 - test_na_train (Failed)
    50 - test_tensor_test (Failed) 

Failure signatures vary quite a lot, ranging from indexing errors to corrupted descriptor memory. See the CI log for details.

This is a race condition because none of the above tests fail if run individually. They fail only when run together via make check.

To reproduce, change the check target from

    add_custom_target(check COMMAND ${CMAKE_CTEST_COMMAND} --output-on-failure -C ${CMAKE_CFG_INTDIR})

to

    add_custom_target(check COMMAND MIOPEN_ENABLE_LOGGING=1 MIOPEN_LOG_LEVEL=5 ${CMAKE_CTEST_COMMAND} --verbose -C ${CMAKE_CFG_INTDIR})
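
The VAR=value prefix in a COMMAND only takes effect when the build tool runs the command through a POSIX shell. Where that is not the case, the same logging settings can be passed with cmake -E env instead. A minimal sketch, assuming a separate target named check_verbose (a hypothetical name, not something MIOpen's build files define):

```cmake
# Sketch only: `check_verbose` is a hypothetical target name used for illustration.
# `cmake -E env` sets the listed variables for the child process, so the
# command does not depend on shell-style VAR=value prefixes being understood.
add_custom_target(check_verbose
    COMMAND ${CMAKE_COMMAND} -E env
            MIOPEN_ENABLE_LOGGING=1
            MIOPEN_LOG_LEVEL=5
            ${CMAKE_CTEST_COMMAND} --verbose -C ${CMAKE_CFG_INTDIR}
)
```

Keeping this as a second target leaves the original check target untouched, so the two configurations can be run against the same build tree and compared.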

daniellowell commented 4 years ago

@jerryyin Is this a real issue?

jerryyin commented 4 years ago

@daniellowell This is the one that blocked me for a couple of days and made me doubt whether my own code change had caused the failures above, until I synced with @JehandadKhan and decided to try something else. Then I realized it is a timing-related race condition.

I'd suggest keeping the issue open because it is reproducible, though only under very strict conditions. (It has to be the commit I recorded in the reproduction instructions; I can't promise that you can reproduce it on the latest develop.) I tend to believe the --verbose option altered the timing of each command run and exposed the issue. Given the nature of race conditions, it might be much harder to debug, or even to get started on, should it later be discovered in a different context.

atamazov commented 4 years ago

@jerryyin

> The symptom looks like the failure in this CI log.

The CI logs are not persistent. Please attach a copy of the log to the ticket.

jerryyin commented 4 years ago

Clang Debug.txt

Fiji GCC Debug.txt

ppanchad-amd commented 8 months ago

@jerryyin Please try the latest ROCm release, 6.0.2, to see if your issue has been resolved. If it has, please close the ticket. Thanks.