Open jerryyin opened 4 years ago
@jerryyin Is this a real issue?
@daniellowell This is the one that blocked me for a couple of days and made me doubt whether my own code change had caused the issue above, until I synced with @JehandadKhan and decided to try something else. Then I realized it is a timing-related race condition.
I'd suggest keeping the issue open because it is reproducible only under very strict conditions. (It has to be the commit I recorded in the reproduce instructions; I don't promise that you can reproduce it on the latest develop.) I tend to believe the --verbose option altered the timing of each command run and exposed the issue. Due to the nature of race conditions, it might be much harder to debug, or even to get started on, should it later be discovered in a different context.
@jerryyin
"The symptom looks like the failure in this CI log."
The CI logs are not persistent. Please attach a copy of the log to the ticket.
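One way to keep a copy of the output for attaching to the ticket is to write it to a file while still showing it. The sketch below is only an illustration: the function name `run_and_log` and the log file name in the usage comment are hypothetical, not part of MIOpen or the CI setup.

```shell
#!/bin/sh
# run_and_log LOGFILE CMD [ARGS...]
# Runs CMD, saves its combined stdout/stderr to LOGFILE, prints the log,
# and returns CMD's exit status. (Hypothetical helper, for illustration only.)
run_and_log() {
    log_file=$1
    shift
    cmd_status=0
    # Capture both streams so failure messages end up in the saved log.
    "$@" >"$log_file" 2>&1 || cmd_status=$?
    # Show the captured output after the command finishes.
    cat "$log_file"
    return "$cmd_status"
}

# Example usage (assumed file name):
#   run_and_log make-check.log make check
```

The output is printed only after the command completes, which keeps the helper simple and preserves the exit status without relying on shell-specific pipeline options.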
@jerryyin Please try the latest ROCm 6.0.2 to see whether your issue has been resolved. If it has, please close the ticket. Thanks.
The symptom looks like the failure in this CI log. Both Fiji GCC Debug and Clang Debug failed, with 5 tests failing in total.
Failure signatures vary quite a lot, ranging from indexing errors to corrupted descriptor memory. See the CI log for details.
This is a race condition because none of the above tests fail when run individually. They only fail together when running make check.
To reproduce, try the following setup:
In test/CMakeLists.txt (also see here), change the check target.
From
add_custom_target(check COMMAND ${CMAKE_CTEST_COMMAND} --output-on-failure -C ${CMAKE_CFG_INTDIR})
To
add_custom_target(check COMMAND MIOPEN_ENABLE_LOGGING=1 MIOPEN_LOG_LEVEL=5 ${CMAKE_CTEST_COMMAND} --verbose -C ${CMAKE_CFG_INTDIR})
Use the container rocm/tensorflow-private:zyin-miopen-debug. Note that this container was saved directly from the CI machine. Technically, a container created by following the CI log should be the same.
CXX=g++-5 CXXFLAGS='-Werror' cmake -DMIOPEN_GPU_SYNC=On -DMIOPEN_TEST_FLAGS='--disable-verification-cache' -DCMAKE_CXX_FLAGS_DEBUG='-g -fno-omit-frame-pointer -fsanitize=undefined -fno-sanitize-recover=undefined' -DBUILD_DEV=On -DCMAKE_BUILD_TYPE=debug ..
MIOPEN_DEBUG_CONV_IMPLICIT_GEMM_XDLOPS=1 CTEST_PARALLEL_LEVEL=4 MIOPEN_VERIFY_CACHE_PATH=/var/jenkins/.cache/miopen/vcache MIOPEN_CONV_PRECISE_ROCBLAS_TIMING=0 dumb-init make -j$(nproc) check doc MIOpenDriver
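Since the failures are timing dependent, a single run may not show them. A rough way to check the claim that the tests only fail when run together is to repeat the run several times with and without parallelism and compare failure counts. The helper below is a minimal sketch; `count_failures` is a hypothetical name, and the run counts in the usage comments are arbitrary. It relies only on the fact that ctest reads the CTEST_PARALLEL_LEVEL environment variable.

```shell
#!/bin/sh
# count_failures RUNS CMD [ARGS...]
# Runs CMD the given number of times and prints how many runs exited non-zero.
# (Hypothetical helper, for illustration only.)
count_failures() {
    runs=$1
    shift
    failures=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        # Discard output here; only the exit status matters for the count.
        if ! "$@" >/dev/null 2>&1; then
            failures=$((failures + 1))
        fi
        i=$((i + 1))
    done
    echo "$failures"
}

# Example usage (assumed run counts), comparing serial and parallel test runs:
#   count_failures 10 env CTEST_PARALLEL_LEVEL=1 make check
#   count_failures 10 env CTEST_PARALLEL_LEVEL=4 make check
```

If the serial runs never fail while the parallel runs fail intermittently, that is consistent with a race between tests, though it does not by itself identify which tests interfere with each other.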