Open junliume opened 3 months ago
@BrianHarrisonAMD @atamazov I suspect -fsanitize=undefined, but this needs more investigation.
It must be one of these:
-DCMAKE_CXX_FLAGS_DEBUG='-g -fno-omit-frame-pointer -fsanitize=undefined -fno-sanitize-recover=undefined -Wno-option-ignored '
Update: confirmed it is due to -fsanitize=undefined
@junliume @amberhassaan @DrizztDoUrden AFAICS, UB is related to hipFree. I recommend checking if reverting #2524 resolves the issue.
Unfortunately, in my short experiment reverting #2524 does not resolve this issue.
We do see lots of warning messages like:
Warning [hip_mem_get_info_wrapper] hipMemGetInfo error, status: 1
@junliume
IIRC sometimes we need to know the amount of free GPU memory and use hipMemGetInfo to query this info. But in some cases, this HIP function does fail, and I have no idea why. The workaround (which issues a warning and simply returns some fixed value) was introduced in #2333, https://github.com/ROCm/MIOpen/pull/2333/commits/6477e68cc0c98655700ec6dd83b1da7f608231c4
I suspect that the HIP runtime failure is caused by a combination of a severely outdated base driver, a new ROCm in Docker, and some target ASICs. I think that we need some assistance from the HIP runtime team.
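For readers unfamiliar with the workaround described above, here is a minimal sketch of the "warn and return a fixed value" pattern. It is not the code from #2333: the function name, the fallback constant, and the `query` callable (which stands in for hipMemGetInfo so the sketch compiles without HIP) are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>

// Hypothetical fallback value; the fixed value MIOpen actually returns is not
// stated in this thread.
constexpr std::size_t kAssumedFreeBytes = std::size_t{1} << 30;

// `query` stands in for hipMemGetInfo: it returns 0 on success and writes the
// number of free bytes through the pointer. Any non-zero status is an error.
std::size_t FreeMemoryOrFallback(const std::function<int(std::size_t*)>& query)
{
    std::size_t free_bytes = 0;
    const int status       = query(&free_bytes);
    if(status != 0)
    {
        // Report the failure, then continue with the fixed fallback value.
        std::cerr << "Warning [FreeMemoryOrFallback] query error, status: " << status << '\n';
        return kAssumedFreeBytes;
    }
    return free_bytes;
}
```

With this shape a failing query degrades to a warning instead of aborting, which would explain why the messages show up in the log without failing the tests by themselves.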
@junliume ...but I do not think this is related to this specific issue with UB.
Not sure if this was already known, but I tracked it down to the test_find_db testsuite, and it appears to be from calling the following in solver_finders.cpp:
std::transform(
    finders.begin(), finders.end(), std::inserter(solutions, solutions.end()), [&](auto&& f) {
        return std::make_pair(f->GetAlgorithmName(problem),
                              f->Find(ctx, problem, invoke_ctx, parameters, options));
    });
It seems that calling Find on the finders is what causes this issue in the test.
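To make the shape of that call easier to follow outside of MIOpen, here is a small self-contained sketch of the same std::transform + std::inserter pattern. `ToyFinder` and `CollectSolutions` are hypothetical stand-ins, not the real finder interface; the real `Find` takes the context, problem, and invoke parameters and runs the solvers.

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a MIOpen finder; not the real interface.
struct ToyFinder
{
    std::string name;
    int result;

    std::string GetAlgorithmName() const { return name; }
    int Find() const { return result; } // the real Find() runs the solvers
};

// Builds one (algorithm name, result) entry per finder. Every finder's Find()
// is invoked here, so a single misbehaving finder affects the whole call.
std::map<std::string, int>
CollectSolutions(const std::vector<std::unique_ptr<ToyFinder>>& finders)
{
    std::map<std::string, int> solutions;
    std::transform(finders.begin(),
                   finders.end(),
                   std::inserter(solutions, solutions.end()),
                   [](const auto& f) {
                       return std::make_pair(f->GetAlgorithmName(), f->Find());
                   });
    return solutions;
}
```

Because every finder is visited in one pass, narrowing the problem down means removing finders (or the solvers behind them) one at a time until the error disappears.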
Update: for the find_db.cpp test, I changed it to run only the forward test and narrowed the issue down to miopen::solver::conv::ConvMlirIgemmFwdXdlops.
This change to mlo_dir_conv.cpp fixes the forwards test for me:
static auto GetImplicitGemmSolvers()
{
    return miopen::solver::SolverContainer<
        miopen::solver::conv::ConvHipImplicitGemmForwardV4R5Xdlops,
        miopen::solver::conv::ConvHipImplicitGemmForwardV4R4Xdlops,
        miopen::solver::conv::ConvHipImplicitGemmForwardV4R4Xdlops_Padded_Gemm,
        miopen::solver::conv::ConvHipImplicitGemmBwdDataV4R1Xdlops,
        miopen::solver::conv::ConvHipImplicitGemmBwdDataV1R1Xdlops,
        miopen::solver::conv::ConvHipImplicitGemmV4R1Fwd,
        miopen::solver::conv::ConvHipImplicitGemmV4R4Fwd,
        // miopen::solver::conv::ConvMlirIgemmFwdXdlops,
        miopen::solver::conv::ConvMlirIgemmFwd,
        miopen::solver::conv::ConvMlirIgemmBwdXdlops,
        miopen::solver::conv::ConvMlirIgemmBwd,
        miopen::solver::conv::ConvHipImplicitGemmBwdDataV1R1,
        miopen::solver::conv::ConvHipImplicitGemmBwdDataV4R1,
        miopen::solver::conv::ConvAsmImplicitGemmV4R1DynamicFwd_1x1,
        miopen::solver::conv::ConvAsmImplicitGemmV4R1DynamicFwd,
        miopen::solver::conv::ConvAsmImplicitGemmV4R1DynamicBwd,
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicFwdXdlops,
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicBwdXdlops,
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicFwdXdlopsNHWC,
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicBwdXdlopsNHWC,
        miopen::solver::conv::ConvCkIgemmFwdV6r1DlopsNchw,
#if MIOPEN_BACKEND_HIP && MIOPEN_USE_COMPOSABLEKERNEL
        miopen::solver::conv::ConvHipImplicitGemmFwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmBwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmGroupFwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmGroupBwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemm3DGroupFwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemm3DGroupBwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmF16F8F16FwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmF16F8F16BwdXdlops,
#endif // MIOPEN_BACKEND_HIP && MIOPEN_USE_COMPOSABLEKERNEL
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicFwdDlopsNCHWC>{};
}
Going to dig a bit deeper to see what's the issue with that one solver.
Edit: Looks like the issue happens for me if I call any of the miir APIs, and goes away if I prevent those calls from happening.
This line is enough for it to trigger the issue for me:
miirCreateHandle(params.c_str());
Looks like it's due to the params the handle is created with, but not sure yet what caused this to be an issue now.
Adding a branch to suppress the ubsan errors, since they come from MLIR handle creation and our options are limited because we are using an older version.
PR with the suppression changes is up: #3198
@BrianHarrisonAMD, @junliume: Do we know what causes the error? It can't be that shared_ptr_base.h is the culprit. Could we be ignoring some problem in our code by suppressing these errors?
@amberhassaan shared_ptr_base.h isn't the issue, but it's where the ubsan error is reported during teardown of the application, and suppressing it there is the only way I could find to silence the error. The issue can be narrowed down to just creating an MLIR handle with nothing else happening (I made a reproducer for that), and it appears to be due to something in MLIR cleaning up static memory during exit.
Another byproduct of #3181
LastTest.log
The error message:
[How to reproduce]:
cmake command:
and then