ROCm / MIOpen

AMD's Machine Intelligence Library
https://rocm.docs.amd.com/projects/MIOpen/en/latest/

[CI][Failure] shared_ptr_base.h:199:9: runtime error: member call on address which does not point to an object of type 'std::_Sp_counted_base<>' #3192

Open junliume opened 3 months ago

junliume commented 3 months ago

Another byproduct of #3181

LastTest.log

The error message:

/usr/lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/bits/shared_ptr_base.h:199:9: runtime error: member call on address 0x00000b9e6590 which does not point to an object of type 'std::_Sp_counted_base<>'
0x00000b9e6590: note: object has invalid vptr
 00 00 00 00  d8 c0 dd 8e 53 7f 00 00  00 00 00 00 02 00 00 00  d9 01 00 00 00 00 00 00  30 d5 ac 10
              ^~~~~~~~~~~~~~~~~~~~~~~
              invalid vptr
    #0 0x7f53852a1bc7  (/data/MIOpen/build/lib/libMIOpen.so.1+0x29a5ebc7)
    #1 0x7f538e2c87cb  (/data/MIOpen/build/lib/libMIOpen.so.1+0x32a857cb)
    #2 0x7f530f9c2d9e  (/lib/x86_64-linux-gnu/libc.so.6+0x45d9e) (BuildId: 490fef8403240c91833978d494d39e537409b92e)
    #3 0x7f530f9c25c8  (/lib/x86_64-linux-gnu/libc.so.6+0x455c8) (BuildId: 490fef8403240c91833978d494d39e537409b92e)
    #4 0x7f530f9c260f  (/lib/x86_64-linux-gnu/libc.so.6+0x4560f) (BuildId: 490fef8403240c91833978d494d39e537409b92e)
    #5 0x7f530f9a6d96  (/lib/x86_64-linux-gnu/libc.so.6+0x29d96) (BuildId: 490fef8403240c91833978d494d39e537409b92e)
    #6 0x7f530f9a6e3f  (/lib/x86_64-linux-gnu/libc.so.6+0x29e3f) (BuildId: 490fef8403240c91833978d494d39e537409b92e)
    #7 0x249004  (/data/MIOpen/build/bin/test_find_db+0x249004)

[How to reproduce]:

cmake command:

CXX=/opt/rocm/llvm/bin/clang++ CXXFLAGS='-Werror'  cmake -DMIOPEN_TEST_FLAGS=' --disable-verification-cache ' -DCMAKE_BUILD_TYPE=debug -DCMAKE_CXX_FLAGS_DEBUG='-g -fno-omit-frame-pointer -fsanitize=undefined -fno-sanitize-recover=undefined -Wno-option-ignored ' -DBUILD_DEV=Off -DMIOPEN_USE_MLIR=ON -DMIOPEN_GPU_SYNC=Off  -DCMAKE_PREFIX_PATH=/opt/rocm    ..

and then

LLVM_PATH=/opt/rocm/llvm CTEST_PARALLEL_LEVEL=4  make -j$(nproc) install  check MIOpenDriver

junliume commented 3 months ago

@BrianHarrisonAMD @atamazov I suspect -fsanitize=undefined, but this needs more investigation.

It must be one of these flags:

-DCMAKE_CXX_FLAGS_DEBUG='-g -fno-omit-frame-pointer -fsanitize=undefined -fno-sanitize-recover=undefined -Wno-option-ignored '

Update: confirmed it is due to -fsanitize=undefined

atamazov commented 3 months ago

@junliume @amberhassaan @DrizztDoUrden AFAICS, UB is related to hipFree. I recommend checking if reverting #2524 resolves the issue.

junliume commented 3 months ago

> @junliume @amberhassaan @DrizztDoUrden AFAICS, UB is related to hipFree. I recommend checking if reverting #2524 resolves the issue.

Unfortunately, in my short experiment, reverting #2524 did not resolve this issue.

We do see lots of warning messages like:

Warning [hip_mem_get_info_wrapper] hipMemGetInfo error, status: 1

atamazov commented 3 months ago

@junliume

> We do see lots of warning messages like:
>
> Warning [hip_mem_get_info_wrapper] hipMemGetInfo error, status: 1

IIRC, we sometimes need to know the amount of free GPU memory and use hipMemGetInfo to query it. But in some cases this HIP function fails, and I have no idea why. The workaround (which issues a warning and simply returns some fixed value) was introduced in #2333: https://github.com/ROCm/MIOpen/pull/2333/commits/6477e68cc0c98655700ec6dd83b1da7f608231c4

I suspect that the reason for the HIP runtime failure is a combination of a severely outdated base driver, a new ROCm in Docker, and some target ASICs. I think we need some assistance from the HIP runtime team.
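
For readers unfamiliar with the workaround mentioned above (warn, then return a fixed value), here is a minimal sketch of that pattern. It is not the MIOpen code from #2333: the function name FreeMemoryOrFallback, the constant kAssumedFreeBytes and its value, and the callable that stands in for hipMemGetInfo are all assumptions made for illustration.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>

// Hypothetical fallback value; the constant actually used by MIOpen
// is not stated in this thread.
constexpr std::size_t kAssumedFreeBytes = std::size_t{1} << 30;

// `query` is a stand-in for hipMemGetInfo: it fills in the free and total
// byte counts and returns 0 on success, non-zero on failure.
// On failure this logs a warning and reports the fixed fallback value.
std::size_t FreeMemoryOrFallback(
    const std::function<int(std::size_t*, std::size_t*)>& query)
{
    std::size_t free_bytes  = 0;
    std::size_t total_bytes = 0;
    const int status        = query(&free_bytes, &total_bytes);
    if(status != 0)
    {
        std::cerr << "Warning: memory query failed, status: " << status << '\n';
        return kAssumedFreeBytes;
    }
    return free_bytes;
}
```

Passing the query as a callable keeps the sketch compilable without a GPU; in real code the call would go to the HIP runtime directly.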

atamazov commented 3 months ago

@junliume ...but I do not think this is related to this specific issue with UB.

BrianHarrisonAMD commented 3 months ago

Not sure if this was already known, but I tracked it down to the test_find_db test suite, and it appears to come from the following call in solver_finders.cpp:

    std::transform(
        finders.begin(), finders.end(), std::inserter(solutions, solutions.end()), [&](auto&& f) {
            return std::make_pair(f->GetAlgorithmName(problem),
                                  f->Find(ctx, problem, invoke_ctx, parameters, options));
        });

It seems that calling Find on the finders causes this issue in the test.
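
For context, the snippet above uses std::transform with std::inserter to run every finder and collect one (algorithm name, result) pair per finder into an associative container. A self-contained sketch of the same pattern follows. StubFinder and CollectSolutions are hypothetical stand-ins, not the MIOpen types; the real Find takes a context, problem, and other arguments that are omitted here.

```cpp
#include <algorithm>
#include <iterator>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a MIOpen finder; not the real interface.
struct StubFinder
{
    std::string name;
    int result;

    std::string GetAlgorithmName() const { return name; }
    int Find() const { return result; }
};

// Runs every finder and collects (algorithm name -> result) pairs.
std::map<std::string, int>
CollectSolutions(const std::vector<std::unique_ptr<StubFinder>>& finders)
{
    std::map<std::string, int> solutions;
    std::transform(finders.begin(),
                   finders.end(),
                   std::inserter(solutions, solutions.end()),
                   [](const auto& f) {
                       return std::make_pair(f->GetAlgorithmName(), f->Find());
                   });
    return solutions;
}
```

Because Find runs inside the lambda, any side effect of a single finder (such as creating an MLIR handle) happens during this one call, which is why the transform shows up as the trigger point.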

BrianHarrisonAMD commented 3 months ago

Update: for the find_db.cpp test, I changed it to run only the forward test and narrowed the issue down to miopen::solver::conv::ConvMlirIgemmFwdXdlops.

This change to mlo_dir_conv.cpp (commenting out that solver) fixes the forward test for me:

static auto GetImplicitGemmSolvers()
{
    return miopen::solver::SolverContainer<
        miopen::solver::conv::ConvHipImplicitGemmForwardV4R5Xdlops,
        miopen::solver::conv::ConvHipImplicitGemmForwardV4R4Xdlops,
        miopen::solver::conv::ConvHipImplicitGemmForwardV4R4Xdlops_Padded_Gemm,
        miopen::solver::conv::ConvHipImplicitGemmBwdDataV4R1Xdlops,
        miopen::solver::conv::ConvHipImplicitGemmBwdDataV1R1Xdlops,
        miopen::solver::conv::ConvHipImplicitGemmV4R1Fwd,
        miopen::solver::conv::ConvHipImplicitGemmV4R4Fwd,
        // miopen::solver::conv::ConvMlirIgemmFwdXdlops,
        miopen::solver::conv::ConvMlirIgemmFwd,
        miopen::solver::conv::ConvMlirIgemmBwdXdlops,
        miopen::solver::conv::ConvMlirIgemmBwd,
        miopen::solver::conv::ConvHipImplicitGemmBwdDataV1R1,
        miopen::solver::conv::ConvHipImplicitGemmBwdDataV4R1,
        miopen::solver::conv::ConvAsmImplicitGemmV4R1DynamicFwd_1x1,
        miopen::solver::conv::ConvAsmImplicitGemmV4R1DynamicFwd,
        miopen::solver::conv::ConvAsmImplicitGemmV4R1DynamicBwd,
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicFwdXdlops,
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicBwdXdlops,
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicFwdXdlopsNHWC,
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicBwdXdlopsNHWC,
        miopen::solver::conv::ConvCkIgemmFwdV6r1DlopsNchw,
#if MIOPEN_BACKEND_HIP && MIOPEN_USE_COMPOSABLEKERNEL
        miopen::solver::conv::ConvHipImplicitGemmFwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmBwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmGroupFwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmGroupBwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemm3DGroupFwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemm3DGroupBwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmF16F8F16FwdXdlops,
        miopen::solver::conv::ConvHipImplicitGemmF16F8F16BwdXdlops,
#endif // MIOPEN_BACKEND_HIP && MIOPEN_USE_COMPOSABLEKERNEL
        miopen::solver::conv::ConvAsmImplicitGemmGTCDynamicFwdDlopsNCHWC>{};
}

Going to dig a bit deeper to see what the issue is with that one solver.

Edit: Looks like the issue happens for me if I call any of the miir APIs, and goes away if I prevent those calls from happening.

This line is enough for it to trigger the issue for me:

miirCreateHandle(params.c_str());

Looks like it's due to the params the handle is created with, but I'm not sure yet why this has become an issue now.

BrianHarrisonAMD commented 3 months ago

I'm adding a branch to suppress the UBSan errors, since they come from MLIR handle creation and our options are limited because we are using an older version of MLIR.

The PR with the suppression changes is up: #3198

amberhassaan commented 3 months ago

@BrianHarrisonAMD , @junliume : Do we know what causes the error? It can't be that shared_ptr_base.h is the culprit. Could we be ignoring some problem in our code by suppressing these errors?

BrianHarrisonAMD commented 3 months ago

> @BrianHarrisonAMD , @junliume : Do we know what causes the error? It can't be that shared_ptr_base.h is the culprit. Could we be ignoring some problem in our code by suppressing these errors?

@amberhassaan shared_ptr_base.h isn't the issue, but it's where the UBSan error is reported during teardown of the application, and suppressing it there is the only way I could find to suppress the error. The issue can be narrowed down to just creating an MLIR handle with nothing else happening (I made a reproducer for that), and it appears to be due to something in MLIR cleaning up static memory during exit.