KhronosGroup / Vulkan-ValidationLayers

Vulkan Validation Layers (VVL)
https://vulkan.lunarg.com/doc/sdk/latest/linux/khronos_validation_layer.html

possible Android memory corruption in validation or SPIRV used by validation #8439

Open lunarpapillo opened 3 weeks ago

lunarpapillo commented 3 weeks ago

Environment:

Describe the Issue

When building and testing a Debug build using Android NDK 26.3, tests crash on all devices in the same place in VkArmBestPracticesLayerTest.ComputeShaderBadSpatialLocalityTest, inside an allocator within SPIRV-Tools:

#00 libVkLayer_khronos_validation.so (void std::__ndk1::allocator<unsigned int>::construct[abi:v170000]<unsigned int, unsigned int const&>(unsigned int*, unsigned int const&)+28)
...
#04 libVkLayer_khronos_validation.so (std::__ndk1::__wrap_iter<unsigned int*> std::__ndk1::vector<unsigned int, std::__ndk1::allocator<unsigned int> >::insert<std::__ndk1::__wrap_iter<unsigned int const*>, 0>(std::__ndk1::__wrap_iter<unsigned int const*>, std::__ndk1::__wrap_iter<unsigned int const*>, std::__ndk1::__wrap_iter<unsigned int const*>)+344) 
#05 libVkLayer_khronos_validation.so (spvtools::val::ValidationState_t::RegisterUniqueTypeDeclaration(spvtools::val::Instruction const*)+416)
#06 libVkLayer_khronos_validation.so (spvtools::val::(anonymous namespace)::ValidateUniqueness(spvtools::val::ValidationState_t&, spvtools::val::Instruction const*)+172)
#07 libVkLayer_khronos_validation.so (spvtools::val::TypePass(spvtools::val::ValidationState_t&, spvtools::val::Instruction const*)+88) 
#08 libVkLayer_khronos_validation.so (spvtools::val::(anonymous namespace)::ValidateBinaryUsingContextAndValidationState(spv_context_t const&, unsigned int const*, unsigned long, spv_diagnostic_t**, spvtools::val::ValidationState_t*)+3824) 
#09 libVkLayer_khronos_validation.so (spvValidateWithOptions+164)
#10 libVkLayer_khronos_validation.so (CoreChecks::RunSpirvValidation(spv_const_binary_t&, Location const&, ValidationCache*) const+296)
#11 libVkLayer_khronos_validation.so (CoreChecks::ValidateShaderModuleCreateInfo(VkShaderModuleCreateInfo const&, Location const&) const+692)
#12 libVkLayer_khronos_validation.so (CoreChecks::PreCallValidateCreateShaderModule(VkDevice_T*, VkShaderModuleCreateInfo const*, VkAllocationCallbacks const*, VkShaderModule_T**, ErrorObject const&) const+104) 
#13 libVkLayer_khronos_validation.so (vulkan_layer_chassis::CreateShaderModule(VkDevice_T*, VkShaderModuleCreateInfo const*, VkAllocationCallbacks const*, VkShaderModule_T**)+248)
#14  /system/lib64/libvulkan.so (vulkan::api::(anonymous namespace)::CreateShaderModule(VkDevice_T*, VkShaderModuleCreateInfo const*, VkAllocationCallbacks const*, VkShaderModule_T**)+160)
#15 libVulkanLayerValidationTests.so (vkt::ShaderModule::init(vkt::Device const&, VkShaderModuleCreateInfo const&)+168)
#16 libVulkanLayerValidationTests.so (VkShaderObj::InitFromGLSL(void const*)+224)
#17 libVulkanLayerValidationTests.so (VkShaderObj::VkShaderObj(VkRenderFramework*, char const*, VkShaderStageFlagBits, spv_target_env, SpvSourceType, VkSpecializationInfo const*, char const*, void const*)+268) 
#18 libVulkanLayerValidationTests.so (VkArmBestPracticesLayerTest_ComputeShaderBadSpatialLocalityTest_Test::TestBody()+296)
...

The full ndk-stack output is available: 008-ndk-stack-info.txt
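Frames #00 through #04 place the fault inside the iterator-range overload of std::vector<unsigned int>::insert. To check whether that libc++ code path misbehaves on its own in a Debug build with NDK 26.3, a minimal sketch like the following could be compiled outside VVL. The helper name AppendWords and its inputs are hypothetical and are not taken from SPIRV-Tools; nothing here is known to reproduce the crash.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical helper (not SPIRV-Tools code): appends the words in `src` to a
// copy of `dst` using the iterator-range overload of std::vector::insert,
// which is the overload named in frame #04 of the stack trace.
std::vector<std::uint32_t> AppendWords(std::vector<std::uint32_t> dst,
                                       const std::vector<std::uint32_t>& src) {
    // pos, first and last are all const_iterators, matching the template
    // arguments shown in the trace.
    dst.insert(dst.cend(), src.cbegin(), src.cend());
    return dst;
}
```

If a call such as AppendWords({1, 2}, {3, 4}) runs cleanly in the same Debug configuration, that would point away from libc++ alone and toward how the surrounding libraries are built or linked.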

The crash appears when using a Debug build with Android NDK 26.3. It does not appear when using a Release build with NDK 26.3, nor (using either a Release or a Debug build) with either NDK 25.2 or NDK 27.0.

Given that the code appears to run correctly in a Release build, that the crash is device-independent, and that the crash occurs during memory allocation, it's fairly likely that the compiler isn't the issue, and that something in validation or SPIRV is causing memory corruption that happens to crash validation when memory is laid out "just right". If Address Sanitizer is supported on Android, it might help uncover such corruption.

It's possible, though IMHO unlikely, that this is an unknown compiler bug that appeared in NDK 26 and disappeared in NDK 27, as symptoms like this are not listed as known issues: https://github.com/android/ndk/releases

To reproduce the problem, run a manual-Vulkan-ValidationLayers build with: http://tcubuser.lunarg.localdomain:8080/view/Manual/job/manual-Vulkan-ValidationLayers/build

lunarpapillo commented 3 weeks ago

For reference, original chat is: https://chat.google.com/room/AAAAOXVAYGg/FL0Vh98x-gM/FL0Vh98x-gM?cls=10

spencer-lunarg commented 3 weeks ago

tests crash on all devices in the same place in VkArmBestPracticesLayerTest.ComputeShaderBadSpatialLocalityTest,

This is 99% because VkArm is alphabetically first, so it is simply the first test to run; it will crash in any test.

mikes-lunarg commented 2 weeks ago

I was working on a minimal repro case and got it down to this. Note that I'm not even creating a Vulkan instance:

TEST_F(PositiveTooling, Issue8439) {
    std::vector<uint32_t> spv = {
        0x07230203, 0x00010000, 0x0008000b, 0x00000019, 0x00000000, 0x00020011, 0x00000001, 0x0006000b, 
        0x00000001, 0x4c534c47, 0x6474732e, 0x3035342e, 0x00000000, 0x0003000e, 0x00000000, 0x00000001, 
        0x0005000f, 0x00000005, 0x00000004, 0x6e69616d, 0x00000000, 0x00060010, 0x00000004, 0x00000011, 
        0x00000008, 0x00000008, 0x00000001, 0x00030003, 0x00000002, 0x000001c2, 0x00040005, 0x00000004, 
        0x6e69616d, 0x00000000, 0x00040005, 0x00000009, 0x756c6176, 0x00000065, 0x00050005, 0x0000000d, 
        0x6d615375, 0x72656c70, 0x00000000, 0x00040047, 0x0000000d, 0x00000022, 0x00000000, 0x00040047, 
        0x0000000d, 0x00000021, 0x00000000, 0x00040047, 0x00000018, 0x0000000b, 0x00000019, 0x00020013, 
        0x00000002, 0x00030021, 0x00000003, 0x00000002, 0x00030016, 0x00000006, 0x00000020, 0x00040017, 
        0x00000007, 0x00000006, 0x00000004, 0x00040020, 0x00000008, 0x00000007, 0x00000007, 0x00090019, 
        0x0000000a, 0x00000006, 0x00000001, 0x00000000, 0x00000000, 0x00000000, 0x00000001, 0x00000000, 
        0x0003001b, 0x0000000b, 0x0000000a, 0x00040020, 0x0000000c, 0x00000000, 0x0000000b, 0x0004003b, 
        0x0000000c, 0x0000000d, 0x00000000, 0x00040017, 0x0000000f, 0x00000006, 0x00000002, 0x0004002b, 
        0x00000006, 0x00000010, 0x3f000000, 0x0005002c, 0x0000000f, 0x00000011, 0x00000010, 0x00000010, 
        0x0004002b, 0x00000006, 0x00000012, 0x00000000, 0x00040015, 0x00000014, 0x00000020, 0x00000000, 
        0x00040017, 0x00000015, 0x00000014, 0x00000003, 0x0004002b, 0x00000014, 0x00000016, 0x00000008, 
        0x0004002b, 0x00000014, 0x00000017, 0x00000001, 0x0006002c, 0x00000015, 0x00000018, 0x00000016, 
        0x00000016, 0x00000017, 0x00050036, 0x00000002, 0x00000004, 0x00000000, 0x00000003, 0x000200f8, 
        0x00000005, 0x0004003b, 0x00000008, 0x00000009, 0x00000007, 0x0004003d, 0x0000000b, 0x0000000e, 
        0x0000000d, 0x00070058, 0x00000007, 0x00000013, 0x0000000e, 0x00000011, 0x00000002, 0x00000012, 
        0x0003003e, 0x00000009, 0x00000013, 0x000100fd, 0x00010038, 
    };

    spv_target_env spirv_environment = SPV_ENV_VULKAN_1_0;
    spv_context ctx = spvContextCreate(spirv_environment);
    spvtools::ValidatorOptions spirv_val_options;
    spv_const_binary_t binary{spv.data(), spv.size()};
    spv_diagnostic diag = nullptr;

    const spv_result_t spv_valid = spvValidateWithOptions(ctx, spirv_val_options, &binary, &diag);
    ASSERT_TRUE(spv_valid == SPV_SUCCESS);

    spvDiagnosticDestroy(diag);
    spvContextDestroy(ctx);
}

Weird thing is that if I add the same test to the SPIRV-Tools unit tests, it works fine! Same SPIRV-Tools commit, same CMake flags, same NDK.

lunarpapillo commented 2 weeks ago

Weird thing is that if I add the same test to the SPIRV-Tools unit tests, it works fine! Same SPIRV-Tools commit, same CMake flags, same NDK.

Do the SPIRV-Tools unit tests also run on Android?

mikes-lunarg commented 2 weeks ago

By default, SPIRV-Tools tests do not run on Android. I was able to run them by commenting out these lines: https://github.com/KhronosGroup/SPIRV-Tools/blob/main/CMakeLists.txt#L315-L317 and then manually pushing and running the test executable using the adb shell.

lunarpapillo commented 2 weeks ago

Weird...

const spv_result_t spv_valid = spvValidateWithOptions(ctx, spirv_val_options, &binary, &diag);
ASSERT_TRUE(spv_valid == SPV_SUCCESS);

spvDiagnosticDestroy(diag);
spvContextDestroy(ctx);

I presume the crash occurs in spvValidateWithOptions(), as it seems to with the VVL tests, and the stack trace is otherwise similar; I presume you were also running the test in isolation via --gtest_filter, yes?

Since it works in SPIRV-Tools unit tests, do you have a hypothesis as to why it fails deterministically in VVL? I've got nothing...

mikes-lunarg commented 2 weeks ago

I presume the crash occurs in spvValidateWithOptions(), as it seems to with the VVL tests, and the stack trace is otherwise similar; I presume you were also running the test in isolation via --gtest_filter, yes?

Yes and yes. And just like your initial write-up, this only affects the Debug build. Release builds make it past the spvValidateWithOptions() call and pass the assert.

Since it works in SPIRV-Tools unit tests, do you have a hypothesis as to why it fails deterministically in VVL? I've got nothing...

No real hypothesis yet. The fact that the test code works in one build (SPIRV-Tools) and not the other (VVL) makes me suspect something about how we build/package libSPIRV-Tools.

mikes-lunarg commented 2 weeks ago

Similar issue: https://github.com/KhronosGroup/glslang/issues/3534

That reporter traced it back to a specific constructor for std::vector and patched around it by constructing the vector using a different method.
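To illustrate what "constructing the vector using a different method" can look like, here is a minimal sketch of two ways to build the same std::vector<uint32_t> from a word buffer. It assumes, purely for illustration, that the constructor in question was the iterator-range one; the thread does not say which constructor the glslang reporter identified or what they replaced it with, and both function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption for illustration only: the iterator-range constructor is the
// form being avoided. The linked report is not quoted here.
std::vector<std::uint32_t> CopyWithRangeConstructor(const std::uint32_t* words,
                                                    std::size_t count) {
    return std::vector<std::uint32_t>(words, words + count);
}

// Alternative that never calls the range constructor: default-construct,
// reserve the final capacity once, then append element by element.
std::vector<std::uint32_t> CopyWithReserveAndPushBack(const std::uint32_t* words,
                                                      std::size_t count) {
    std::vector<std::uint32_t> out;
    out.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        out.push_back(words[i]);
    }
    return out;
}
```

Both functions yield vectors with identical contents, so swapping one for the other is a workaround that changes only which libc++ code path is taken. It does not explain why the original path fails.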