KhronosGroup / glslang

Khronos-reference front end for GLSL/ESSL, partial front end for HLSL, and a SPIR-V generator.
Other
3.04k stars 835 forks source link

HLSL: intermittent assert/crash in spvtools::opt::analysis::DefUseManager::AnalyzeInstUse #1205

Closed danginsburg closed 6 years ago

danginsburg commented 6 years ago

I updated to the latest glslang master (to commit 798d005c) and in trying to rebuild all of our HLSL shaders ran into an intermittent crash in Release (and intermittent assert in Debug). It is most likely a spirv-opt bug, but it is hard for me to say for sure because it is hard to reproduce.

Repro:

   glslangValidator -V --sep MainVs -e main -Os --iy -D crash_hlsl.vert

Intermittently, I will see the following assert:

    glslangValidator.exe!issue_debug_notification(const wchar_t * const message) Line 125   C++
    glslangValidator.exe!__acrt_report_runtime_error(const wchar_t * message) Line 142  C++
    glslangValidator.exe!abort() Line 61    C++
    glslangValidator.exe!common_assert_to_stderr_direct(const wchar_t * const expression, const wchar_t * const file_name, const unsigned int line_number) Line 124 C++
    glslangValidator.exe!common_assert_to_stderr<wchar_t>(const wchar_t * const expression, const wchar_t * const file_name, const unsigned int line_number) Line 142   C++
    glslangValidator.exe!common_assert<wchar_t>(const wchar_t * const expression, const wchar_t * const file_name, const unsigned int line_number, void * const return_address) Line 383    C++
    glslangValidator.exe!_wassert(const wchar_t * expression, const wchar_t * file_name, unsigned int line_number) Line 405 C++
>   glslangValidator.exe!spvtools::opt::analysis::DefUseManager::AnalyzeInstUse(spvtools::ir::Instruction * inst) Line 54   C++
    [External Code] 
    glslangValidator.exe!spvtools::ir::Instruction::ForEachInst(const std::function<void __cdecl(spvtools::ir::Instruction *)> & f, bool run_on_debug_line_insts) Line 405  C++
    glslangValidator.exe!spvtools::ir::BasicBlock::ForEachInst(const std::function<void __cdecl(spvtools::ir::Instruction *)> & f, bool run_on_debug_line_insts) Line 182   C++
    glslangValidator.exe!spvtools::ir::Function::ForEachInst(const std::function<void __cdecl(spvtools::ir::Instruction *)> & f, bool run_on_debug_line_insts) Line 48  C++
    glslangValidator.exe!spvtools::ir::Module::ForEachInst(const std::function<void __cdecl(spvtools::ir::Instruction *)> & f, bool run_on_debug_line_insts) Line 86    C++
    glslangValidator.exe!spvtools::opt::analysis::DefUseManager::AnalyzeDefUse(spvtools::ir::Module * module) Line 177  C++
    glslangValidator.exe!spvtools::opt::analysis::DefUseManager::DefUseManager(spvtools::ir::Module * module) Line 105  C++
    glslangValidator.exe!spvtools::ir::IRContext::IsConsistent() Line 176   C++
    glslangValidator.exe!spvtools::opt::Pass::Run(spvtools::ir::IRContext * ctx) Line 110   C++
    glslangValidator.exe!spvtools::opt::PassManager::Run(spvtools::ir::IRContext * context) Line 24 C++
    glslangValidator.exe!spvtools::Optimizer::Run(const unsigned int * original_binary, unsigned __int64 original_binary_size, std::vector<unsigned int,std::allocator<unsigned int> > * optimized_binary) Line 131 C++
    glslangValidator.exe!glslang::GlslangToSpv(const glslang::TIntermediate & intermediate, std::vector<unsigned int,std::allocator<unsigned int> > & spirv, spv::SpvBuildLogger * logger, glslang::SpvOptions * options) Line 6106 C++
    glslangValidator.exe!CompileAndLinkShaderUnits(std::vector<ShaderCompUnit,std::allocator<ShaderCompUnit> > compUnits) Line 946  C++
    glslangValidator.exe!CompileAndLinkShaderFiles(glslang::TWorklist & Worklist) Line 1023 C++
    glslangValidator.exe!singleMain() Line 1089 C++
    glslangValidator.exe!main(int argc, char * * argv) Line 1142    C++
    [External Code] 

In our distributed shader compile system that compiles 100k+ shaders, I will intermittently get a crash. I've been unable to isolate this to a 100% reproducer. However, if I run in a debug build with the command-line above enough times I will see the assert from time-to-time. It seems like some kind of memory corruption or non-determinism.

crash_hlsl.zip

danginsburg commented 6 years ago

The specific pass being run during the assert is spvtools::opt::DeadBranchElimPass.

danginsburg commented 6 years ago

I found a way to reproduce the crash reliably. Change iterations to 100 here in StandAlone.cpp:

https://github.com/KhronosGroup/glslang/blob/798d005ccdfb28ff8ec86658674c42db3c789abd/StandAlone/StandAlone.cpp#L1138

In RelWithDebInfo, I see an exception thrown during CFGCleanupPass:

Exception thrown at 0x00007FFB06C23FB8 in glslangValidator.exe: Microsoft C++ exception: std::out_of_range at memory location 0x000000190C96DD00.
Unhandled exception at 0x00007FFB06C23FB8 in glslangValidator.exe: Microsoft C++ exception: std::out_of_range at memory location 0x000000190C96DD00.
>   glslangValidator.exe!std::_Xout_of_range(const char * _Message) Line 25 C++
    glslangValidator.exe!spvtools::ir::CFG::block(unsigned int blk_id) Line 41  C++
    glslangValidator.exe!spvtools::opt::MemPass::RemoveUnreachableBlocks::__l2::<lambda>(unsigned int label_id) Line 780    C++
    glslangValidator.exe!spvtools::ir::BasicBlock::ForEachSuccessorLabel(const std::function<void __cdecl(unsigned int)> & f) Line 107  C++
    glslangValidator.exe!spvtools::opt::MemPass::RemoveUnreachableBlocks(spvtools::ir::Function * func) Line 793    C++
    glslangValidator.exe!spvtools::opt::Pass::ProcessCallTreeFromRoots(std::function<bool __cdecl(spvtools::ir::Function *)> & pfn, const std::unordered_map<unsigned int,spvtools::ir::Function *,std::hash<unsigned int>,std::equal_to<unsigned int>,std::allocator<std::pair<unsigned int const ,spvtools::ir::Function *> > > & id2function, std::queue<unsigned int,std::deque<unsigned int,std::allocator<unsigned int> > > * roots) Line 98  C++
    glslangValidator.exe!spvtools::opt::Pass::ProcessReachableCallTree(std::function<bool __cdecl(spvtools::ir::Function *)> & pfn, spvtools::ir::IRContext * irContext) Line 82    C++
    glslangValidator.exe!spvtools::opt::CFGCleanupPass::Process(spvtools::ir::IRContext * c) Line 38    C++
    glslangValidator.exe!spvtools::opt::Pass::Run(spvtools::ir::IRContext * ctx) Line 106   C++
    glslangValidator.exe!spvtools::opt::PassManager::Run(spvtools::ir::IRContext * context) Line 25 C++
    glslangValidator.exe!spvtools::Optimizer::Run(const unsigned int * original_binary, unsigned __int64 original_binary_size, std::vector<unsigned int,std::allocator<unsigned int> > * optimized_binary) Line 131 C++
    glslangValidator.exe!glslang::GlslangToSpv(const glslang::TIntermediate & intermediate, std::vector<unsigned int,std::allocator<unsigned int> > & spirv, spv::SpvBuildLogger * logger, glslang::SpvOptions * options) Line 6106 C++
    glslangValidator.exe!CompileAndLinkShaderUnits(std::vector<ShaderCompUnit,std::allocator<ShaderCompUnit> > compUnits) Line 946  C++
    glslangValidator.exe!CompileAndLinkShaderFiles(glslang::TWorklist & Worklist) Line 1022 C++
    glslangValidator.exe!singleMain() Line 1089 C++
    glslangValidator.exe!main(int argc, char * * argv) Line 1142    C++
dneto0 commented 6 years ago

This is almost certainly a SPIRV-Tools bug. Thanks for the reproducer. Hey @dnovillo how do you want to handle this?

dnovillo commented 6 years ago

I have not been able to reproduce this with current SPIRV-Tools master at 5f100789fb8dc220626996ad8afb160ccb602691. There have been some fixes in the CFG code which may be related to this. Could you try with today's trunk?

dneto0 commented 6 years ago

I'm trying to reproduce with @danginsburg 's commit hashes for Glslang and SPIRV-Tools. So far I haven't reproduced on Linux. I'll try Windows.

dneto0 commented 6 years ago

I've reproduced in VS 2017 with a Release build.

@alan-baker has been working on an independent failure case which might have the same root cause. See https://github.com/KhronosGroup/SPIRV-Tools/issues/1157

danginsburg commented 6 years ago

@dnovillo I updated to SPIRV-Tools master and I am still seeing the crash 5f100789fb8dc220626996ad8afb160ccb602691.

dnovillo commented 6 years ago

Thanks. Based on the stack trace you pasted and today's investigation, this seems to be related to what @alan-baker is triaging in https://github.com/KhronosGroup/SPIRV-Tools/issues/1157

On Wed, Jan 3, 2018 at 4:29 PM, Dan Ginsburg notifications@github.com wrote:

@dnovillo https://github.com/dnovillo I updated to SPIRV-Tools master and I am still seeing the crash 5f100789fb8dc220626996ad8afb160ccb602691.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/KhronosGroup/glslang/issues/1205#issuecomment-355132554, or mute the thread https://github.com/notifications/unsubscribe-auth/AG9RZ5civgh_nq7CwguGp6UPbNEwjSsaks5tG_EigaJpZM4RSEni .

dneto0 commented 6 years ago

Further investigation on RelWithDebInfo on VS 2017 strongly suggests this is the same as https://github.com/KhronosGroup/SPIRV-Tools/issues/1157:

I did the following:

dneto0 commented 6 years ago

I've temporarily added a --print-before-all option to the optimizer. I then reran to get a failing case, and I'm attaching the interesting output. compile-log-bad-7965.txt

This run failed on iteration 17. The attached file has the IR dump from before dead branch elimination, and before CFG cleanup (that immediately follows dead branch elim)..

The failure occurs when trying to look up successor block 7965.

Before dead branch elimination, we have:

OpBranchConditional %4020 %7965 %7966
%7965 = OpLabel

where %4020 is bool constant True.

After dead branch elimination ,we have this segment:

%7929 = OpLabel
%7532 = OpVectorShuffle %16 %1754 %1754 0 1 2
%7534 = OpLoad %16 %10837
%7535 = OpFMul %16 %7534 %7532
OpStore %10837 %7535
%7537 = OpLoad %16 %5785
OpStore %7470 %7537
%7538 = OpLoad %16 %5786
OpStore %7471 %7538
OpBranch %7965    ;;;;;;;;;;;;;; This leads nowhere
%7483 = OpLabel

There is an unconditional branch to block 7965 but block 7965 does not exist. This causes the assert failure during CFGCleanup when it tries to lookup the successor blocks to block 7929.

greg-lunarg commented 6 years ago

I will claim this and coordinate with @alan-baker.

dneto0 commented 6 years ago

I will claim this and coordinate with @alan-baker.

@alan-baker and I have been discussing this issue (and a separate case with same symptoms) for the last while. We're focused on dead-branch-elim. He's just made a very small reproducer and we're discussing algorithm changes.

We're also confused by some of the existing code in dead branch elim.
@greg-lunarg : Do you have a small test case for https://github.com/KhronosGroup/SPIRV-Tools/issues/903 ? The fix was https://github.com/KhronosGroup/SPIRV-Tools/pull/883 but it didn't come with a unit test case.

greg-lunarg commented 6 years ago

Do you have a small test case for KhronosGroup/SPIRV-Tools#903 ?

Sorry, no.

dneto0 commented 6 years ago

With extra disassembly dumps I've confirmed there is non-determinism, and can see that in the bad case in dead branch elim deletes some blocks that it doesn't in other cases. That confirms some nondeterminism, and I'll be probing more.

Separately, @alan-baker has wondered about the tracking of live vs. dead blocks and how it depends crucially on the structured-order traversal, but that unusual but valid orderings will confuse the live vs. dead tracking.

dneto0 commented 6 years ago

I've tracked down the non-determinism, and have a local fix.

The dead branch elimination code keeps track of instructions that are backedge branches, as part of the fix for https://github.com/KhronosGroup/SPIRV-Tools/issues/903 It keeps a set of pointers-to-ir::Instruction for this purpose.

In late December @s-perron changed optimizer code so that eliminating an instruction meant actually deleting the instruction object; before we had only been setting the instruction to an OpNop. What can happen is that you can delete a loop, and that loop ends with a backedge branch. That branch will be deleted and it's memory could later be reused. Then you can get a false positive thinking that some newly allocated instruction is a backedge.
My local hacky fix is to track internally-generated unique IDs of backedge branches rather than pointers-to-instruction. That makes the problem go away.
An alternative is to keep the backedge set up to date: whenever a branch is deleted by this pass, try removing it from backedge as well.

greg-lunarg commented 6 years ago

Thanks! I think using the unique IDs is a good solution here. I will go over the other passes I wrote and make sure that any module level data structures holding Instruction* are on solid ground.

dneto0 commented 6 years ago

Thanks! I think using the unique IDs is a good solution here. I will go over the other passes I wrote and make sure that any module level data structures holding Instruction* are on solid ground.

That's the fun thing. The module-level data was good. It's was the local backeges_ set that had dangling pointers to deleted instructions.

I filed https://github.com/KhronosGroup/SPIRV-Tools/pull/1169 to fix the problem.

greg-lunarg commented 6 years ago

Yes. I might have misused the term "module" here. I was thinking of the data structures in each of the Pass classes (modules?) that I wrote, backedges_ being one. Essentially anything with a trailing underscore.

dneto0 commented 6 years ago

Yes. I might have misused the term "module" here. I was thinking of the data structures in each of the Pass classes (modules?) that I wrote, backedges_ being one. Essentially anything with a trailing underscore.

Ah yes, that makes sense now.

The fix https://github.com/KhronosGroup/SPIRV-Tools/pull/1169 for this issue has been merged into SPIRV-Tools master.

I recommend keeping this open until that propagates into Glslang's known-good.

In the meantime:

greg-lunarg commented 6 years ago

Please also disable in -Os :)

greg-lunarg commented 6 years ago

Fixed by #1211

dneto0 commented 6 years ago

Please also disable in -Os :)

Yes, disabled in both. I didn't see it was in both until I looked. https://github.com/KhronosGroup/SPIRV-Tools/pull/1172 has been merged.

johnkslang commented 6 years ago

After getting the latest, cleaning, restarting, and rebuilding, I get a build failure with Visual Studio:

Error   5   error MSB6006: "cmd.exe" exited with code 9009. C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V120\Microsoft.CppCommon.targets  170 5   spirv-tools-pkg-config

The Install target further gets this error:

Error   103 error MSB3073: The command "setlocal
"C:\Program Files\CMake\bin\cmake.exe" -DBUILD_TYPE=Release -P cmake_install.cmake
if %errorlevel% neq 0 goto :cmEnd
:cmEnd
endlocal & call :cmErrorLevel %errorlevel% & goto :cmDone
:cmErrorLevel
exit /b %1
:cmDone
if %errorlevel% neq 0 goto :VCEnd
:VCEnd" exited with code 1. C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V120\Microsoft.CppCommon.targets  132 5   INSTALL

Any ideas?

johnkslang commented 6 years ago

More information, Install says:

15>  CMake Error at External/spirv-tools/cmake_install.cmake:40 (file):
15>    file INSTALL cannot find
15>    "C:/cygwin64/home/john/GitHub/glslang/build/External/spirv-tools/SPIRV-Tools.pc".
15>  Call Stack (most recent call first):
15>    External/cmake_install.cmake:33 (include)
15>    cmake_install.cmake:32 (include)
greg-lunarg commented 6 years ago

It is building for me with VS2015 on Win10, both with and without spirv-opt present.

I am doing both the config and build in a VS 2015 Developer Command Window. Here are the commands, executed in the build directory:

cmake .. -DCMAKE_INSTALL_PREFIX="C:\Users\Greg\dev\glslang\build\install"

cmake --build . --config Release --target install

greg-lunarg commented 6 years ago

Also builds in basic Command Prompt window.

greg-lunarg commented 6 years ago

Do you get any message before the first 9009 error message? I think that usually indicates a missing (executable?) file.

johnkslang commented 6 years ago

Okay, will assume something specific to my environment, and close this as fixing the original issue.

danginsburg commented 6 years ago

Thanks all. I can confirm the crash is fixed in our distributed shader compile environment. I also didn't hit the build error @johnkslang mentioned (I built for VS2015 x86/x64, macOS, Linux, and Android).