Closed danginsburg closed 6 years ago
The specific pass being run during the assert is spvtools::opt::DeadBranchElimPass.
I found a way to reproduce the crash reliably. Change iterations to 100 here in StandAlone.cpp:
In RelWithDebInfo, I see an exception thrown during CFGCleanupPass:
Exception thrown at 0x00007FFB06C23FB8 in glslangValidator.exe: Microsoft C++ exception: std::out_of_range at memory location 0x000000190C96DD00.
Unhandled exception at 0x00007FFB06C23FB8 in glslangValidator.exe: Microsoft C++ exception: std::out_of_range at memory location 0x000000190C96DD00.
> glslangValidator.exe!std::_Xout_of_range(const char * _Message) Line 25 C++
glslangValidator.exe!spvtools::ir::CFG::block(unsigned int blk_id) Line 41 C++
glslangValidator.exe!spvtools::opt::MemPass::RemoveUnreachableBlocks::__l2::<lambda>(unsigned int label_id) Line 780 C++
glslangValidator.exe!spvtools::ir::BasicBlock::ForEachSuccessorLabel(const std::function<void __cdecl(unsigned int)> & f) Line 107 C++
glslangValidator.exe!spvtools::opt::MemPass::RemoveUnreachableBlocks(spvtools::ir::Function * func) Line 793 C++
glslangValidator.exe!spvtools::opt::Pass::ProcessCallTreeFromRoots(std::function<bool __cdecl(spvtools::ir::Function *)> & pfn, const std::unordered_map<unsigned int,spvtools::ir::Function *,std::hash<unsigned int>,std::equal_to<unsigned int>,std::allocator<std::pair<unsigned int const ,spvtools::ir::Function *> > > & id2function, std::queue<unsigned int,std::deque<unsigned int,std::allocator<unsigned int> > > * roots) Line 98 C++
glslangValidator.exe!spvtools::opt::Pass::ProcessReachableCallTree(std::function<bool __cdecl(spvtools::ir::Function *)> & pfn, spvtools::ir::IRContext * irContext) Line 82 C++
glslangValidator.exe!spvtools::opt::CFGCleanupPass::Process(spvtools::ir::IRContext * c) Line 38 C++
glslangValidator.exe!spvtools::opt::Pass::Run(spvtools::ir::IRContext * ctx) Line 106 C++
glslangValidator.exe!spvtools::opt::PassManager::Run(spvtools::ir::IRContext * context) Line 25 C++
glslangValidator.exe!spvtools::Optimizer::Run(const unsigned int * original_binary, unsigned __int64 original_binary_size, std::vector<unsigned int,std::allocator<unsigned int> > * optimized_binary) Line 131 C++
glslangValidator.exe!glslang::GlslangToSpv(const glslang::TIntermediate & intermediate, std::vector<unsigned int,std::allocator<unsigned int> > & spirv, spv::SpvBuildLogger * logger, glslang::SpvOptions * options) Line 6106 C++
glslangValidator.exe!CompileAndLinkShaderUnits(std::vector<ShaderCompUnit,std::allocator<ShaderCompUnit> > compUnits) Line 946 C++
glslangValidator.exe!CompileAndLinkShaderFiles(glslang::TWorklist & Worklist) Line 1022 C++
glslangValidator.exe!singleMain() Line 1089 C++
glslangValidator.exe!main(int argc, char * * argv) Line 1142 C++
This is almost certainly a SPIRV-Tools bug. Thanks for the reproducer. Hey @dnovillo how do you want to handle this?
I have not been able to reproduce this with current SPIRV-Tools master at 5f100789fb8dc220626996ad8afb160ccb602691. There have been some fixes in the CFG code which may be related to this. Could you try with today's trunk?
I'm trying to reproduce with @danginsburg 's commit hashes for Glslang and SPIRV-Tools. So far I haven't reproduced on Linux. I'll try Windows.
I've reproduced in VS 2017 with a Release build.
@alan-baker has been working on an independent failure case which might have the same root cause. See https://github.com/KhronosGroup/SPIRV-Tools/issues/1157
@dnovillo I updated to SPIRV-Tools master and I am still seeing the crash 5f100789fb8dc220626996ad8afb160ccb602691.
Thanks. Based on the stack trace you pasted and today's investigation, this seems to be related to what @alan-baker is triaging in https://github.com/KhronosGroup/SPIRV-Tools/issues/1157
On Wed, Jan 3, 2018 at 4:29 PM, Dan Ginsburg notifications@github.com wrote:
@dnovillo https://github.com/dnovillo I updated to SPIRV-Tools master and I am still seeing the crash 5f100789fb8dc220626996ad8afb160ccb602691.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/KhronosGroup/glslang/issues/1205#issuecomment-355132554, or mute the thread https://github.com/notifications/unsubscribe-auth/AG9RZ5civgh_nq7CwguGp6UPbNEwjSsaks5tG_EigaJpZM4RSEni .
Further investigation on RelWithDebInfo on VS 2017 strongly suggests this is the same as https://github.com/KhronosGroup/SPIRV-Tools/issues/1157:
I did the following:
Recompile with SPIRV-Tools at latest master. This seems to expose the assert failure in fewer iterations on average, which is nice.
Ran 100-iteration glslang with -Os, and I reliably reproduce the assert failure in CFGCleanup as @danginsburg describes above. E.g. 7th iteration, with label_id 8112 12th iteration, label_id 8916 8th iteration, label_id 8916 26th iteration, label_id 6942 12th iteration, label_id 8916 33rd iteration, label_id 6942 Aside: It's interesting that we're seeing repeated label_ids. That might be useful if we want to stare at the SPIR-V assembly to see what's going on.
Remove dead branch elimination from Glslang's size-optimization recipe. Then I've run 100-iteration Glslang with -Os (without dead branch elim) and it works without failure. I tried this 5 times in a row.
I've temporarily added a --print-before-all option to the optimizer. I then reran to get a failing case, and I'm attaching the interesting output. compile-log-bad-7965.txt
This run failed on iteration 17. The attached file has the IR dump from before dead branch elimination, and before CFG cleanup (that immediately follows dead branch elim)..
The failure occurs when trying to look up successor block 7965.
Before dead branch elimination, we have:
OpBranchConditional %4020 %7965 %7966
%7965 = OpLabel
where %4020 is bool constant True.
After dead branch elimination ,we have this segment:
%7929 = OpLabel
%7532 = OpVectorShuffle %16 %1754 %1754 0 1 2
%7534 = OpLoad %16 %10837
%7535 = OpFMul %16 %7534 %7532
OpStore %10837 %7535
%7537 = OpLoad %16 %5785
OpStore %7470 %7537
%7538 = OpLoad %16 %5786
OpStore %7471 %7538
OpBranch %7965 ;;;;;;;;;;;;;; This leads nowhere
%7483 = OpLabel
There is an unconditional branch to block 7965 but block 7965 does not exist. This causes the assert failure during CFGCleanup when it tries to lookup the successor blocks to block 7929.
I will claim this and coordinate with @alan-baker.
I will claim this and coordinate with @alan-baker.
@alan-baker and I have been discussing this issue (and a separate case with same symptoms) for the last while. We're focused on dead-branch-elim. He's just made a very small reproducer and we're discussing algorithm changes.
We're also confused by some of the existing code in dead branch elim.
@greg-lunarg : Do you have a small test case for https://github.com/KhronosGroup/SPIRV-Tools/issues/903 ? The fix was https://github.com/KhronosGroup/SPIRV-Tools/pull/883 but it didn't come with a unit test case.
Do you have a small test case for KhronosGroup/SPIRV-Tools#903 ?
Sorry, no.
With extra disassembly dumps I've confirmed there is non-determinism, and can see that in the bad case in dead branch elim deletes some blocks that it doesn't in other cases. That confirms some nondeterminism, and I'll be probing more.
Separately, @alan-baker has wondered about the tracking of live vs. dead blocks and how it depends crucially on the structured-order traversal, but that unusual but valid orderings will confuse the live vs. dead tracking.
I've tracked down the non-determinism, and have a local fix.
The dead branch elimination code keeps track of instructions that are backedge branches, as part of the fix for https://github.com/KhronosGroup/SPIRV-Tools/issues/903 It keeps a set of pointers-to-ir::Instruction for this purpose.
In late December @s-perron changed optimizer code so that eliminating an instruction meant actually deleting the instruction object; before we had only been setting the instruction to an OpNop. What can happen is that you can delete a loop, and that loop ends with a backedge branch. That branch will be deleted and it's memory could later be reused. Then you can get a false positive thinking that some newly allocated instruction is a backedge.
My local hacky fix is to track internally-generated unique IDs of backedge branches rather than pointers-to-instruction. That makes the problem go away.
An alternative is to keep the backedge set up to date: whenever a branch is deleted by this pass, try removing it from backedge as well.
Thanks! I think using the unique IDs is a good solution here. I will go over the other passes I wrote and make sure that any module level data structures holding Instruction* are on solid ground.
Thanks! I think using the unique IDs is a good solution here. I will go over the other passes I wrote and make sure that any module level data structures holding Instruction* are on solid ground.
That's the fun thing. The module-level data was good. It's was the local backeges_ set that had dangling pointers to deleted instructions.
I filed https://github.com/KhronosGroup/SPIRV-Tools/pull/1169 to fix the problem.
Yes. I might have misused the term "module" here. I was thinking of the data structures in each of the Pass classes (modules?) that I wrote, backedges_ being one. Essentially anything with a trailing underscore.
Yes. I might have misused the term "module" here. I was thinking of the data structures in each of the Pass classes (modules?) that I wrote, backedges_ being one. Essentially anything with a trailing underscore.
Ah yes, that makes sense now.
The fix https://github.com/KhronosGroup/SPIRV-Tools/pull/1169 for this issue has been merged into SPIRV-Tools master.
I recommend keeping this open until that propagates into Glslang's known-good.
In the meantime:
Please also disable in -Os :)
Fixed by #1211
Please also disable in -Os :)
Yes, disabled in both. I didn't see it was in both until I looked. https://github.com/KhronosGroup/SPIRV-Tools/pull/1172 has been merged.
After getting the latest, cleaning, restarting, and rebuilding, I get a build failure with Visual Studio:
Error 5 error MSB6006: "cmd.exe" exited with code 9009. C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V120\Microsoft.CppCommon.targets 170 5 spirv-tools-pkg-config
The Install
target further gets this error:
Error 103 error MSB3073: The command "setlocal
"C:\Program Files\CMake\bin\cmake.exe" -DBUILD_TYPE=Release -P cmake_install.cmake
if %errorlevel% neq 0 goto :cmEnd
:cmEnd
endlocal & call :cmErrorLevel %errorlevel% & goto :cmDone
:cmErrorLevel
exit /b %1
:cmDone
if %errorlevel% neq 0 goto :VCEnd
:VCEnd" exited with code 1. C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V120\Microsoft.CppCommon.targets 132 5 INSTALL
Any ideas?
More information, Install says:
15> CMake Error at External/spirv-tools/cmake_install.cmake:40 (file):
15> file INSTALL cannot find
15> "C:/cygwin64/home/john/GitHub/glslang/build/External/spirv-tools/SPIRV-Tools.pc".
15> Call Stack (most recent call first):
15> External/cmake_install.cmake:33 (include)
15> cmake_install.cmake:32 (include)
It is building for me with VS2015 on Win10, both with and without spirv-opt present.
I am doing both the config and build in a VS 2015 Developer Command Window. Here are the commands, executed in the build directory:
cmake .. -DCMAKE_INSTALL_PREFIX="C:\Users\Greg\dev\glslang\build\install"
cmake --build . --config Release --target install
Also builds in basic Command Prompt window.
Do you get any message before the first 9009 error message? I think that usually indicates a missing (executable?) file.
Okay, will assume something specific to my environment, and close this as fixing the original issue.
Thanks all. I can confirm the crash is fixed in our distributed shader compile environment. I also didn't hit the build error @johnkslang mentioned (I built for VS2015 x86/x64, macOS, Linux, and Android).
I updated to the latest glslang master (to commit 798d005c) and in trying to rebuild all of our HLSL shaders ran into an intermittent crash in Release (and intermittent assert in Debug). It is most likely a spirv-opt bug, but it is hard for me to say for sure because it is hard to reproduce.
Repro:
Intermittently, I will see the following assert:
In our distributed shader compile system that compiles 100k+ shaders, I will intermittently get a crash. I've been unable to isolate this to a 100% reproducer. However, if I run in a debug build with the command-line above enough times I will see the assert from time-to-time. It seems like some kind of memory corruption or non-determinism.
crash_hlsl.zip