Open ZapAndersson opened 1 year ago
Due to the intermittent nature of this it's hard to debug, and often I get a crash with no useful callstack, only an "abort was called" exception. I will try to figure out more, but if the above gives you any "Eureka" ideas, @lgritz, let me know.
I'm wondering if it can have anything to do with issue #1427?
Better call stack, with some of the LLVM stuff untangled: @lgritz
oslexec.dll!llvm::report_fatal_error(const llvm::Twine & Reason, bool GenCrashDiag) Line 122 C++
oslexec.dll!llvm::report_fatal_error(const char * Reason, bool GenCrashDiag) Line 83 C++
> oslexec.dll!llvm::RuntimeDyldCOFFX86_64::resolveRelocation(const llvm::RelocationEntry & RE, unsigned __int64 Value) Line 117 C++
oslexec.dll!llvm::RuntimeDyldImpl::resolveRelocationList(const llvm::SmallVector<llvm::RelocationEntry,64> & Relocs, unsigned __int64 Value) Line 1106 C++
oslexec.dll!llvm::RuntimeDyldImpl::resolveLocalRelocations() Line 149 C++
oslexec.dll!llvm::RuntimeDyldImpl::resolveRelocations() Line 145 C++
oslexec.dll!llvm::MCJIT::finalizeLoadedModules() Line 244 C++
oslexec.dll!llvm::MCJIT::finalizeObject() Line 270 C++
oslexec.dll!OSL_v1_12::pvt::LLVM_Util::getPointerToFunction(llvm::Function * func) Line 1714 C++
oslexec.dll!OSL_v1_12::pvt::BackendLLVM::run() Line 1674 C++
oslexec.dll!OSL_v1_12::pvt::ShadingSystemImpl::optimize_group(OSL_v1_12::ShaderGroup & group, OSL_v1_12::ShadingContext * ctx, bool do_jit) Line 3595 C++
oslexec.dll!OSL_v1_12::ShadingContext::execute_init(OSL_v1_12::ShaderGroup & sgroup, int shadeindex, OSL_v1_12::ShaderGlobals & ssg, void * userdata_base_ptr, void * output_base_ptr, bool run) Line 91 C++
oslexec.dll!OSL_v1_12::ShadingContext::execute(OSL_v1_12::ShaderGroup & sgroup, int shadeindex, OSL_v1_12::ShaderGlobals & ssg, void * userdata_base_ptr, void * output_base_ptr, bool run) Line 217 C++
oslexec.dll!OSL_v1_12::pvt::ShadingSystemImpl::execute(OSL_v1_12::ShadingContext & ctx, OSL_v1_12::ShaderGroup & group, int index, OSL_v1_12::ShaderGlobals & ssg, void * userdata_base_ptr, void * output_base_ptr, bool run) Line 3264 C++
[Inline Frame] OSLMap.dlt!OSL_v1_12::ShadingSystem::execute(OSL_v1_12::ShadingContext &) Line 688 C++
[Inline Frame] OSLMap.dlt!OSL_v1_12::ShadingSystem::execute(OSL_v1_12::ShadingContext *) Line 695 C++
OSLMap.dlt!OSLTex::EvalColor(ShadeContext & sc, int output, bool bump) Line 3227 C++
OSLMap.dlt!OSLTex::EvalColor(ShadeContext & sc) Line 2936 C++
3dsmax.exe!RenderTexmapRange::__l5::<lambda_1>::operator()(const tbb::blocked_range<int> & rng) Line 1950 C++
[Inline Frame] 3dsmax.exe!tbb::interface9::internal::start_for<tbb::blocked_range<int>,`RenderTexmapRange'::`5'::<lambda_1>,tbb::auto_partitioner const>::run_body(tbb::blocked_range<int> &) Line 115 C++
3dsmax.exe!tbb::interface9::internal::dynamic_grainsize_mode<tbb::interface9::internal::adaptive_mode<tbb::interface9::internal::auto_partition_type>>::work_balance<tbb::interface9::internal::start_for<tbb::blocked_range<int>,`RenderTexmapRange'::`5'::<lambda_1>,tbb::auto_partitioner const>,tbb::blocked_range<int>>(tbb::interface9::internal::start_for<tbb::blocked_range<int>,`RenderTexmapRange'::`5'::<lambda_1>,tbb::auto_partitioner const> & start, tbb::blocked_range<int> & range) Line 439 C++
3dsmax.exe!tbb::interface9::internal::start_for<tbb::blocked_range<int>,`RenderTexmapRange'::`5'::<lambda_1>,tbb::auto_partitioner const>::execute() Line 143 C++
The actual abort is here
Called from here:
called from here:
called from here:
Called from here:
Called from:
Called from OSL here (as per the original message above):
I react especially to this line.....
Also, the LLVM comment says 2GB and yet the check is done with UINT32_MAX, which is 4GB. Is the comment wrong, or should the code be changed to INT32_MAX? I don't know if that's the source of these problems (if anything it would make the errors happen more often if changed to 2GB).
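To make the 2GB versus 4GB question concrete, here is a minimal sketch of that kind of range check. It is not the LLVM source: `offset_in_range` and `kGiB` are hypothetical names, and the only assumption is that the check compares the distance from the image base against a maximum offset.

```cpp
#include <cstdint>

constexpr std::uint64_t kGiB = 1ull << 30;

// Hypothetical helper, not LLVM code: true if `value` lies at or above
// `image_base` and no more than `max_offset` bytes beyond it.
constexpr bool offset_in_range(std::uint64_t value, std::uint64_t image_base,
                               std::uint64_t max_offset) {
    return value >= image_base && (value - image_base) <= max_offset;
}

// UINT32_MAX is one byte short of 4 GiB, INT32_MAX one byte short of 2 GiB.
static_assert(UINT32_MAX == 4 * kGiB - 1, "UINT32_MAX is 4 GiB - 1");
static_assert(static_cast<std::uint64_t>(INT32_MAX) == 2 * kGiB - 1,
              "INT32_MAX is 2 GiB - 1");

// A target 3 GiB above the base passes a UINT32_MAX check but would be
// rejected by an INT32_MAX check.
static_assert(offset_in_range(0x10000 + 3 * kGiB, 0x10000, UINT32_MAX),
              "3 GiB fits an unsigned 32-bit offset");
static_assert(!offset_in_range(0x10000 + 3 * kGiB, 0x10000, INT32_MAX),
              "3 GiB does not fit a signed 32-bit offset");
```

So tightening the limit to INT32_MAX would only reject more layouts, which matches the guess that it would make the error more frequent rather than less.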
So it seems this IMAGE_REL_AMD64_ADDR32NB mode is a 32-bit offset based thing, but the one at the end of the above screenshot, IMAGE_REL_AMD64_ADDR64, is true 64-bit.
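A rough sketch of the difference between the two modes, assuming the first stores a 32-bit offset from the image base and the second stores the full 64-bit address. The function names are hypothetical and this is not how LLVM itself is written; it only shows why one mode can fail on a far-away target and the other cannot.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical: store a 32-bit image-relative offset, in the style of
// IMAGE_REL_AMD64_ADDR32NB. Returns false if the target is below the image
// base or too far above it to fit in 32 bits.
bool write_addr32nb(std::uint8_t* target, std::uint64_t value,
                    std::uint64_t image_base) {
    if (value < image_base || value - image_base > UINT32_MAX)
        return false;
    const std::uint32_t offset = static_cast<std::uint32_t>(value - image_base);
    std::memcpy(target, &offset, sizeof offset);
    return true;
}

// Hypothetical: store the full 64-bit absolute address, in the style of
// IMAGE_REL_AMD64_ADDR64. Any address fits, so there is nothing to check.
void write_addr64(std::uint8_t* target, std::uint64_t value) {
    std::memcpy(target, &value, sizeof value);
}
```

The 64-bit form needs 8 bytes at the fix-up site instead of 4, so the two are not interchangeable in place without the surrounding data being laid out for the wider field.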
I made a Godawful Hack(tm) in the LLVM code like so, so that any function that made the decision to use the former mode instead used the latter mode....:
....and the problem disappeared.
Now is this a good fix?
I highly doubt it, but.....??
/Z
@lgritz
I think we should report this on the llvm-dev forums, probably in the "code generation" board?
Zap, can you take care of that? I feel like it's more efficient for you to do that communication rather than me having to be the go-between. You're much more familiar with the relevant LLVM stack traces and internals than I am at this point.
I think there are three things to try to get out of that interaction:
Now, on our end, we are in a bit of a pickle in that we still have a lot of work to make OSL work with LLVM 16+. They are close to releasing 17, and definitely will not backport fixes as far back as 15. So you may be forced to maintain those patches on your end at Autodesk (you seem to be the only ones running into this problem) until we can all upgrade to the latest LLVM that would have a fix. But like I said, if they have a suggestion for how to ameliorate the problem from our side, that's the best option.
Well actually I got a lot of (probably great, but I barely understand them due to being a total LLVM noob) replies here: https://github.com/llvm/llvm-project/issues/65641
Does any of that tell you anything?
They say this bit is only relevant for debugging and "exception handling". Are we using exception handling in OSL?
They say we can "turn it off and the problem goes away".
/Z
From: Larry Gritz. Sent: Thursday, September 7, 2023 6:48 PM. Subject: Re: [AcademySoftwareFoundation/OpenShadingLanguage] Intermittent crash in LLVM_Util::getPointerToFunction(llvm::Function* func) (Issue #1712)
Yes, lots of good replies at https://github.com/llvm/llvm-project/issues/65641 ...
OSL has a line that reads (in llvm_util.cpp https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/main/src/liboslexec/llvm_util.cpp#L1442)
//engine_builder.setCodeModel(llvm::CodeModel::Default);
I'll try to set it to "::Large" or "::Medium" and see if this changes things (apparently ::Small is the default?). Does this make sense?
1. Have somebody on the LLVM team confirm that we're on the right track, that this patch is essentially correct and does no additional harm, or else that we're totally misguided and there is something different we should be doing to address the problem.
Well, we have that already. My hack is most certainly WRONG :)
Exceptions: we're definitely not relying on them. But perhaps there is a way to explicitly turn them off, which we have neglected to do?
setCodeModel: that may be fruitful. What happens if you make this call, and pass llvm::CodeModel::Large?
In my quick test, setting CodeModel::Large did not change anything, but it was a very late Friday semi-aborted test so I will double check. But I could see the condition for this fatal error still getting hit (though I didn't spend enough time to truly get the crash, I just verified that the "type" of relocation block was still in use).
Note the latest post on the LLVM project here https://github.com/llvm/llvm-project/issues/65641#issuecomment-1712418435 in reply to my question about "Memory Managers"
If the "MemoryManager" is what doles out this memory to LLVM, then, maybe that is the problem....? According to them OSL is using its own "MemoryManager" because....(?)
Okay.... some new info....
OSL uses a custom memory manager that is held by the rendering threads' per-thread-info stuff. And this memory manager is kept around until the last rendering thread dies.
Sounds reasonable on paper....
Except... we use TBB for rendering. TBB actually has a set of worker threads that are always in flight. So those threads never die. So no destructor is ever hit on the per-thread data.
So the memory manager ends up being kept around forever.
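The lifetime issue can be reproduced in miniature with plain standard C++ threads. This is only a sketch: `PerThreadInfo` is a hypothetical stand-in for the per-thread data, and a thread parked on a condition variable stands in for a pooled TBB worker. It assumes nothing about OSL's actual classes.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for per-thread data that owns a memory manager.
struct PerThreadInfo {
    explicit PerThreadInfo(std::atomic<int>& d) : destroyed(d) {}
    ~PerThreadInfo() { ++destroyed; }  // runs only when the owning thread exits
    std::atomic<int>& destroyed;
};

// Creates the calling thread's PerThreadInfo on first use.
// Only call this from spawned threads, so `destroyed` outlives them.
void touch_per_thread_info(std::atomic<int>& destroyed) {
    thread_local PerThreadInfo info(destroyed);
    (void)info;
}

// A thread that runs and exits: its per-thread data is destroyed.
int destroyed_after_thread_exit() {
    std::atomic<int> destroyed{0};
    std::thread t([&] { touch_per_thread_info(destroyed); });
    t.join();
    return destroyed.load();
}

// A pooled-style worker that stays alive: sample the destructor count
// while the worker is still parked, then let it go.
int destroyed_while_worker_parked() {
    std::atomic<int> destroyed{0};
    std::mutex m;
    std::condition_variable cv;
    bool touched = false, release = false;
    std::thread worker([&] {
        touch_per_thread_info(destroyed);
        std::unique_lock<std::mutex> lock(m);
        touched = true;
        cv.notify_all();
        cv.wait(lock, [&] { return release; });
    });
    int seen = 0;
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return touched; });
        seen = destroyed.load();  // worker is alive, destructor has not run
        release = true;
    }
    cv.notify_all();
    worker.join();
    return seen;
}
```

The first function reports one destruction and the second reports none while the worker is parked. On some toolchains this needs `-pthread` at link time.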
That wouldn't be a big deal, in the normal case. Except I also see this in the OSL wrapped memory manager (https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/main/src/liboslexec/llvm_util.cpp#L244):
Okay, so if memory is never ever thrown away, of course we can get beyond a 2GB limit.
I tested it, and in Max, the memory manager isn't destroyed until the app closes.....
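To put a rough number on "never thrown away", here is a back-of-the-envelope model. It is hypothetical and heavily simplified: it assumes every recompile takes a fixed amount of fresh address space that keeps growing away from the base, and the 1 MiB per recompile figure is an arbitrary illustration, not a measurement from OSL or 3ds Max.

```cpp
#include <cstdint>

constexpr std::uint64_t kMiB = 1ull << 20;

// Hypothetical model, not OSL or LLVM code: smallest number of recompiles n
// such that n * bytes_per_compile exceeds max_offset, assuming nothing is
// ever freed. Precondition: bytes_per_compile > 0.
constexpr std::uint64_t compiles_until_out_of_range(
        std::uint64_t bytes_per_compile, std::uint64_t max_offset) {
    return max_offset / bytes_per_compile + 1;
}

// With the assumed 1 MiB per recompile:
static_assert(compiles_until_out_of_range(kMiB, UINT32_MAX) == 4096,
              "4 GiB limit is passed on the 4096th recompile");
static_assert(compiles_until_out_of_range(kMiB, INT32_MAX) == 2048,
              "2 GiB limit is passed on the 2048th recompile");
```

A few thousand recompiles is well within reach of someone repeatedly tweaking material parameters during an interactive render, which fits with the crash showing up only after many edits.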
Problem
In 3ds Max, we have lots of users crashing with a callstack that seems to be caused by this problem. We have a scene that "reproduces" the problem, but the reproduction is intermittent and seems to be a race condition of sorts. Basically, you load a particular file, you start an interactive render and the material editor at the same time, then start changing parameters in the material many many many many many times. Eventually, we get this crash. Or not. Depending on phase of the moon, the wind direction, humidity, etc.
Crash is reported on this line:
...i.e. in the case this function is reached before the shader has been optimized. Somehow, it seems like the call to exec->finalizeObject(); crashes.
The call stack is something like this:
Expected behavior:
For it not to crash?
Actual behavior:
It crashes. Sometimes.
Steps to Reproduce
Versions