AcademySoftwareFoundation / OpenShadingLanguage

Advanced shading language for production GI renderers
BSD 3-Clause "New" or "Revised" License

Intermittent crash in LLVM_Util::getPointerToFunction(llvm::Function* func) #1712

Open ZapAndersson opened 1 year ago

ZapAndersson commented 1 year ago

Problem

In 3ds Max, we have lots of users crashing with a callstack that seems to be caused by this problem. We have a scene that "reproduces" the problem, but the reproduction is intermittent and seems to be a race condition of sorts. Basically, you load a particular file, you start an interactive render and the material editor at the same time, then start changing parameters in the material many many many many many times. Eventually, we get this crash. Or not. Depending on phase of the moon, the wind direction, humidity, etc.

Crash is reported on this line:

image

...i.e. in the case this function is reached before the shader has been optimized. Somehow, it seems like the call to exec->finalizeObject(); crashes.

The call stack is something like this:

    oslexec.dll!OSL_v1_12::pvt::LLVM_Util::getPointerToFunction(llvm::Function func) Line 1712 C++
    oslexec.dll!OSL_v1_12::pvt::BackendLLVM::run() Line 1674 C++
    oslexec.dll!OSL_v1_12::pvt::ShadingSystemImpl::optimize_group(OSL_v1_12::ShaderGroup & group, OSL_v1_12::ShadingContext ctx, bool do_jit) Line 3595 C++
    oslexec.dll!OSL_v1_12::ShadingContext::execute_init(OSL_v1_12::ShaderGroup & sgroup, int shadeindex, OSL_v1_12::ShaderGlobals & ssg, void userdata_base_ptr, void output_base_ptr, bool run) Line 91 C++
    oslexec.dll!OSL_v1_12::ShadingContext::execute(OSL_v1_12::ShaderGroup & sgroup, int shadeindex, OSL_v1_12::ShaderGlobals & ssg, void userdata_base_ptr, void output_base_ptr, bool run) Line 217 C++
    oslexec.dll!OSL_v1_12::pvt::ShadingSystemImpl::execute(OSL_v1_12::ShadingContext & ctx, OSL_v1_12::ShaderGroup & group, int index, OSL_v1_12::ShaderGlobals & ssg, void userdata_base_ptr, void output_base_ptr, bool run) Line 3264 C++
    [Inline Frame] OSLMap.dlt!OSL_v1_12::ShadingSystem::execute(OSL_v1_12::ShadingContext &) Line 688 C++
    [Inline Frame] OSLMap.dlt!OSL_v1_12::ShadingSystem::execute(OSL_v1_12::ShadingContext ) Line 695 C++
    OSLMap.dlt!OSLTex::EvalColor(ShadeContext & sc, int output, bool bump) Line 3227 C++
    OSLMap.dlt!OSLTex::EvalColor(ShadeContext & sc) Line 2936 C++
    3dsmax.exe!RenderTexmapRange::l5::::operator()(const tbb::blocked_range & rng) Line 1951 C++
    [Inline Frame] 3dsmax.exe!tbb::interface9::internal::start_for<tbb::blocked_range,RenderTexmapRange'::5'::,tbb::auto_partitioner const>::run_body(tbb::blocked_range &) Line 115 C++
    3dsmax.exe!tbb::interface9::internal::dynamic_grainsize_mode<tbb::interface9::internal::adaptive_mode>::work_balance<tbb::interface9::internal::start_for<tbb::blocked_range,RenderTexmapRange'::5'::,tbb::auto_partitioner const>,tbb::blocked_range>(tbb::interface9::internal::start_for<tbb::blocked_range,RenderTexmapRange'::5'::,tbb::auto_partitioner const> & start, tbb::blocked_range & range) Line 439 C++
    3dsmax.exe!tbb::interface9::internal::start_for<tbb::blocked_range,RenderTexmapRange'::5'::,tbb::auto_partitioner const>::execute() Line 143 C++
    [External Code]
    [Inline Frame] 3dsmax.exe!tbb::task::spawn_root_and_wait(tbb::task &) Line 809 C++
    [Inline Frame] 3dsmax.exe!tbb::interface9::internal::start_for<tbb::blocked_range,RenderTexmapRange'::5'::,tbb::auto_partitioner const>::run(const tbb::blocked_range &) Line 95 C++
    [Inline Frame] 3dsmax.exe!tbb::parallel_for(const tbb::blocked_range &) Line 201 C++
    3dsmax.exe!RenderTexmapRange(HWND hwnd, Texmap tx, Bitmap bm, FBox2 range, float scale3d, int filter, int display, int t, const wchar_t name, float z, int mono, bool disableBitmapProxies, bool bake) Line 1925 C++
    3dsmax.exe!RenderTexmap(HWND__ hwnd, Texmap tex, Bitmap bm, float scale3d, int filter, int display, int t, const wchar_t name, float z, int mono, bool disableBitmapProxies, bool bake) Line 1877 C++
    3dsmax.exe!InterfaceImp::Execute(int cmd, unsigned int64 arg1, unsigned int64 arg2, unsigned int64 arg3, unsigned int64 arg4, unsigned int64 arg5, unsigned int64 arg6) Line 6844 C++
    core.dll!Texmap::GetVPDisplayDIB(int t, TexHandleMaker & thmaker, Interval & valid, int mono, int forceW, int forceH) Line 3851 C++

Expected behavior:

It not to crash?

Actual behavior:

It crash. Sometimes.

ZapAndersson commented 1 year ago

Because this is so intermittent, it's hard to debug, and often I get a crash with no useful callstack, only an "abort was called" exception. I will try to figure out more, but if the above gives you any "Eureka" ideas @lgritz let me know.

ZapAndersson commented 1 year ago

I'm wondering if it can have anything to do with issue #1427 ?

ZapAndersson commented 1 year ago

Better call stack, with some of the LLVM stuff untangled: @lgritz

    oslexec.dll!llvm::report_fatal_error(const llvm::Twine & Reason, bool GenCrashDiag) Line 122    C++
    oslexec.dll!llvm::report_fatal_error(const char * Reason, bool GenCrashDiag) Line 83    C++
>   oslexec.dll!llvm::RuntimeDyldCOFFX86_64::resolveRelocation(const llvm::RelocationEntry & RE, unsigned __int64 Value) Line 117   C++
    oslexec.dll!llvm::RuntimeDyldImpl::resolveRelocationList(const llvm::SmallVector<llvm::RelocationEntry,64> & Relocs, unsigned __int64 Value) Line 1106  C++
    oslexec.dll!llvm::RuntimeDyldImpl::resolveLocalRelocations() Line 149   C++
    oslexec.dll!llvm::RuntimeDyldImpl::resolveRelocations() Line 145    C++
    oslexec.dll!llvm::MCJIT::finalizeLoadedModules() Line 244   C++
    oslexec.dll!llvm::MCJIT::finalizeObject() Line 270  C++
    oslexec.dll!OSL_v1_12::pvt::LLVM_Util::getPointerToFunction(llvm::Function * func) Line 1714    C++
    oslexec.dll!OSL_v1_12::pvt::BackendLLVM::run() Line 1674    C++
    oslexec.dll!OSL_v1_12::pvt::ShadingSystemImpl::optimize_group(OSL_v1_12::ShaderGroup & group, OSL_v1_12::ShadingContext * ctx, bool do_jit) Line 3595   C++
    oslexec.dll!OSL_v1_12::ShadingContext::execute_init(OSL_v1_12::ShaderGroup & sgroup, int shadeindex, OSL_v1_12::ShaderGlobals & ssg, void * userdata_base_ptr, void * output_base_ptr, bool run) Line 91    C++
    oslexec.dll!OSL_v1_12::ShadingContext::execute(OSL_v1_12::ShaderGroup & sgroup, int shadeindex, OSL_v1_12::ShaderGlobals & ssg, void * userdata_base_ptr, void * output_base_ptr, bool run) Line 217    C++
    oslexec.dll!OSL_v1_12::pvt::ShadingSystemImpl::execute(OSL_v1_12::ShadingContext & ctx, OSL_v1_12::ShaderGroup & group, int index, OSL_v1_12::ShaderGlobals & ssg, void * userdata_base_ptr, void * output_base_ptr, bool run) Line 3264    C++
    [Inline Frame] OSLMap.dlt!OSL_v1_12::ShadingSystem::execute(OSL_v1_12::ShadingContext &) Line 688   C++
    [Inline Frame] OSLMap.dlt!OSL_v1_12::ShadingSystem::execute(OSL_v1_12::ShadingContext *) Line 695   C++
    OSLMap.dlt!OSLTex::EvalColor(ShadeContext & sc, int output, bool bump) Line 3227    C++
    OSLMap.dlt!OSLTex::EvalColor(ShadeContext & sc) Line 2936   C++
    3dsmax.exe!RenderTexmapRange::__l5::<lambda_1>::operator()(const tbb::blocked_range<int> & rng) Line 1950   C++
    [Inline Frame] 3dsmax.exe!tbb::interface9::internal::start_for<tbb::blocked_range<int>,`RenderTexmapRange'::`5'::<lambda_1>,tbb::auto_partitioner const>::run_body(tbb::blocked_range<int> &) Line 115  C++
    3dsmax.exe!tbb::interface9::internal::dynamic_grainsize_mode<tbb::interface9::internal::adaptive_mode<tbb::interface9::internal::auto_partition_type>>::work_balance<tbb::interface9::internal::start_for<tbb::blocked_range<int>,`RenderTexmapRange'::`5'::<lambda_1>,tbb::auto_partitioner const>,tbb::blocked_range<int>>(tbb::interface9::internal::start_for<tbb::blocked_range<int>,`RenderTexmapRange'::`5'::<lambda_1>,tbb::auto_partitioner const> & start, tbb::blocked_range<int> & range) Line 439  C++
    3dsmax.exe!tbb::interface9::internal::start_for<tbb::blocked_range<int>,`RenderTexmapRange'::`5'::<lambda_1>,tbb::auto_partitioner const>::execute() Line 143   C++

ZapAndersson commented 1 year ago

The actual abort is here image

Called from here: image

called from here: image

called from here: image

Called from here:

image

Called from: image

Called from OSL here (as per the original message above): image

ZapAndersson commented 1 year ago

I react especially to this line..... image

ThiagoIze commented 1 year ago

Also, the llvm comment says 2GB and yet the check is done with UINT32_MAX which is 4GB. Is the comment wrong or should the code be changed to INT32_MAX? I don't know if that's the source of these problems (if anything it would make the errors happen more often if changed to 2GB).
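To make the gap between the two candidate bounds concrete, here is a small sketch. The helper names are made up for illustration and this is not LLVM's code; it only shows which offsets pass a `UINT32_MAX` check but would fail an `INT32_MAX` one.

```cpp
#include <cstdint>

// Hypothetical helpers (not LLVM functions) mirroring the two bounds
// mentioned above.
constexpr bool fitsUnsigned32(std::uint64_t offset) {
    return offset <= UINT32_MAX;  // accepts anything below 4 GiB
}

constexpr bool fitsSigned32(std::uint64_t offset) {
    return offset <= static_cast<std::uint64_t>(INT32_MAX);  // below 2 GiB
}

constexpr std::uint64_t GiB = 1ull << 30;

// An offset of 3 GiB sits between the two bounds.
static_assert(fitsUnsigned32(3 * GiB), "3 GiB passes a UINT32_MAX check");
static_assert(!fitsSigned32(3 * GiB), "3 GiB fails an INT32_MAX check");
```

So switching the check to `INT32_MAX` would reject every offset between 2 GiB and 4 GiB that is accepted today, which matches the guess that it would make the error fire more often.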

ZapAndersson commented 1 year ago

So it seems this IMAGE_REL_AMD64_ADDR32NB mode stores a 32-bit offset relative to the image base, but the one at the end of the above screenshot, IMAGE_REL_AMD64_ADDR64, is a true 64-bit address.
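As a rough mental model of the difference, assuming the 32-bit mode stores `target - imageBase` and the 64-bit mode stores the full address, the two encodings could be sketched like this. The function names are hypothetical and this is a simplification, not LLVM's implementation.

```cpp
#include <cstdint>

// Simplified model of a 32-bit image-relative relocation (hypothetical
// helper). Returns false when the target cannot be expressed as a 32-bit
// offset from imageBase, which is the situation that leads to the fatal
// error in the stack trace above.
bool encodeImageRelative32(std::uint64_t target, std::uint64_t imageBase,
                           std::uint32_t& encoded) {
    if (target < imageBase)
        return false;  // target lies below the image base
    const std::uint64_t offset = target - imageBase;
    if (offset > UINT32_MAX)
        return false;  // offset does not fit in 32 bits
    encoded = static_cast<std::uint32_t>(offset);
    return true;
}

// Simplified model of a 64-bit absolute relocation: any address fits.
std::uint64_t encodeAbsolute64(std::uint64_t target) {
    return target;
}
```

In this model the 64-bit form can never fail, which would explain why forcing it makes the crash go away, but it says nothing about whether consumers of that relocation expect an image-relative value.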

I made a Godawful Hack(tm) in the LLVM code like so, so that any function that decided to use the former mode used the latter mode instead....:

image

....and the problem disappeared.

Now is this a good fix?

I highly doubt it, but.....??

/Z

@lgritz

lgritz commented 1 year ago

I think we should report this on the llvm-dev forums, probably in the "code generation" board?

Zap, can you take care of that? I feel like it's more efficient for you to do that communication rather than me having to be the go-between. You're much more familiar with the relevant LLVM stack traces and internals than I am at this point.

I think there are three things to try to get out of that interaction:

  1. Have somebody on the LLVM team confirm that we're on the right track, that this patch is essentially correct and does no additional harm, or else that we're totally misguided and there is something different we should be doing to address the problem.
  2. Convince somebody there to take the ball and turn this (or any other approach they prefer) into a patch that will permanently fix future LLVM releases.
  3. If they have a suggestion for something we can do on the OSL side to avoid this, that's even better. Like, are we hitting a 32 bit limit only because we are being exceptionally silly about what we're handing LLVM, forgetting to clear something between shader group builds, or something like that?

Now, on our end, we are in a bit of a pickle in that we still have a lot of work to make OSL work with LLVM 16+. They are close to releasing 17, and definitely will not backport fixes as far back as 15. So you may be forced to maintain those patches on your end at Autodesk (you seem to be the only ones running into this problem) until we can all upgrade to the latest LLVM that would have a fix. But like I said, if they have a suggestion for how to ameliorate the problem from our side, that's the best option.

ZapAndersson commented 1 year ago

Well actually I got a lot of (probably great, but I barely understand them due to being a total LLVM noob) replies here: https://github.com/llvm/llvm-project/issues/65641

Does any of that tell you anything?

They say this bit is only relevant for debugging and "exception handling", are we using exception handling in OSL?

They say we can "turn it off and the problem goes away".

/Z

ZapAndersson commented 1 year ago

Yes, lots of good replies at https://github.com/llvm/llvm-project/issues/65641 ...

OSL has a line that reads (in llvm_util.cpp https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/main/src/liboslexec/llvm_util.cpp#L1442)

//engine_builder.setCodeModel(llvm::CodeModel::Default);

I'll try to set it to "::Large" or "::Medium" and see if this changes things (apparently ::Small is the default?). Does this make sense?

ZapAndersson commented 1 year ago

> 1. Have somebody on the LLVM team confirm that we're on the right track, that this patch is essentially correct and does no additional harm, or else that we're totally misguided and there is something different we should be doing to address the problem.

Well, we have that already. My hack is most certainly WRONG :)

lgritz commented 1 year ago

Exceptions: we're definitely not relying on them. But perhaps there is a way to explicitly turn them off, which we have neglected to do?

setCodeModel: that may be fruitful. What happens if you make this call, and pass llvm::CodeModel::Large?

ZapAndersson commented 1 year ago

In my quick test, setting CodeModel::Large did not change anything, but it was a very late Friday semi-aborted test so I will double check. But I could see the condition for this fatal error still getting hit (though I didn't spend enough time to truly get the crash, I just verified that the "type" of relocation block was still in use).

Note the latest post on the LLVM project here https://github.com/llvm/llvm-project/issues/65641#issuecomment-1712418435 in reply to my question about "Memory Managers"

If the "MemoryManager" is what doles out this memory to LLVM, then, maybe that is the problem....? According to them OSL is using its own "MemoryManager" because....(?)

ZapAndersson commented 1 year ago

Okay.... some new info....

OSL uses a custom memory manager that is held by the rendering threads' per-thread info. And this memory manager is kept around until the last rendering thread dies.

Sounds reasonable on paper....

Except... we use TBB for rendering. TBB actually has a set of worker threads that are always in flight. So those threads never die. So no destructor is ever hit on the per-thread data.

So the memory manager ends up being kept around forever.
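The lifetime issue can be shown in isolation with plain C++ threads. This is only a sketch under stated assumptions: `PerThreadInfo` and `ParkedWorker` are made-up stand-ins, not OSL or TBB types, and the worker simply parks itself after doing its job the way a pooled thread would.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Counts how many PerThreadInfo objects have been destroyed.
std::atomic<int> g_destroyed{0};

// Hypothetical stand-in for per-thread data that owns something expensive.
struct PerThreadInfo {
    ~PerThreadInfo() { ++g_destroyed; }
    void touch() {}
};

// A worker that stays alive after finishing its job, like a pooled thread.
class ParkedWorker {
public:
    ParkedWorker() : thread_([this] { run(); }) {}
    ~ParkedWorker() { release(); }

    // Block until the worker has created and used its thread-local data.
    void waitUntilWorked() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return worked_; });
    }

    // Let the thread exit; thread_local destructors run only at this point.
    void release() {
        {
            std::lock_guard<std::mutex> lock(m_);
            quit_ = true;
        }
        cv_.notify_all();
        if (thread_.joinable())
            thread_.join();
    }

private:
    void run() {
        thread_local PerThreadInfo info;  // destroyed when this thread exits
        info.touch();
        std::unique_lock<std::mutex> lock(m_);
        worked_ = true;
        cv_.notify_all();
        cv_.wait(lock, [this] { return quit_; });  // park until released
    }

    std::mutex m_;
    std::condition_variable cv_;
    bool worked_ = false;
    bool quit_ = false;
    std::thread thread_;  // declared last so the members above exist first
};
```

After `waitUntilWorked()` returns, `g_destroyed` is still 0 no matter how long the program keeps running; it becomes 1 only once `release()` lets the thread exit. With a pool whose workers are never released, that second step never happens.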

That wouldn't be a big deal, in the normal case. Except I also see this in the OSL wrapped memory manager (https://github.com/AcademySoftwareFoundation/OpenShadingLanguage/blob/main/src/liboslexec/llvm_util.cpp#L244):

image

Okay, so if memory is never ever thrown away, of course we can get beyond a 2GB limit.
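A back-of-the-envelope model of that growth, using simulated addresses only. `GrowOnlyPool` and the per-edit size are assumptions for illustration, not how OSL's memory manager is actually written.

```cpp
#include <cstdint>

// Hypothetical model of a JIT memory pool that only ever grows.
// Addresses are plain numbers here; nothing is really allocated.
class GrowOnlyPool {
public:
    explicit GrowOnlyPool(std::uint64_t base) : base_(base), next_(base) {}

    // Hand out the next `size` bytes and never reclaim them.
    std::uint64_t allocate(std::uint64_t size) {
        const std::uint64_t addr = next_;
        next_ += size;
        return addr;
    }

    // Distance from the first byte handed out to the current end of the pool.
    std::uint64_t span() const { return next_ - base_; }

private:
    std::uint64_t base_;
    std::uint64_t next_;
};

// Number of equally sized allocations needed before the pool's span
// exceeds `limit`. Returns 0 for a zero-sized allocation to avoid looping
// forever.
std::uint64_t editsUntilSpanExceeds(std::uint64_t bytesPerEdit,
                                    std::uint64_t limit) {
    if (bytesPerEdit == 0)
        return 0;
    GrowOnlyPool pool(0x10000000ull);  // arbitrary simulated base address
    std::uint64_t edits = 0;
    while (pool.span() <= limit) {
        pool.allocate(bytesPerEdit);
        ++edits;
    }
    return edits;
}
```

With an assumed 1 MiB of JIT output per parameter change, `editsUntilSpanExceeds(1ull << 20, 2ull << 30)` returns 2049, so a couple of thousand edits in a long session would be enough to push the span past 2 GiB in this model.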

I tested it, and in max, the memory manager isn't destroyed until the app closes.....