ImageEngine / cortex

Libraries for visual effects software development
Other
528 stars 123 forks source link

possible threading issue on Windows build #919

Closed ericmehl closed 4 years ago

ericmehl commented 5 years ago

I was digging into a crash problem with expanding the GafferBot hierarchy in my Windows work-in-progress build of Gaffer and it may be leading me to Cortex so I wanted to raise the issue sooner than later with the Windows Cortex PR #916 under review.

I'm a little out of my depth here but eager to learn so please bear with me if you will. Here are the steps that lead to a crash (all on Windows and I assume unique to that platform):

  1. In Gaffer, load the GafferBot in a SceneReader node
  2. Expand the hierarchy of the full bot -> crash
  3. Expand the hierarchy of a single part -> maybe a crash
  4. Expand the hierarchy of a few parts one after another -> eventually crash
  5. BUT: Expand the hierarchy while the system is under near 100% load from another program -> GafferBot hierarchy expands without problems

I suspect it is a memory access issue as a result of multi-threading. At one point I was looking into the problem while my CPU was under heavy load on another program, slowing down Gaffer quite a bit, and loading the GafferBot worked great. So my theory is that it was running slowly enough that the race condition didn't cause a problem. Does that sound valid?

I'm wondering if this sparks any ideas from @johnhaddon or others who are far more steeped in the code than myself, if it sounds like a Cortex problem or Gaffer, and if you have any tips on debugging this kind of multi-threading problem. I'll continue to dig as much as I can but wanted to get some other brains involved too. Perhaps Windows is missing a data lock somewhere?

I have this stack trace from the crash. (Or here: stack_trace.txt) I'm working on sorting out how to get the Python calls included in the call stack and I'll post that when I can figure it out.

> ntdll.dll!00007ffd7571aed2()
ntdll.dll!00007ffd7572379e()
ntdll.dll!00007ffd75723aaa()
ntdll.dll!00007ffd756bebb1()
ntdll.dll!00007ffd756ccd22()
ucrtbase.dll!00007ffd71badf7b()
[Inline Frame] IECoreScene.dll!std::_Default_allocator_traits<std::allocator<std::_Tree_node<std::pair<std::basic_string<char,std::char_traits<char>,std::allocator<char> > const ,IECoreScene::PrimitiveVariable>,void *> > >::deallocate() Line 873
    at c:\program files (x86)\microsoft visual studio\2017\community\vc\tools\msvc\14.16.27023\include\xmemory0(873)
[Inline Frame] IECoreScene.dll!std::_Tree_node<std::pair<std::basic_string<char,std::char_traits<char>,std::allocator<char> > const ,IECoreScene::PrimitiveVariable>,void *>::_Freenode0() Line 414
    at c:\program files (x86)\microsoft visual studio\2017\community\vc\tools\msvc\14.16.27023\include\xtree(414)
[Inline Frame] IECoreScene.dll!std::_Tree<std::_Tmap_traits<std::basic_string<char,std::char_traits<char>,std::allocator<char> >,IECoreScene::PrimitiveVariable,std::less<std::basic_string<char,std::char_traits<char>,std::allocator<char> > >,std::allocator<std::pair<std::basic_string<char,std::char_traits<char>,std::allocator<char> > const ,IECoreScene::PrimitiveVariable> >,0> >::erase() Line 1379
    at c:\program files (x86)\microsoft visual studio\2017\community\vc\tools\msvc\14.16.27023\include\xtree(1379)
IECoreScene.dll!`anonymous namespace'::convertLegacyVariables(variables={...}) Line 327
    at m:\cortex\src\iecorescene\primitive.cpp(327)
IECoreScene.dll!IECoreScene::Primitive::load(context={...}) Line 375
    at m:\cortex\src\iecorescene\primitive.cpp(375)
IECoreScene.dll!IECoreScene::MeshPrimitive::load(context={...}) Line 254
    at m:\cortex\src\iecorescene\meshprimitive.cpp(254)
IECore.dll!IECore::Object::LoadContext::loadObject(container) Line 327
    at m:\cortex\src\iecore\object.cpp(327)
IECore.dll!IECore::Object::LoadContext::loadObjectOrReference(container=0x000001a7c913d5c0, name={...}) Line 311
    at m:\cortex\src\iecore\object.cpp(311)
IECore.dll!IECore::Object::LoadContext::load<IECore::Object>(i, name) Line 87
    at m:\cortex\include\iecore\object.inl(87)
IECore.dll!IECore::Object::load(ioInterface={...}, name={...}) Line 567
    at m:\cortex\src\iecore\object.cpp(567)
IECoreScene.dll!IECoreScene::SceneCache::ReaderImplementation::doReadObjectAtSample(key) Line 1107
    at m:\cortex\src\iecorescene\scenecache.cpp(1107)
IECoreScene.dll!boost::detail::function::function_invoker1<boost::intrusive_ptr<IECore::Object> (__cdecl*)(std::pair<IECoreScene::SceneCache::ReaderImplementation const *,unsigned __int64> const &),boost::intrusive_ptr<IECore::Object const >,std::pair<IECoreScene::SceneCache::ReaderImplementation const *,unsigned __int64> const &>::invoke(function_ptr, a0) Line 101
    at m:\gaffer_build\include\boost-1_61\boost\function\function_template.hpp(101)
[Inline Frame] IECoreScene.dll!boost::function1<boost::intrusive_ptr<IECore::Object const >,std::pair<IECoreScene::SceneCache::ReaderImplementation const *,unsigned __int64> const &>::operator()() Line 770
    at m:\gaffer_build\include\boost-1_61\boost\function\function_template.hpp(770)
IECoreScene.dll!IECore::ComputationCache<std::pair<IECoreScene::SceneCache::ReaderImplementation const *,unsigned __int64> >::get(args={...}, missingBehaviour=ComputeIfMissing) Line 103
    at m:\cortex\include\iecore\computationcache.inl(103)
IECoreScene.dll!IECoreScene::SceneCache::ReaderImplementation::SharedData::readObjectAtSample(reader=0x000001a7c87777d0, sample=0) Line 916
    at m:\cortex\src\iecorescene\scenecache.cpp(916)
[Inline Frame] IECoreScene.dll!IECoreScene::SceneCache::ReaderImplementation::readObjectAtSample() Line 587
    at m:\cortex\src\iecorescene\scenecache.cpp(587)
IECoreScene.dll!IECoreScene::SceneCache::readObjectAtSample(sampleIndex=0) Line 2466
    at m:\cortex\src\iecorescene\scenecache.cpp(2466)
IECoreScene.dll!IECoreScene::SampledSceneInterface::readObject(time) Line 161
    at m:\cortex\src\iecorescene\sampledsceneinterface.cpp(161)
GafferScene.dll!GafferScene::SceneReader::computeObject(path, context=0x000001a7c87779b0, parent=0x000001a7c95893d0) Line 334
    at m:\gaffer\src\gafferscene\scenereader.cpp(334)
GafferScene.dll!GafferScene::SceneNode::compute(output=0x000001a7c9587880, context=0x000001a7c87779b0) Line 270
    at m:\gaffer\src\gafferscene\scenenode.cpp(270)
Gaffer.dll!Gaffer::ValuePlug::ComputeProcess::ComputeProcess(plug=0x000001a7c9587880, downstream=0x0000003b59e93850) Line 383
    at m:\gaffer\src\gaffer\valueplug.cpp(383)
Gaffer.dll!Gaffer::ValuePlug::ComputeProcess::value(plug=0x000001a7c8c03000, precomputedHash=0x0000000000000000, cachedOnly) Line 304
    at m:\gaffer\src\gaffer\valueplug.cpp(304)
Gaffer.dll!Gaffer::ValuePlug::getObjectValue(precomputedHash) Line 674
    at m:\gaffer\src\gaffer\valueplug.cpp(674)
Gaffer.dll!Gaffer::TypedObjectPlug<IECore::Object>::getValue(precomputedHash) Line 101
    at m:\gaffer\include\gaffer\typedobjectplug.inl(101)
GafferScene.dll!GafferScene::DeleteObject::computeObject(path={...}, context, parent=0x000001a7b8d26850) Line 126
    at m:\gaffer\src\gafferscene\deleteobject.cpp(126)
GafferScene.dll!GafferScene::SceneNode::compute(output=0x000001a7b8d24bd0, context=0x000001a7c87779b0) Line 270
    at m:\gaffer\src\gafferscene\scenenode.cpp(270)
Gaffer.dll!Gaffer::ValuePlug::ComputeProcess::ComputeProcess(plug=0x000001a7b8d24bd0, downstream=0x0000003b59e93ca0) Line 383
    at m:\gaffer\src\gaffer\valueplug.cpp(383)
Gaffer.dll!Gaffer::ValuePlug::ComputeProcess::value(plug=0x000001a7c8c00540, precomputedHash=0x0000003b59e93de0, cachedOnly) Line 304
    at m:\gaffer\src\gaffer\valueplug.cpp(304)
Gaffer.dll!Gaffer::ValuePlug::getObjectValue(precomputedHash) Line 674
    at m:\gaffer\src\gaffer\valueplug.cpp(674)
Gaffer.dll!Gaffer::TypedObjectPlug<IECore::Object>::getValue(precomputedHash) Line 101
    at m:\gaffer\include\gaffer\typedobjectplug.inl(101)
GafferScene.dll!GafferScene::RenderController::SceneGraph::updateObject(objectPlug=0x000001a7c8c00540, type=ObjectType, renderer=0x000001a7c6774c40, globals=0x000001a7b8ffc1c0, scene=0x000001a7c8c02a10) Line 491
    at m:\gaffer\src\gafferscene\rendercontroller.cpp(491)
GafferScene.dll!GafferScene::RenderController::SceneGraph::update(path={...}, changedGlobals, type=ObjectType, controller=0x000001a7c926c618) Line 244
    at m:\gaffer\src\gafferscene\rendercontroller.cpp(244)
GafferScene.dll!GafferScene::RenderController::SceneGraphUpdateTask::execute() Line 748
    at m:\gaffer\src\gafferscene\rendercontroller.cpp(748)
tbb.dll!00007ffd57c7f3ce()
[Inline Frame] GafferScene.dll!tbb::task::wait_for_all() Line 760
    at m:\gaffer_build\include\tbb\task.h(760)
GafferScene.dll!GafferScene::RenderController::SceneGraphUpdateTask::execute() Line 769
    at m:\gaffer\src\gafferscene\rendercontroller.cpp(769)
tbb.dll!00007ffd57c7f3ce()
[Inline Frame] GafferScene.dll!tbb::task::wait_for_all() Line 760
    at m:\gaffer_build\include\tbb\task.h(760)
GafferScene.dll!GafferScene::RenderController::SceneGraphUpdateTask::execute() Line 769
    at m:\gaffer\src\gafferscene\rendercontroller.cpp(769)
tbb.dll!00007ffd57c7f3ce()
[Inline Frame] GafferScene.dll!tbb::task::wait_for_all() Line 760
    at m:\gaffer_build\include\tbb\task.h(760)
GafferScene.dll!GafferScene::RenderController::SceneGraphUpdateTask::execute() Line 769
    at m:\gaffer\src\gafferscene\rendercontroller.cpp(769)
tbb.dll!00007ffd57c7f3ce()
[Inline Frame] GafferScene.dll!tbb::task::wait_for_all() Line 760
    at m:\gaffer_build\include\tbb\task.h(760)
GafferScene.dll!GafferScene::RenderController::SceneGraphUpdateTask::execute() Line 769
    at m:\gaffer\src\gafferscene\rendercontroller.cpp(769)
tbb.dll!00007ffd57c7f3ce()
[Inline Frame] GafferScene.dll!tbb::task::wait_for_all() Line 760
    at m:\gaffer_build\include\tbb\task.h(760)
GafferScene.dll!GafferScene::RenderController::SceneGraphUpdateTask::execute() Line 769
    at m:\gaffer\src\gafferscene\rendercontroller.cpp(769)
tbb.dll!00007ffd57c7f3ce()
tbb.dll!00007ffd57c80a1f()
[Inline Frame] GafferScene.dll!tbb::task::spawn_root_and_wait() Line 749
    at m:\gaffer_build\include\tbb\task.h(749)
GafferScene.dll!GafferScene::RenderController::updateInternal(callback={...}, pathsToUpdate=0x0000000000000000) Line 1148
    at m:\gaffer\src\gafferscene\rendercontroller.cpp(1148)
[Inline Frame] Gaffer.dll!std::_Func_class<void>::operator()()
[Inline Frame] Gaffer.dll!Gaffer::ParallelAlgo::callOnBackgroundThread::__l2::<lambda_7d48a069dbe2f868092036970752470e>::operator()() Line 92
    at m:\gaffer\src\gaffer\parallelalgo.cpp(92)
[Inline Frame] Gaffer.dll!std::_Invoker_functor::_Call()
[Inline Frame] Gaffer.dll!std::invoke()
[Inline Frame] Gaffer.dll!std::_Invoker_ret<void,1>::_Call()
Gaffer.dll!std::_Func_impl_no_alloc<<lambda_7d48a069dbe2f868092036970752470e>,void,IECore::Canceller const &>::_Do_call(<_Args_0>)
[Inline Frame] Gaffer.dll!std::_Func_class<void,IECore::Canceller const &>::operator()() Line 184
    at m:\gaffer\src\gaffer\backgroundtask.cpp(184)
Gaffer.dll!Gaffer::BackgroundTask::{ctor}::__l2::<lambda>() Line 185
    at m:\gaffer\src\gaffer\backgroundtask.cpp(185)
[Inline Frame] Gaffer.dll!std::_Func_class<void>::operator()() Line 71
    at m:\gaffer\src\gaffer\backgroundtask.cpp(71)
Gaffer.dll!`anonymous namespace'::FunctionTask::execute() Line 72
    at m:\gaffer\src\gaffer\backgroundtask.cpp(72)
tbb.dll!00007ffd57c7f3ce()
tbb.dll!00007ffd57c7361d()
tbb.dll!00007ffd57c78ff4()
tbb.dll!00007ffd57c7bd2c()
tbb.dll!00007ffd57c7becd()
ucrtbase.dll!00007ffd71bc03ba()
kernel32.dll!00007ffd74917e94()
ntdll.dll!00007ffd7568a251()
andrewkaufman commented 5 years ago

Based on the stacktrace, I'd say its dying here: https://github.com/ImageEngine/cortex/blob/master/src/IECoreScene/Primitive.cpp#L327-L330

But (unless I'm missing something obvious) I don't know why that'd be impacted by threading. To verify the threading theory, can you run with gaffer -threads 1 and see if it can/cannot be triggered?

It'd be nice to know which primvar is crashing... maybe you can add a std:cerr inside that loop (or attach a debugger)?

johnhaddon commented 5 years ago

My guess would be that we're somehow accessing memory that has already been freed, and get away with it in the single-threaded case because the memory hasn't been reassigned on another thread. That's just a guess though. I spent a while peering at that code in Primitive.cpp, and didn't see a problem. Do you get the same crash @boberfly?

ericmehl commented 5 years ago

Adding -threads 1 solves the problem, GafferBot displays as expected.

I added a std::cerr output in that loop but I'm a little unsure of exactly which variable it's getting tripped up on. Here's the output when I shift+click on the root of the GafferBot hierarchy:

stIndices stIndices s t stIndices s t stIndices s t s t ERROR : MeshAlgo::triangulate : Mesh with invalid primitive variables stIndices s t ERROR : BackgroundTask : Type "" is not a registered Object type. stIndices s stIndices s t t stIndices s t (empty line)

I put the cerr output just before the variables.erase, so that output is what it's just about to erase.

Notably there is a space as the last line before it returns to the command prompt (that's the "empty line" bit) after the crash so I take that to mean it's trying to erase a non-existent variable.

I'm also getting that error from BackgroundTask which I suspect is related to all this - an object type being set to an invalid memory location and getting scrambled data?

ericmehl commented 5 years ago

If I track one level up in the call stack (the ::erase function third [Inline Frame] from the top) I more information on what it's trying to erase:

  Name Value Type
_Myval ("", {interpolation=FaceVarying (5) data={px=0x000001817e03e0f0 {m_data={m_data={px=0x000001817ddba0c0 {data=...} } } } } ...}) std::pair<std::basic_string<char,std::char_traits,std::allocator > const ,IECoreScene::PrimitiveVariable>
  ◢ first "" const std::basic_string<char,std::char_traits,std::allocator >
  [size] 0 unsigned __int64
  [capacity] 15 unsigned __int64
  ▶ [allocator] allocator std::_Compressed_pair<std::allocator,std::_String_val<std::_Simple_types >,1>
  ▶ [Raw View] {...} const std::basic_string<char,std::char_traits,std::allocator >
  ◢ second {interpolation=FaceVarying (5) data={px=0x000001817e03e0f0 {m_data={m_data={px=0x000001817ddba0c0 {data=...} } } } } ...} IECoreScene::PrimitiveVariable
  interpolation FaceVarying (5) IECoreScene::PrimitiveVariable::Interpolation
  ▶ data {px=0x000001817e03e0f0 {m_data={m_data={px=0x000001817ddba0c0 {data={ size=1332 } hash={m_h1=0 m_h2=...} ...} } } } } boost::intrusive_ptr
  ▶ indices {px=0x0000000000000000 } boost::intrusive_ptr<IECore::TypedData<std::vector<int,std::allocator > > >
  ▶ [Raw View] {first="" second={interpolation=FaceVarying (5) data={px=0x000001817e03e0f0 {m_data={m_data={px=0x000001817ddba0c0 {...} } } } } ...} } std::pair<std::basic_string<char,std::char_traits,std::allocator > const ,IECoreScene::PrimitiveVariable>
boberfly commented 5 years ago

I'll need to build and try here again the windows build. I did get something similar to this on Linux when my cycles code frees the mesh primitive that was used to triangulate via TriangulateAlgo when it goes out of scope and the intrusive ptr frees, but that only happened on the cow.scc and not gaffer bot. Could be related?

ericmehl commented 5 years ago

I don't have much to offer on whether the cow and GafferBot issues are related, but for info the cow is not giving me any problems. Including when I parent a few cows together to test multiple scene locations at once - all smooth.

johnhaddon commented 5 years ago

I'm becoming suspicious of the WindowsPlatformReader added in this commit. It's badly named perhaps, but the PlatformReader class is meant to provide thread-safe random access reads to the file, without the non-thread-safe state implied by seek(). WindowsPlatformReader appears to be doing a seek internally, with no protection against conflicts among threads. I suspect this is the root cause of the crashes.

Comments in the code imply that if no PlatformReader is available, we fall back to locked calls to seek()/read(), so perhaps the simplest fix in the first instance is to just remove WindowsPlatformReader?

boberfly commented 5 years ago

Hmm I think on Windows to implement this without the lock you need to use its mmap equivalent, if you can read rust you can get a general idea on what to do here: https://github.com/vasi/positioned-io/blob/master/src/windows.rs#L47

johnhaddon commented 5 years ago

Hmm I think on Windows to implement this without the lock you need to use its mmap equivalent

I don't think that is necessary. See USD/Alembic :

https://github.com/PixarAnimationStudios/USD/blob/2f1494e43e430d7f4187968fd6888d49ade3be80/pxr/base/lib/arch/fileSystem.cpp#L498-L513 https://github.com/alembic/alembic/commit/8bc84e40159388d309428f3d15f5cd29ac44970a#diff-5b427e4373d7ae3d5db85da3603360a4

In any case, I suggest the first thing to do is to get everything working even if it means locking. Then we can do an optimisation pass once everything is stable and tested.

ericmehl commented 5 years ago

Removing the WindowsPlatformReader did the trick! GafferBot is loading up just fine now. I force pushed an update to StreamIndexedIO with that taken out for the PR.

I also had success with a WindowsPlatformReader styled after the USD code from that link, thanks a lot for digging it up.

I'd like to do some performance testing on some different WindowsPlatformReader implementations so I'm keeping that in a separate branch for now and I figure I'll do a follow-up PR once I have a better idea of how all of this is working with multi-threading.

I'm open to discussion of course.

johnhaddon commented 4 years ago

Closing this, since we got to the bottom of it long ago.