Based on the stacktrace, I'd say it's dying here: https://github.com/ImageEngine/cortex/blob/master/src/IECoreScene/Primitive.cpp#L327-L330

But (unless I'm missing something obvious) I don't know why that'd be impacted by threading. To verify the threading theory, can you run with `gaffer -threads 1` and see whether it can still be triggered?
It'd be nice to know which primvar is crashing... maybe you can add a `std::cerr` inside that loop (or attach a debugger)?
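Something like this, purely as a sketch; the loop structure and names here are assumptions, not the actual Primitive.cpp code:

```cpp
#include <iostream>

// Hypothetical sketch only : names and loop structure are assumptions,
// not the real Cortex code.
for( auto it = variables.begin(); it != variables.end(); )
{
	if( !isPrimitiveVariableValid( it->second ) )
	{
		// Print the name of the primvar we're about to erase.
		std::cerr << "Invalid primvar : \"" << it->first << "\"" << std::endl;
		it = variables.erase( it );
	}
	else
	{
		++it;
	}
}
```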
My guess would be that we're somehow accessing memory that has already been freed, and get away with it in the single-threaded case because the memory hasn't been reassigned on another thread. That's just a guess though. I spent a while peering at that code in Primitive.cpp, and didn't see a problem. Do you get the same crash @boberfly?
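By way of illustration only, this is the class of bug I mean; nothing like this exact code exists in Cortex, it just shows why a use-after-free can look fine single-threaded:

```cpp
#include <iostream>
#include <string>

int main()
{
	std::string *name = new std::string( "stIndices" );
	const std::string *alias = name;
	delete name;
	// Undefined behaviour from here on. Single-threaded, the freed block is
	// often left untouched, so this can appear to print the old value. With
	// other threads allocating, the block may be reused and we read scrambled
	// data instead, e.g. an empty or garbage primvar name.
	std::cerr << *alias << std::endl;
}
```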
Adding -threads 1 solves the problem, GafferBot displays as expected.
I added a std::cerr output in that loop but I'm a little unsure of exactly which variable it's getting tripped up on. Here's the output when I shift+click on the root of the GafferBot hierarchy:
```
stIndices stIndices s t stIndices s t stIndices s t s t ERROR : MeshAlgo::triangulate : Mesh with invalid primitive variables stIndices s t ERROR : BackgroundTask : Type "" is not a registered Object type. stIndices s stIndices s t t stIndices s t (empty line)
```
I put the `std::cerr` output just before the `variables.erase`, so that output is what it's just about to erase.

Notably, after the crash there is a space as the last line before it returns to the command prompt (that's the "empty line" bit), so I take that to mean it's trying to erase a non-existent variable.

I'm also getting that error from BackgroundTask, which I suspect is related to all this: an object type being set to an invalid memory location and getting scrambled data?

If I track one level up in the call stack (the `::erase` function, third [Inline Frame] from the top), I get more information about what it's trying to erase:
```
Name                | Value                                                                                                                     | Type
◢ _Myval            | ("", {interpolation=FaceVarying (5) data={px=0x000001817e03e0f0 {m_data={m_data={px=0x000001817ddba0c0 {data=...} } } } } ...}) | std::pair<std::basic_string<char,std::char_traits...
  ◢ first           | ""                                                                                                                        | const std::basic_string<char,std::char_traits...
      [size]        | 0                                                                                                                         | unsigned __int64
      [capacity]    | 15                                                                                                                        | unsigned __int64
    ▶ [allocator]   | allocator                                                                                                                 | std::_Compressed_pair<std::allocator...
    ▶ [Raw View]    | {...}                                                                                                                     | const std::basic_string<char,std::char_traits...
  ◢ second          | {interpolation=FaceVarying (5) data={px=0x000001817e03e0f0 {m_data={m_data={px=0x000001817ddba0c0 {data=...} } } } } ...} | IECoreScene::PrimitiveVariable
      interpolation | FaceVarying (5)                                                                                                           | IECoreScene::PrimitiveVariable::Interpolation
    ▶ data          | {px=0x000001817e03e0f0 {m_data={m_data={px=0x000001817ddba0c0 {data={ size=1332 } hash={m_h1=0 m_h2=...} ...} } } } }     | boost::intrusive_ptr...
    ▶ indices       | {px=0x0000000000000000}                                                                                                   | boost::intrusive_ptr<IECore::TypedData<std::vector<int,std::allocator...
  ▶ [Raw View]      | {first="" second={interpolation=FaceVarying (5) data={px=0x000001817e03e0f0 {m_data={m_data={px=0x000001817ddba0c0 {...} } } } } ...} } | std::pair<std::basic_string<char,std::char_traits...
```
I'll need to build the Windows version and try this again here. I did get something similar to this on Linux, when my Cycles code frees the mesh primitive that was used to triangulate via TriangulateAlgo: once it goes out of scope, the intrusive_ptr frees it. But that only happened on cow.scc and not GafferBot. Could it be related?
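Roughly this pattern, a sketch from memory rather than my actual code:

```cpp
#include "IECore/RunTimeTyped.h"
#include "IECore/VectorTypedData.h"
#include "IECoreScene/MeshAlgo.h"
#include "IECoreScene/MeshPrimitive.h"

// Sketch from memory, not the real Cycles conversion code : keep a raw
// pointer into the triangulated mesh, then let the owning intrusive_ptr
// go out of scope.
const Imath::V3f *convert( const IECoreScene::MeshPrimitive *mesh )
{
	IECoreScene::MeshPrimitivePtr triangulated = IECoreScene::MeshAlgo::triangulate( mesh );
	const IECore::V3fVectorData *pData = IECore::runTimeCast<const IECore::V3fVectorData>(
		triangulated->variables.at( "P" ).data.get()
	);
	// Dangerous : a pointer into data owned solely by `triangulated`.
	return pData->readable().data();
} // Refcount hits zero here, the mesh (and its "P" data) is freed, and the
  // returned pointer dangles.
```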
I don't have much to offer on whether the cow and GafferBot issues are related, but for info the cow is not giving me any problems, including when I parent a few cows together to test multiple scene locations at once. All smooth.
I'm becoming suspicious of the WindowsPlatformReader added in this commit. The PlatformReader class is perhaps badly named, but it is meant to provide thread-safe random-access reads from the file, without the non-thread-safe state implied by `seek()`. WindowsPlatformReader appears to be doing a seek internally, with no protection against conflicts among threads, and I suspect this is the root cause of the crashes.

Comments in the code imply that if no PlatformReader is available, we fall back to locked calls to `seek()`/`read()`, so perhaps the simplest fix in the first instance is just to remove WindowsPlatformReader?
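Just to spell out what I mean by the locked fallback, something along these lines (illustrative names, not the real StreamIndexedIO code):

```cpp
#include <cstddef>
#include <fstream>
#include <mutex>
#include <string>

// Illustrative sketch of the locked fallback : serialise each seek()/read()
// pair behind a mutex, so concurrent readers can't interleave and corrupt
// the shared stream position. Names are assumptions, not the real code.
class LockedReader
{

	public :

		LockedReader( const std::string &fileName )
			:	m_stream( fileName, std::ios::binary )
		{
		}

		// Thread-safe positional read : slower than a true positional
		// read, but correct.
		void read( char *buffer, size_t size, size_t pos )
		{
			std::lock_guard<std::mutex> lock( m_mutex );
			m_stream.seekg( pos );
			m_stream.read( buffer, size );
		}

	private :

		std::ifstream m_stream;
		std::mutex m_mutex;

};
```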
Hmm, I think on Windows, to implement this without the lock, you need to use its mmap equivalent. If you can read Rust you can get a general idea of what to do here: https://github.com/vasi/positioned-io/blob/master/src/windows.rs#L47
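Something like this, just sketched from the Win32 docs (untested, error handling omitted):

```cpp
#include <windows.h>

// Rough sketch of the Win32 mmap equivalent : map the file once, then any
// thread can read at any offset through the returned pointer, with no
// shared seek position and no lock. Error handling omitted for brevity.
const char *mapFileForReading( const char *fileName )
{
	HANDLE file = CreateFileA(
		fileName, GENERIC_READ, FILE_SHARE_READ,
		nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr
	);
	HANDLE mapping = CreateFileMappingA( file, nullptr, PAGE_READONLY, 0, 0, nullptr );
	// Map the whole file read-only; `result + offset` is then a thread-safe
	// random-access view of the file contents.
	return static_cast<const char *>( MapViewOfFile( mapping, FILE_MAP_READ, 0, 0, 0 ) );
}
```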
> Hmm I think on Windows to implement this without the lock you need to use its mmap equivalent

I don't think that is necessary. See USD/Alembic:

https://github.com/PixarAnimationStudios/USD/blob/2f1494e43e430d7f4187968fd6888d49ade3be80/pxr/base/lib/arch/fileSystem.cpp#L498-L513
https://github.com/alembic/alembic/commit/8bc84e40159388d309428f3d15f5cd29ac44970a#diff-5b427e4373d7ae3d5db85da3603360a4
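The gist of both, as a rough sketch rather than the exact USD/Alembic code (error handling trimmed):

```cpp
#include <windows.h>
#include <cstdint>

// pread-style positional read, per the links above : the offset travels in
// the OVERLAPPED structure with each call, so readers don't depend on a
// shared seek position and need no lock.
int64_t positionalRead( HANDLE file, void *buffer, DWORD size, uint64_t offset )
{
	OVERLAPPED overlapped = {};
	overlapped.Offset = static_cast<DWORD>( offset );
	overlapped.OffsetHigh = static_cast<DWORD>( offset >> 32 );

	DWORD bytesRead = 0;
	if( !ReadFile( file, buffer, size, &bytesRead, &overlapped ) )
	{
		return -1;
	}
	return bytesRead;
}
```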
In any case, I suggest the first thing to do is to get everything working even if it means locking. Then we can do an optimisation pass once everything is stable and tested.
Removing the WindowsPlatformReader did the trick! GafferBot is loading up just fine now. I force pushed an update to StreamIndexedIO with that taken out for the PR.
I also had success with a WindowsPlatformReader styled after the USD code from that link, thanks a lot for digging it up.
I'd like to do some performance testing on a few different WindowsPlatformReader implementations, so I'm keeping that in a separate branch for now; I figure I'll do a follow-up PR once I have a better idea of how all of this works with multi-threading.
I'm open to discussion of course.
Closing this, since we got to the bottom of it long ago.
I was digging into a crash when expanding the GafferBot hierarchy in my Windows work-in-progress build of Gaffer, and it may be leading me to Cortex, so I wanted to raise the issue sooner rather than later with the Windows Cortex PR #916 under review.
I'm a little out of my depth here but eager to learn so please bear with me if you will. Here are the steps that lead to a crash (all on Windows and I assume unique to that platform):
I suspect it is a memory access issue caused by multi-threading. At one point I was looking into the problem while my CPU was under heavy load from another program, slowing Gaffer down quite a bit, and loading GafferBot worked great. So my theory is that everything was running slowly enough that the race condition didn't cause a problem. Does that sound valid?
I'm wondering if this sparks any ideas from @johnhaddon or others who are far more steeped in the code than I am, whether it sounds like a Cortex problem or a Gaffer one, and whether you have any tips on debugging this kind of multi-threading problem. I'll continue to dig as much as I can, but wanted to get some other brains involved too. Perhaps Windows is missing a data lock somewhere?
I have this stack trace from the crash (or here: stack_trace.txt). I'm still working out how to get the Python calls included in the call stack, and I'll post that once I figure it out.