NVIDIAGameWorks / PhysX-3.4

NVIDIA PhysX SDK 3.4
https://www.nvidia.com/
2.35k stars 274 forks source link

GPURB - Rare soft-crash when modifying many rigid bodies at once #29

Closed marzer closed 7 years ago

marzer commented 7 years ago

I've been playing around with GPU rigid bodies in a sandbox application, and every so often I experience a soft-crash when modifying many rigid bodies at once (between simulation ticks). You can see an example of it occurring in this video (I get it to occur at 1:21, but watch from at least 1:00 for context). It doesn't happen very often (say, 1 in 30 times doing the above), but when it does it brings down PhysX entirely.

You can see the error log spam in the error console after it occurs, but the very first errors reported are:

[PxgNarrowphaseCore.cpp:1039] memcpy failed fail! 2  700
[PxgNarrowphaseCore.cpp:1049] GPU cudaMainGjkEpa or prepareLostFoundPairs kernel fail!
  700
[PxgNarrowphaseCore.cpp:1274] memcpy failed fail!
  700
[PxgCudaUtils.h:53] SynchronizeStreams failed
[PxgCudaSolverCore.cpp:1021] GPU contactConstraintPrepare fail to launch kernel!!

The 'grenade explosion' is a fairly simple loop over all dynamic rigid bodies and adding force based on their distance and direction from the point of origin (source). I haven't seen the same/similar issue with GPU rigid bodies disabled.

The relevant code is linked above, but [here's the sandbox] I'm using in the video.

kstorey-nvidia commented 7 years ago

Thanks for reporting. I'll try my best to repro with your sandbox and report back if/when we have any news.

marzer commented 7 years ago

Awesome. That build doesn't have any PDB's since it's one I uploaded to share with friends, but I've made the binaries for a more current one, and the associated PDB's, available for you to download [here]. You'll still need the assets from the version linked above, but these should just drop in over the top of the binaries in x64/Release/.

kstorey-nvidia commented 7 years ago

I've gotten reasonably close to debugging the issue but I've hit a wall that perhaps you can help me with @marzer . I've tracked it down to a particular block of GPU code that is crashing due to an illegal memory access. I can't debug this case because it's too complex so becomes unusably slow if I attempt to use a GPU debugger.

However, what I've done in the past is to introduce validation in the code to try and throw up errors to figure out which access is bad. To do this, I can introduce checks followed by printf in the CUDA code to report bad values. However, your application doesn't seem to display the results of printf - they're not visible on either your internal console or the output in visual studio. I know I can use OutputDebugString in CPU code but that's not available in CUDA kernels.

I'll try to create a repro for this issue outside of your demo but if you could provide a build of your application that where printf ends up in the debug output or displays a console window containing output from printf, that would be massively helpful.

Thanks

Kier

marzer commented 7 years ago

No worries. I'll get you an updated build with these changes a bit later today. Thanks Kier.

marzer commented 7 years ago

Ok, here's an updated build (source).

Having said all that though, today I saw the following in the error log: [PxgNarrowphaseCore.cpp:1068] Force buffer overflow detected, please increase its size in the scene desc! Indeed, increasing the size of sceneDesc.gpuDynamicsConfig.forceStreamCapacity by a factor of 2 stops the issue from occuring when the game is run using the saved world state in this build, but the cataclysmic failure state PhysX enters into is pretty full-on. Perhaps a dynamic growth strategy could be used (or opted-in to) for that particular buffer?

kstorey-nvidia commented 7 years ago

Thanks for the quick turnaround. I'll keep on with this today and hopefully I'll find the problem soon.

The memory management in GRB is still WIP and a bit rough around the edges. For the short-term, the solution is going to be warnings that a buffer isn't large enough and, rather than crashing, dropping contacts/constraints so you might see things drop through the floor or explode but at least the game won't crash. Many of the potential overflows are already handled correctly but there are a few that haven't been handled correctly still due to that code still being actively worked on and time constraints for release schedules. This isn't too dissimilar to some of the earlier approaches when using the SPUs on PS3.

Long-term, we will be working on improving resource management and reducing the need for users to define appropriately-sized buffers for the user to work in. However, while making these buffers resize would be fairly straightforward with a CPU-only engine, it is actually quite challenging with a GPU engine because the obvious ways to implement this would introduce additional sync points that would hurt performance. The ideal solution would be something that can be handled entirely within the GPU code or if we can somehow estimate the upper bound of memory requirements based on data we have available on the CPU (e.g. number of colliding pairs etc.) - that sounds feasible to me.

However, I'm fairly convinced this rare soft crash isn't one of those buffers overflowing because I increased them all by a factor of 2x in your game and I was still able to reproduce the issue. The crash also occurs in a kernel that is known to handle overflows correctly and I was able to reproduce the issue in a version of the kernel that disabled outputting data. Will let you know once I have more information.

marzer commented 7 years ago

Yup, you're right to suspect it's not the buffer overflowing; I just experienced the crash even with a significantly higher value for sceneDesc.gpuDynamicsConfig.forceStreamCapacity (all it did was eliminate the warning). Ho-hum, the search continues. Good luck!

kstorey-nvidia commented 7 years ago

Just a quick update - my printf has already told me what's wrong :). I have no idea why it's wrong or how to fix it yet, but at least I now know which 20 line function is misbehaving.

kstorey-nvidia commented 7 years ago

OK. The source of the issue is a NAN transform, which leads to a comparison between -FLT_MAX and NAN returning false and an index being left uninitialized, and then that index is used to index into the set of vertices. It's easy to ensure it doesn't crash - just initialize this variable to something sensible (I've initialized it to 0xffffffff at the moment to ensure it's a 100% crash when this happens). Now trying to trace back the source of the NAN to see if there's something being generated internally or something coming from your end. Wherever the bad number originates, I'll make sure that this code is robust to this case.

marzer commented 7 years ago

Hmmmn. Looking at my 'explosion' code I suppose it's possible that very rarely a rigid body's position is within FLT_EPSILON distance of the explosion epicentre, so the length calculation for the attenuation would end up as a division by zero. Since I'm building with floating point exceptions off and floating point mode set to /fp:fast, this would just happily get passed through to PhysX.

I'll add a check in that routine and see what I come up with.

kstorey-nvidia commented 7 years ago

OK. Latest update. It looks like the NAN is coming from within your code. The force applied by the explosion logic is NAN. That leads to velocity becoming NAN, which then gets integrated into a NAN transform, which in turn leads to the transformed verts used in the narrow-phase being NAN. The result of all of this is that some code that is not robust to NANs ends up returning a garbage value as an index, which leads to a bad memory access. I'll fix up that code to make sure it won't crash if it gets given a bunch of NANs. However, I'm afraid that the object in question will have blasted off to infinity in this case and will probably take objects with it. Still, it shouldn't crash.

The NAN enters PhysX through the following callstack::


    PhysX3_x64.dll!physx::Sc::BodyCore::addSpatialAcceleration(physx::shdfnd::Pool<physx::Sc::SimStateData,physx::shdfnd::ReflectionAllocator<physx::Sc::SimStateData> > * simStateDataPool, const physx::PxVec3 * linAcc, const physx::PxVec3 * angAcc) Line 191   C++
    PhysX3_x64.dll!physx::NpRigidBodyTemplate<physx::PxRigidDynamic>::addSpatialForce(const physx::PxVec3 * force, const physx::PxVec3 * torque, physx::PxForceMode::Enum mode) Line 370    C++
>   PhysX3_x64.dll!physx::NpRigidDynamic::addForce(const physx::PxVec3 & force, physx::PxForceMode::Enum mode, bool autowake) Line 275  C++
    Yggdrasil.dll!Yggdrasil::PhysicsManager::Impl::RadialForce(const mathfu::Vector<float,3> & center, float magnitude, float radius, bool wobble) Line 964 C++
    Yggdrasil.dll!Yggdrasil::PhysicsManager::RadialForce(const mathfu::Vector<float,3> & center, float magnitude, float radius, bool wobble) Line 1164  C++
    Bricks.exe!Explosive::Explode() Line 1165   C++
    Bricks.exe!ProjectilePool<Grenade,10>::ForEach(const std::function<void __cdecl(Projectile *,bool &)> & func) Line 269  C++
    Bricks.exe!RemoteMineWeapon::OnSecondaryFireStarted() Line 1420 C++
    Bricks.exe!Player::ButtonDown(Player::Buttons button) Line 1637 C++
    Bricks.exe!Yggdrasil::Event<enum Player::Buttons>::operator()(Player::Buttons && <args_0>) Line 159 C++
    Bricks.exe!Yggdrasil::ButtonMapper<enum Player::Buttons>::OnButtonChanged(Yggdrasil::HumanInterface * iface, unsigned int nativeKey, bool down) Line 75 C++

We have sanity checking of user input when using either debug or checked build config of PhysX. It may be a good idea during development to use the checked build as it will warn about illegal values being passed to PhysX through the public API. The checked build is fully optimized so it should be usable in your game.

marzer commented 7 years ago

Ah, you beat me to it. Awesome! After all that it was some lazy math to blame. Oh well, I'm glad to have helped isolate an edge case, at the very least.

Also, top tip on the checked build. I hadn't actually considered it until just now; I just habitually went into 'connect-the-dots' mode with Debug and Release when adding PhysX to the project way back when, and didn't really give it any thought.

TL,DR: I'm a noob.

kstorey-nvidia commented 7 years ago

No problem. Glad you tracked it down on your side. The next release will have some changes that should mean that this code doesn't crash when a body has a nan transform. Best of luck with your game. Look forward to seeing how it shapes up in the future.