NVIDIAGameWorks / PhysX-3.4

NVIDIA PhysX SDK 3.4
https://www.nvidia.com/
2.35k stars 278 forks source link

Access violation when removing rigid body from scene (x64 Release builds only) #25

Closed marzer closed 7 years ago

marzer commented 7 years ago

Since updating to 3.4 I'm experiencing an access violation whenever I remove a dynamic rigid body from the simulation, either by changing it's eDISABLE_SIMULATION flag state, or when I delete the actor. Only occurs when I build in Release mode for x64 targets; all other combinations don't experience this issue.

The calling code seems related to touch events, but the problem occurs even if there is no simulationEventCallback set and eNOTIFY_TOUCH_FOUND/eNOTIFY_TOUCH_LOST are not included in filter pair flags. Also doesn't matter what form of CCD I'm using (or lack thereof), or if I'm using GPU rigid bodies or broad phase.

Access violation: Exception thrown at 0x00007FFE9D6B483E (PhysX3_x64.dll) in Bricks.exe: 0xC0000005: Access violation writing location 0x0000000000000002. occurred (in ScNPhaseCore.cpp, line 2171)

Stack trace: physx_stack_trace
(sidebar: if anyone can tell me a less clunky way of sharing a stack trace for Visual Studio's debugger, I'd be grateful!)

Memory dump: [https://drive.google.com/file/d/0B6cTOEVTlAMJVHhadkV4NGlZREE/view?usp=sharing]()

michellelu-nvidia commented 7 years ago

Hi, When you say "it doesn't reproduce on other configs", what other configs were you trying? Debug , checked or profile? If it doesn't reproduce on debug or checked builds, could you please look at the output from PhysX to see if it is warning about illegal operations. In debug and checked builds, many illegal operations are reported and bypassed, whereas in profile and release builds, this validation is disabled and the illegal operations can occur, leading to undefined behaviour or crashes. If there are no warning, could you please provide us a repro?

marzer commented 7 years ago

The four combinations I'm building are Debug/Win32, Debug/x64, Release/Win32 and Release/x64. I've reviewed the PhysX output and haven't seen anything.

I'll work on getting you a repro soon, but in the meantime, I've been able to narrow it down a bit: Rigid bodies with more than one child 'shape'. I do not experience the crash in any build configuration when my scene is entirely composed of rigid bodies with only one shape, but as soon as I add a rigid body with multiple shapes to the scene (in my specific case, a rigid body with 6 BoxGeometries in a 'hollow box' configuration), I get this crash when I delete it. I figured it was because they were forming collision pairs with each other (?), but ruled this out via the filter shader.

michellelu-nvidia commented 7 years ago

If the shapes are belong to the same body, in PhysX, we have filter out the collision between them so you should not need to using filter shader to avoid collision between shapes within the same body.

please let me know when you have a repro, I will do some investigation ASAP.

marzer commented 7 years ago

Thanks @michellelu-nvidia. Here's a repro.

PhysXCrashTest.zip

kstorey-nvidia commented 7 years ago

How long should I wait for this to crash? I've been running it for a while and it hasn't failed. I used latest 3.4 from GitHub. I did have to change a couple of things:

(1) I switched from /MD to /MT because PhysX 3.4 builds with /MT by default (2) I switched tools to VS2015 because I don't have VS2017

So...I could try to rebuild PhysX with /MD to see if the issue reproduces there. If not, I'm leaning towards it being a VS2017 issue.

marzer commented 7 years ago

On my end it crashes the first time a rigid body is disabled, so practically immediately. I'm also using the latest version of physx - I'll do some digging tomorrow to make sure I've built it with the right settings.

ghost commented 7 years ago

@marzer I tried to reproduce your solution on VS2017 using the newest version PhysX 3.4 (patch release @2212127 ) I updated VS2017 to 15.0.26430.12 (which is the newest version; in your project file version: 15.0.26430.6) and I installed Win10 SDK 10.0.15063 (which is the version used in your project file)

I compiled PhysX 3.4 and PxToolkit.lib now with toolset 141 (Win10 SDK 10.0.15063) on VS2017 using /MT on PhysX, PxToolkit.lib and on PhysXCrashTest.exe as Release x64 console app showed this:

PhysX.cpp:20:[ERROR] Could not find CUDA module containing function 'Saturate'. PhysX.cpp:18:[WARNING] Failed to create functional GPU dispatcher PhysX.cpp:18:[WARNING] GPU solver pipeline failed, switching to software PhysXCrashTest.cpp:121: Simulation running (view it using PVD). Press key to exit.

PhysX initialization succeeded. PhysX used software. I added a printf to Show that really balls were disabled. And not one time a crash occured.

concerning your crash in line 2171 (in ScNPhaseCore.cpp) in function: void Sc::NPhaseCore::lostTouchReports(ShapeInteraction* si, PxU32 flags, PxU32 ccdPass, PxsContactManagerOutputIterator& outputs, bool useAdaptiveForce)

I recognized, that "si" is not checked against a nullptr. So if the input value here is NULL, a crash can occur from that. I suggest to insert a checking for nullptr. However, in this case the address is @ 0x0000000000000002 which is not NULL, so its a similar invalid address as in my (now closed) post https://github.com/NVIDIAGameWorks/Blast/issues/6 where also PhysX3_x64.dll crashed. It seems to me, that its an ID or similar, which is not transformed into an object address.

However, when using the new built PhysX3_x64.dll (toolset v141) on Blast SampleAssetViewer_x64.exe (built with toolset v140), that also immediately crashes when I click LMB to destroy the object.

PhysX_crashed__onPhysX_with_toolset_v141.zip I also noticed, that in file \PhysX_3.4\Source\LowLevelAABB\src\BpBroadPhaseSapAux.h changing BP_SAP_TEST_GROUP_ID_CREATEUPDATE to 0 causes error: "id1" is an unknown identifier in line 1517: && Object!=id1

When using toolset 140 in PhysX3_x64.dll (still on VS2017 using Win10 SDK 10.0.15063) running SampleAssetViewer_x64.exe seems to be alright again. Mixing up toolset is never a good idea.

@kstorey-nvidia I think its not a VS2017 issue. In my case it was a mixing of the toolsets v141 and v140. However, PhysX DLL was built with the /MT option, but that obivously does not allow to use different toolsets in this case.
I suggest to insert checkings for nullptr in all functions, which receive pointer* parameters. Its much much easier also when each function has an exit code. So that determination of the reason for the error can be done much more efficient.

marzer commented 7 years ago

@Sheila69 Thankyou for doing this. I've verified that everything on my machine is built with the same version of the tools, but that happens to be v141 at the moment. Without being able to look at it to confirm (I'm currently dealing with some unrelated issues), I suspect I've built PhysX with /MD. Once I'm able I'll try recompiling PhysX with its vanilla configuration of /MT/v140.

kstorey-nvidia commented 7 years ago

@Sheila69 Thanks also for investigating this. There isn't a legal case where we should reach that code with a null pointer so I'm not really sold on putting null pointer checks in that code. If we are being passed a nullptr to this method, I'd much rather get a crash repro so I can debug and fix what is a potential bug than try to make this system robust to that and potentially mask a bug.

We could add asserts but this particular repro it doesn't seem to fail in debug builds. I do hope its some tools mismatch as you suggested. Regarding /MD vs /MT, PhysX should be usable with /MD - in fact, this is how Epic builds it for UE4 and it works fine for them.

@marzer if you are still able to reproduce this, would it be possible to provide the binaries (exe, dll and pdbs) and I can try to debug those? If possible, could you try wrapping the whole of ScNphaseCore.cpp in #pragma optimize("", off) in this build (assuming that still shows up the issue)? It would make debugging this issue easier because I could then rely on the values the debugger reports for stack variables.

marzer commented 7 years ago

@kstorey-nvidia Roger that, will do. I'm pushing a deadline at the moment so it won't be for a few days, though.

kstorey-nvidia commented 7 years ago

@marzer No problem. Please let me know when you have something. Thanks again - Kier

marzer commented 7 years ago

@kstorey-nvidia I've just rebuilt PhysX with v140/10.0.14393 and now there's no crashes at all, so it seems you're right to suspect a change introduced in VS2017's tools being the culprit. Do you still want a repro to diag that build environment?

AlesBorovicka commented 7 years ago

@marzer we see also very odd crashes within our internal tests with VC2017, we are now trying to add support for VC2017 atm, but there seems to be something wrong with the new compiler. So for now I would advise to stay away from VC2017.

ghost commented 7 years ago

@Borous I don't think that there's something wrong with the compiler. In March, when I started with the new VS2017 I first also thought in that direction:

https://social.msdn.microsoft.com/Forums/vstudio/en-US/7f964193-d250-4761-8fdf-f6decc87bc49/what-are-the-changes-in-toolset-v141-vs2017-compared-to-v120-vs2013-?forum=visualstudiogeneral

But then I discovered, that all the time when the problems occured, also "TIworker.exe" was active, which belongs to Windows Update (WU).

So in the "Group Policies"  Win-R  gpedit.msc  :  Computer Configuration ->  administrative templates -> Windows Components -> Windows Update -> Configure Automatic Update   there activate that and select   "Notify for download and notify for install".     (value 2) instead of the default value, which automatically runs the update. (Note: only possible on Win7PRO/Win10PRO)

Since I swithed that off, the problems did not occur again.

https://social.msdn.microsoft.com/Forums/windowsdesktop/en-US/ec846ba8-4c00-4454-91dc-8d9896473f3d/windows-update-api-?forum=windowssdk

AlesBorovicka commented 7 years ago

Well our internal UT pass on 10+ different platforms and we are not talking about internal compiler error but about a crash in code, that does happen only with optimisations enabled and VS2017 x64. Additionally the code is clearly incorrectly compiled: bool PxBuildSmoothNormals(PxU32 nbTris, PxU32 nbVerts, const PxVec3 verts, const PxU32 dFaces, const PxU16 wFaces, PxVec3 normals, bool flip) { ...... // Compute face normals const PxU32 c = PxU32(flip!=0);

In this code the value in c is a random number, instead of bool the debugger says its 240 for example and when I printf the value in c it is 240 which is very strange and there is no explanation how that could happen. Even if I change the code to: // Compute face normals PxU32 c = 0; if (flipNormal == true) c = 1; printf("c: %d", c); c: 240

So no, not related to any running process, this is incorrectly compiled.

ghost commented 7 years ago

@Borous you use the "functional notation" of the type cast. And concerning the bool yes/no the result is not really wrong, cause 240 is TRUE, cause all values !=0 are TRUE. Only 0 is FALSE. Ok, but on the 2nd part you mentioned, that normally really should not happen.

did you try this: const PxU32 c = (PxU32)(flip!=0); // using c-like cast notation

Did that also occur after a restart of VS2017? If that is the case you should submit this issue to MS Connect.

marzer commented 7 years ago

@Sheila69 Functional cast or no, it's pretty clear from the second example in @Borous's post that there's an error in compilation.

ghost commented 7 years ago

@marzer yes, the standard C11, says "When any scalar value is converted to bool, the result is 0 if the value compares equal to 0; otherwise, the result is 1."
But [..]Visual C++ implements the vast majority of features in the C++11 core language specification,[..] https://msdn.microsoft.com/en-us/library/hh567368.aspx. Not all! However here a bool result is converted to "unsigned int". The standard says: If the source type is bool, the value false is converted to zero and the value true is converted to one. So in this optimization case it might be a difference from the standard, where they optimized it for speed, cause of a known context. Maybe I'm wrong. But for speed it would make sense not always to change a non-zero value to 1, when its later only again checked against zero. However in the case with the "printf" thats not the case. So yes, that seems to be an issue for MS develop team.

I would like to see the disassembly of that part. Then I maybe can even find out more about that. It is very important which instructions were used on this. Maybe "cmp [],0 jcc mov reg, 1" instructions simply were eliminated by a faster code. And internally that maybe solves the context correctly without using the C11 standard. It would be also important to know, which value "flip" contained. Was it 240?

@Borous If it happened somewhere in the PhysX 3.4 code, please tell me, where you placed the code; I could try it on my system to review the full context.

ghost commented 7 years ago

@kstorey-nvidia

I'd much rather get a crash repro so I can debug and fix what is a potential bug than try to make this system robust to that and potentially mask a bug.

But this PhysX module is meant for using in production code, isn't it? Shouldn't it be robust as much as possible? My suggest would be to use #ifdef/#endif blocks for each such checking against NULL, then the developer can enable the robust mode for release only.

kstorey-nvidia commented 7 years ago

It is production quality code. Inputs to PhysX through the public API are validated in checked build, but these checks are omitted in release builds. This is to help weed out incorrect usage in customer code.

However, this is a function that is called internally in PhysX by other parts of PhysX. Inputs to internal functions are validated where needed using asserts. In this case, there is no assert validating that the input is not null, but it would crash if the pointer was null so the assert probably wouldn't add much value in this case. When there are internal errors in core engine code, it is usually better to identify them early, get a bug report and fix the source of the issue.

The alternative of producing undefined/incorrect results with error codes when functionality was skipped to avoid a crash when things go wrong can lead to more pain further down the road. For example, there could be an issue that would have been identified and fixed early in development but, instead, there's now a hard-to-repro crash or some noticeably incorrect behaviour that's been uncovered not long before a game is due to be released. The engine logs are full of warnings that were ignored because everything seemed to work and now you are stuck up the creek without a paddle with producers breathing down your neck for a fix.

As it stands, it seems likely that this issue was caused by a compiler bug in the VC141 tools and that there is not a bug in the code in question.

AlesBorovicka commented 7 years ago

We have atm opened ticked in MS and they investigate the VC141 issue we see. Hopefully soon we will know where the problem is.

marzer commented 7 years ago

Well, with that in mind, and that recompiling it with v140 toolset was a workable solution, I'll close this issue, since it's clearly not the fault of the PhysX team.

ghost commented 7 years ago

Did MS confirm this issue? If yes, is it solved ? If solved, was it solved after Visual Studio 2017 Update version 15.2 (26430.15) ? On which update?

AlesBorovicka commented 7 years ago

Yes, MS confirmed, we are waiting for a patch.

marzer commented 7 years ago

@Sheila69: https://developercommunity.visualstudio.com/content/problem/66047/possible-compiler-bug.html

marzer commented 7 years ago

@Borous FYI I just tried rebuilding the latest version of PhysX with the VS 15.5 preview version of the compiler released today ( _MSC_VER == 1912, _MSC_FULL_VER == 191225805) and it still has the same codegen bug. Ho-hum :(

AlesBorovicka commented 7 years ago

Hmm let me try to ask again about a workaround. If MS does not answer again, I guess we need to figure out the workaround ourselves :-(

marzer commented 7 years ago

It may be worth asking via a different forum: many of the STL and compiler team are active on Reddit, for instance (in r/cpp), you could ask one of them directly? There's a thread about VS 15.4 on there at the moment that has their attention.

ghost commented 7 years ago

I finally found a workaround!

I did two x64 builds (using optimization): one with toolset v140 and one with toolset v141. I compared the disassembly of calling the function "PxBuildSmoothNormals". I first noticed, that this function is only used in Samples.exe, not within the PhysX DLL, so it only affects the samples and not the PhysX engine itself. However, it is only called with a variable parameter in constructor "RenderMeshActor::RenderMeshActor". On all other calls the flip parmeter is always true. The disassembly confirmed all these "1" (=true) settings for this parameter. So I searched for the initialization of "RenderMeshActor". The version of the constructor, which uses the flip parameter, I only found in file PhysX-3.4/PhysX_3.4/Samples/SampleBase/PhysXSample.cpp [version (the current file in the GitHub repo) sschirm PhysX 3.4, APEX 1.4 release candidate update: @2154206 c7a9217 on Jan 6 ] There its used in some cases with/without specifying the already initialized parameter "flipWinding". Unfortunately at one point the register r13b simply is passed for this parameter, although it contained an address before. This cannot be valid. The 240 (=0F0h) was the lowest Byte of that 16byte-aligned address. So, yes. VS2017 v141 did not produce the expected encoding. Its no optimization, as I first thought.

But its wrong also in v140. However, the memory layout also is different, and so it was not causing an exception yet.

original Samples.exe on toolset v140 in func PhysXSample::createRenderObjectFromShape:

...
40180ADD 44886C2448         mov   BYTE PTR  [rsp+000000048h], r13b  << this register was not initialized ; 10th param
40180AE2 89442440           mov   DWORD PTR  [rsp+000000040h], eax ; 9th param
40180AE6 4C896C2438         mov   QWORD PTR [rsp+000000038h], r13 ; 8th param
40180AEB 48897C2430         mov   QWORD PTR [rsp+000000030h], rdi ; 7th param
40180AF0 4C89642428         mov   QWORD PTR [rsp+000000028h], r12 ; 6th param
40180AF5 4C89742420         mov   QWORD PTR [rsp+000000020h], r14 ; 5th param
40180AFA E8510D0200         call RenderMeshActor::RenderMeshActor
...

original Samples.exe on toolset v141 in func PhysXSample::createRenderObjectFromShape:

...
4016E21D 44886C2448         mov   BYTE PTR  [rsp+000000048h], r13b  ;bool flipWinding  still unintialized
4016E222 89442440           mov   DWORD PTR  [rsp+000000040h], eax
4016E226 4C896C2438         mov   QWORD PTR [rsp+000000038h], r13
4016E22B 48897C2430         mov   QWORD PTR [rsp+000000030h], rdi
4016E230 4C89642428         mov   QWORD PTR [rsp+000000028h], r12
4016E235 4C89742420         mov   QWORD PTR [rsp+000000020h], r14
4016E23A E8810D0200         call RenderMeshActor::RenderMeshActor
...

As you might notice, in both cases the QWORD content (an address pointer in this case) in r13 is also passed to 8th param (*faces32). For many cases that is also NULL, so re-using r13b from r13 is ok, but if faces32 is NOT NULL, then the lowest byte of the address in r13 is passed in r13b. (then it is e.g. 240 = 0F0h, from an 16-byte aligned memory address) But this is not safe, cause that lowest byte could also be zero, and so its not always ensured to be non-zero. No optimization. Simply an encoding failure.

in RenderMeshActor.h in class RenderMeshActor : the parameter "flipWinding" is initialized RenderMeshActor(... bool flipWinding=false );

I tried to use it uninitialized, but unfortunately that also failed.


WORKAROUND:

use intead 2 constructors:

class RenderMeshActor : public RenderBaseActor
{
public:
    RenderMeshActor(SampleRenderer::Renderer& renderer,
                    const PxVec3* verts, PxU32 numVerts,
                    const PxVec3* vertexNormals,
                    const PxReal* uvs,
                        const PxU16* faces16, const PxU32* faces32, PxU32 numFaces
                    );
    RenderMeshActor(SampleRenderer::Renderer& renderer,
                    const PxVec3* verts, PxU32 numVerts,
                    const PxVec3* vertexNormals,
                    const PxReal* uvs,
                    const PxU16* faces16, const PxU32* faces32, PxU32 numFaces, 
                                        const int FlipTrue
                    );
        RenderMeshActor(const RenderMeshActor&);
        virtual  ~RenderMeshActor();
};

NOTE: "FlipTrue" is only a dummy parameter, which must be set to "true". For debug this is ensured through assertion.

and in the implementation of the constructor:


RenderMeshActor::RenderMeshActor(   Renderer& renderer,
                                    const PxVec3* verts, PxU32 numVerts,
                                    const PxVec3* vertexNormals,
                                    const PxReal* uvs,
                                    const PxU16* faces16, const PxU32* faces32, PxU32 numFaces
                                 )
{
    bool flipWinding = false;
...
}

RenderMeshActor::RenderMeshActor(   Renderer& renderer,
                                    const PxVec3* verts, PxU32 numVerts,
                                    const PxVec3* vertexNormals,
                                    const PxReal* uvs,
                                    const PxU16* faces16, const PxU32* faces32, PxU32 numFaces, const int FlipTrue // FlipTrue is only a dummy
                                 )
{
    UNREFERENCED_PARAMETER(FlipTrue);
    PX_ASSERT(FlipTrue == 1);
    bool flipWinding = true;
...
}

in PhysXSample.cpp all calls without the dummy param simply use the "flip = false" constructor:

in line 342: RenderMeshActor meshActor = SAMPLE_NEW(RenderMeshActor)(getRenderer(), data.mVerts, data.mNbVerts, data.mVertexNormals, data.mUVs, indices, NULL, nbTris); in line 491: RenderMeshActor meshActor = SAMPLE_NEW(RenderMeshActor)(getRenderer(), verts, nbVerts, verts, NULL, indices, NULL, nbTris); in line 2015: shapeRenderActor = SAMPLE_NEW(RenderMeshActor)(renderer, vertices, totalNbVerts, normals, UVs, faces, NULL, totalNbTris); in line 2109:shapeRenderActor = SAMPLE_NEW(RenderMeshActor)(renderer, vertexes, nbVerts, NULL, NULL, indices_16bit, NULL, nbFaces);

the one with "true" : in line 2036: shapeRenderActor = SAMPLE_NEW(RenderMeshActor)(renderer, verts, nbVerts, NULL, NULL, faces16, faces32, nbTris, true);
(as you see, here also "faces32" is passed, from this value the 240 was coming)

Samples.exe on toolset v141 in func PhysXSample::createRenderObjectFromShape:` now the disassembly for the "true" case using v141 looks like this:

...
401846E5 C744244801000000       mov   DWORD PTR  [rsp+000000048h], 000000001h  <<< true  (FlipTrue param)
401846ED D1EA               shr edx, 1 
401846EF 89542440           mov   DWORD PTR  [rsp+000000040h], edx
401846F3 488B9350030000         mov  rdx,  QWORD PTR [rbx+000000350h]
401846FA 48897C2438         mov   QWORD PTR [rsp+000000038h], rdi
401846FF 48894C2430         mov   QWORD PTR [rsp+000000030h], rcx
40184704 498BCA             mov  rcx, r10
40184707 48897C2428         mov   QWORD PTR [rsp+000000028h], rdi
4018470C 48897C2420         mov   QWORD PTR [rsp+000000020h], rdi
40184711 E82AAA0000         call RenderMeshActor::RenderMeshActor
...

for toolset v141 in RendererConfig.h I also added:

#pragma warning(disable:5026) 
#pragma warning(disable:5038) 
#pragma warning(disable:4623) 
marzer commented 7 years ago

@Borous FYI: https://www.reddit.com/r/cpp/comments/764ats/visual_studio_2017_version_155_preview_released/doe2hqu/

AlesBorovicka commented 7 years ago

Interesting, I will sync the daily build too and try it. Thanks for the info!

AlesBorovicka commented 7 years ago

So, I have tried the 15.5 preview 1 build: https://www.visualstudio.com/en-us/news/releasenotes/vs2017-Preview-relnotes This one actually fixes the reported issue. With this build I was able to run all our internal tests and all of them passed.

marzer commented 7 years ago

Interesting; i tried 15.5 and still got my original issue. Perhaps there's more to it? O_o

Admittedly my testing was not overly thorough (I didn't experiment with any compiler flags etc). I'll have a play around later in the week.

AlesBorovicka commented 7 years ago

Might be the case, but so far for us internally the 15.5 looks promising.

marzer commented 7 years ago

@Borous Good news: After rebuilding PhysX again with 15.5 I'm no longer getting the crash, so it appears that is fixed after all. Hooray!

Not-so-good news: Now I'm experiencing a weird new behavior, again exclusive to x64/Release; everything is being constrained to a cylinder along the X axis. It's really quite bizarre. Here's some videos to show what I mean:

Correct behavior: x86/Release Weird behavior: x64/Release

PhysX was compiled without many changes to the configuration; the only changes made were the following flags (to bring it in line with the rest of the binaries in the project): /MD /std:c++latest /permissive- /bigobj /D_ITERATOR_DEBUG_LEVEL=0 /D_SCL_SECURE_NO_WARNINGS (as well as some additional /wdXXXX flags to counter warnings raised by /permissive-). It occurs with or without GPU rigid bodies. Any thoughts? O_o

AlesBorovicka commented 7 years ago

Good, the the crash is gone!

Do you see the difference also with VC14? The behaviour looks strange.

AlesBorovicka commented 7 years ago

I just tried to run the SnippetHelloGRBWorld (running on CPU no PhysXGpu dll), that does a bit similar to what is on the video. Looking ok x64 release VC15.5 preview. Can you also give it a shot if its ok for you?

marzer commented 7 years ago

The snippet works alright either way, which makes sense; I've since learned the weird behavior is being caused by some bad inlining. Not sure yet if it's my code or bad codegen, but it's definitely happening outside of PhysX :)