NVIDIAGameWorks / PhysX

NVIDIA PhysX SDK

Can PhysX take advantage of 2+ cards with NVLink? #364

Open geocosmite opened 3 years ago

geocosmite commented 3 years ago

Are there any advantages for PhysX 4.0 (or later versions) in cases where two or more RTX 3090 boards are connected via NVLink?

Or will the PhysX execution for a given process be limited to just one card even when two or more are connected with NVLink?

I am particularly interested in knowing whether a single PhysX instance will be able to address significantly more GPU RAM with 2+ cards that are connected by NVLink compared to using a single card.

My application involves simulating rigid body deposition of granular media (think sand in an hourglass) and I’m running into serious RAM limitations with my existing single RTX 2080 Ti card.

Thanks in advance for your time in answering my question…

kstorey-nvidia commented 3 years ago

PhysX doesn't support multiple GPUs simulating a single scene. It can make use of multiple GPUs by assigning different scenes to different GPUs. Using multiple GPUs to simulate a single scene is something we would love to provide support for, but it is still an open research topic for us.
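
For illustration, the multiple-scenes approach might look roughly like this (a minimal sketch against the PhysX 4.x API; the makeGpuScene helper, the device ordinals, and the pre-existing dispatcher are assumptions, and error handling is omitted):

    // Sketch: one GPU-accelerated scene per CUDA device.
    // Assumes cuInit(0) has already been called once.
    #include <cuda.h>
    #include "PxPhysicsAPI.h"
    using namespace physx;

    PxScene* makeGpuScene(PxPhysics* physics, PxCpuDispatcher* dispatcher, int cudaOrdinal)
    {
        CUdevice dev;
        CUcontext ctx;
        cuDeviceGet(&dev, cudaOrdinal);   // pick the GPU by ordinal
        cuCtxCreate(&ctx, 0, dev);        // driver-API context on that GPU

        PxCudaContextManagerDesc cudaDesc;
        cudaDesc.ctx = &ctx;              // hand our externally created context to PhysX
        PxCudaContextManager* cudaMgr =
            PxCreateCudaContextManager(physics->getFoundation(), cudaDesc);

        PxSceneDesc sceneDesc(physics->getTolerancesScale());
        sceneDesc.gravity = PxVec3(0.0f, -9.81f, 0.0f);
        sceneDesc.cpuDispatcher = dispatcher;
        sceneDesc.filterShader = PxDefaultSimulationFilterShader;
        sceneDesc.cudaContextManager = cudaMgr;
        sceneDesc.flags |= PxSceneFlag::eENABLE_GPU_DYNAMICS;
        sceneDesc.broadPhaseType = PxBroadPhaseType::eGPU;
        return physics->createScene(sceneDesc);
    }

    // Each scene then simulates its own, independent subset of the work:
    // PxScene* sceneA = makeGpuScene(gPhysics, gDispatcher, 0);
    // PxScene* sceneB = makeGpuScene(gPhysics, gDispatcher, 1);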

Could you quantify the level of complexity you want to simulate, e.g. how many bodies? For simulating granular materials, a particle-based approach (as is being cooked up in PhysX 5) might be more suitable.

If you can provide more details about your use-case, we might be able to suggest some potential courses of action for you to try.

geocosmite commented 3 years ago

Thank you very much for your rapid, informative response!

Regarding the number of bodies, ideally we'd love to be able to deposit hundreds of thousands of clasts with complex shapes. The simulated rubble pack below includes ~22K clasts whose geometries are derived from microCT scans.
[Image: salt_deposit2]

We use V-HACD to create agglomerated collision hulls that do a pretty decent job of approximating the triangular surface meshes that represent clast geometries. In some cases over 100 convex hulls are used to represent a single clast, as in this example: [Image: hulls]

In this particular application, we are simulating the formation of rubble deposits associated with the collapse of chambers in a salt dome that is being used to store radioactive waste. The main objective is to use the pore space that we simulate as input for computational fluid dynamics codes that simulate the potential for transport of radioactive waste through the pore fluids. I can deposit around 9K of these grains using my RTX 2080 Ti before running out of memory. I ended up falling back to the CPU on a dual Intel Xeon E5-2697 v4 workstation, and the run took about 16 days to finish. I estimate that it would have taken about a quarter of that time or less if I could have used a reasonably high-end GPU.

We are extremely interested in the new particle solution coming in PhysX 5, but more as a means of simulating deposition of sediments in moving fluids such as water and air (the particles would represent the moving fluids). An important aspect of what we are trying to do is to maintain a faithful representation of the complex geometries of the clasts, so a particle representation, even when "welded" together, isn't optimal.

Below are a few more animations and images that I hope will give you a sense of what we are using PhysX to simulate.

A simple simulation where we drop sand grains into a container and then induce tighter packing arrangements by shaking and dropping an invisible piston: https://geocosm.net/wp-content/uploads/2020/04/shakeLR.mp4

Clipped regions from the interior of salt rubble deposits with variable extents of induced grain rearrangement: [Image: salt_rubble_cuboids]

Vertical cross sections through the rubble deposits above: [Image: salt_pack_sections]

Here is a case where we froze and removed clasts from the scene in an attempt to get around the GPU RAM constraints. As you will see at the end we failed! https://geocosm.net/wp-content/uploads/2020/11/cybcli_dep.mp4

Before I sign off I'd like to express my profound appreciation to NVIDIA for making PhysX accessible to groups like ours. It is an amazingly powerful system that has many fascinating applications that go beyond games. And thanks very much to you and your colleagues as well for any advice that you have to offer in our quest to create accurate and realistic depictions of deposits using PhysX.

kstorey-nvidia commented 3 years ago

What parameters are you using in PxgDynamicsMemoryConfig in PxSceneDesc::gpuDynamicsConfig?

The default settings in these buffers are appropriately sized for small/medium-sized simulations. For the Kapla demo that ships with the SDK, we had to increase these buffer sizes to accommodate more contacts.

If you haven't adjusted these settings, just try uniformly scaling all the values in this struct besides heapCapacity by some constant to see if that solves your problem (e.g. try multiplying them by 8). The default settings only use a relatively small percentage of the VRAM available on a 2080 Ti before dropping contacts.

There should be error messages telling you which buffers overflowed when you get an overflow and contacts are dropped. This should also give you an estimate of how much memory the system thinks it would need for your simulation to avoid overflowing.
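
For reference, those messages arrive through the SDK's error callback; a minimal sketch of capturing them (the OverflowLogger name and the logging sink are illustrative):

    // Sketch: route PhysX warnings/errors (including GPU buffer overflow
    // reports) to stderr. Pass an instance to PxCreateFoundation.
    #include <cstdio>
    #include "PxPhysicsAPI.h"

    class OverflowLogger : public physx::PxErrorCallback
    {
    public:
        void reportError(physx::PxErrorCode::Enum code, const char* message,
                         const char* file, int line) override
        {
            std::fprintf(stderr, "[PhysX %d] %s (%s:%d)\n",
                         int(code), message, file, line);
        }
    };

    // static OverflowLogger gErrorLogger;
    // PxFoundation* foundation =
    //     PxCreateFoundation(PX_PHYSICS_VERSION, gAllocator, gErrorLogger);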

geocosmite commented 3 years ago

Below I list the response from our software engineer who is leading the implementation effort:

First I want to mention that the actual online documentation for PhysX 4.1 hasn't been updated correctly, so if you compare what is in the online docs to what I list below, it will be slightly different. A few parameters were renamed slightly, and some of the default values in the documents differ from what is in the actual current code.

Looking at our code, the parameter that you can configure in DepositRunOptions.xml does exactly what he mentions. NVIDIA only mentions uniformly scaling all the parameters except heapCapacity, which is what the code currently does. The code comment notes that that buffer scales dynamically; I did bump it by a hard-coded fixed amount, though.

Here are the default parameters, pulled directly from what NVIDIA sets before we change the values.

DEFAULT ORIGINAL VALUES

struct PxgDynamicsMemoryConfig
{
    PxU32 constraintBufferCapacity;    //!< Capacity of constraint buffer allocated in GPU global memory
    PxU32 contactBufferCapacity;    //!< Capacity of contact buffer allocated in GPU global memory
    PxU32 tempBufferCapacity;        //!< Capacity of temp buffer allocated in pinned host memory.
    PxU32 contactStreamSize;        //!< Size of contact stream buffer allocated in pinned host memory. This is double-buffered so total allocation size = 2 * contactStreamSize * sizeof(PxContact).
    PxU32 patchStreamSize;            //!< Size of the contact patch stream buffer allocated in pinned host memory. This is double-buffered so total allocation size = 2 * patchStreamSize * sizeof(PxContactPatch).
    PxU32 forceStreamCapacity;        //!< Capacity of force buffer allocated in pinned host memory.
    PxU32 heapCapacity;                //!< Initial capacity of the GPU and pinned host memory heaps. Additional memory will be allocated if more memory is required.
    PxU32 foundLostPairsCapacity;    //!< Capacity of found and lost buffers allocated in GPU global memory. This is used for the found/lost pair reports in the BP.

    PxgDynamicsMemoryConfig() :
        constraintBufferCapacity(32 * 1024 * 1024),
        contactBufferCapacity(24 * 1024 * 1024),
        tempBufferCapacity(16 * 1024 * 1024),
        contactStreamSize(1024 * 512),
        patchStreamSize(1024 * 80),
        forceStreamCapacity(1 * 1024 * 1024),
        heapCapacity(64 * 1024 * 1024),
        foundLostPairsCapacity(256 * 1024)
    {
    }
};
Our implementation is listed below, where gStrGeneralPhysxOptions.iGPODynBufferMultiplier is defined in a DepositRunOptions.xml file that we pass to the code at run time; our current default is 50.

    //memory settings for GPU
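    // Note: these capacity fields are PxU32 (32-bit), so the arithmetic below can
    // overflow for very large multipliers (e.g. 32 * 1024 * 1024 * 128 wraps past
    // 2^32 to 0); something to rule out when experimenting with values above 50.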
    sceneDesc.gpuDynamicsConfig.constraintBufferCapacity = 32 * 1024 * 1024 * gStrGeneralPhysxOptions.iGPODynBufferMultiplier;
    sceneDesc.gpuDynamicsConfig.contactBufferCapacity = 24 * 1024 * 1024 * gStrGeneralPhysxOptions.iGPODynBufferMultiplier;
    sceneDesc.gpuDynamicsConfig.tempBufferCapacity = 16 * 1024 * 1024 * gStrGeneralPhysxOptions.iGPODynBufferMultiplier;
    sceneDesc.gpuDynamicsConfig.contactStreamSize = 1024 * 512 * gStrGeneralPhysxOptions.iGPODynBufferMultiplier;
    sceneDesc.gpuDynamicsConfig.patchStreamSize = 1024 * 80 * gStrGeneralPhysxOptions.iGPODynBufferMultiplier;
    sceneDesc.gpuDynamicsConfig.forceStreamCapacity = 1 * 1024 * 1024 * gStrGeneralPhysxOptions.iGPODynBufferMultiplier;
    sceneDesc.gpuDynamicsConfig.foundLostPairsCapacity = 256 * 1024 * gStrGeneralPhysxOptions.iGPODynBufferMultiplier;
    sceneDesc.gpuDynamicsConfig.heapCapacity = 64 * 1024 * 1024 * DEFAULT_GPU_HEAP_CAPACITY_MULT; // set at 30, but this buffer should grow dynamically

Note that sceneDesc is declared as PxSceneDesc sceneDesc(scale); so it is the PxSceneDesc that the NVIDIA person mentions above.

In the spring, when I was running some experiments on this, there didn't seem to be an intuitive linear relationship between this multiplier and how much memory could be used. Since we added the ability to run the simulation in a deterministic mode late in the project, that might now help in finding a relationship between the memory scale value and how much memory can be utilized. It is possible that, depending on how the grains dropped and aligned, a greater number of contacts and interactions caused different memory usage from run to run. I know at points I bumped the value from 50 to higher values, but I just couldn't see a relationship; it almost felt like some other buffer, value, or list was overflowing. If I remember correctly, most or all of the code relating to those memory buffers is in the proprietary NVIDIA GPU DLL, so source code isn't available, making it very difficult to find out what ultimately was the limiting factor.

I hope this is the information you were looking for. From everything I had read and could find, the buffers above should be the way to increase GPU memory utilization, and they did seem to increase GPU memory usage, but it felt like something else was possibly limiting the ability to fully utilize GPU memory. The closed-source NVIDIA GPU DLL makes it difficult to determine the exact cause of the memory limitations.

kstorey-nvidia commented 3 years ago

"I know at points I bumped the value from 50 to higher values but I just couldn't see a relationship at that point"

I would not increase heapCapacity, but the others should be increased. heapCapacity is the page size of our custom heap allocator; making it larger might make CUDA allocations start failing, because dynamically-allocated data will be sub-allocated from enormous contiguous blocks of memory.

It would be great if you could help me reproduce this issue. It is possible that there are some edge cases that have not been handled, but I have locally simulated scenes with 250k+ rigid bodies, and bumping up the values in gpuDynamicsConfig does allow those scenes to simulate on GPU. Bumping these numbers up by 50x is pretty large, and I'm not quite sure why you would need that much memory based on the video you showed earlier. It would be great to understand what's causing memory pressure in this scene.

What are the sizes of the shapes, and what are the contact offsets?

TheBubDev commented 3 years ago

Hi kstorey, (TL;DR version is at the bottom.) My name is Jon; I did some work for the original thread starter last spring. Let me give you a little backstory first. A good amount of work was done before I came on board to try to get the PhysX 4.x code working in the simulation, so there are some blank spots in my knowledge about how certain decisions were made by the previous developers creating the collision meshes, etc. Also, I haven't worked on the code since spring, so I may recall some of this information incorrectly since I'm a little rusty. I come from a gaming background: a proprietary engine back in my Disney Interactive days, and then some UE4 more recently. I was the guy at work telling artists and modelers to make their materials smaller and their visual and collision meshes simpler (I was that guy).

These meshes are much more complicated than anything I had previously worked with (in game physics), and also very small: sub-millimeter. The simulation is set to run at centimeter scale; I couldn't find any references to running PhysX at a smaller scale than centimeters, and I wasn't sure what would happen at a smaller setting, so I stuck with centimeter scale. Also, there was some previous upscaling of the meshes (done prior to my arrival) that moved some sizes into that range.

Since the GPU has a 64-vertex limit but the grains are very detailed, each grain has been broken up into multiple convex hulls with a maximum of 64 points each (this was done prior to my involvement; it seems to work, but I don't know the methodology used to arrive at that representation of the data). These multiple meshes are then cooked and attached to each actor, roughly as sketched below. Some simple grains (actors) have only one mesh, but I've seen grains that include 33. So the vertex count for individual grains in the simulation can run anywhere from 64 to about 2k points.
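
For reference, cooking and attaching one such hull might look roughly like this (a sketch; gCooking, gPhysics, gMaterial, grainActor, and the vertex arrays are placeholders):

    // Sketch: cook one V-HACD hull as a GPU-compatible convex mesh and attach
    // it to the grain's actor; a multi-hull grain repeats this per hull.
    PxConvexMeshDesc hullDesc;
    hullDesc.points.count  = numHullVerts;      // hull vertices from V-HACD
    hullDesc.points.stride = sizeof(PxVec3);
    hullDesc.points.data   = hullVerts;
    hullDesc.flags = PxConvexFlag::eCOMPUTE_CONVEX | PxConvexFlag::eGPU_COMPATIBLE;
    hullDesc.vertexLimit = 64;                  // GPU rigid bodies require <= 64 vertices

    PxConvexMesh* hull = gCooking->createConvexMesh(
        hullDesc, gPhysics->getPhysicsInsertionCallback());

    PxShape* shape = PxRigidActorExt::createExclusiveShape(
        *grainActor, PxConvexMeshGeometry(hull), *gMaterial);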

One of the things that is very different in this simulation versus what I've seen in game development is that in most game physics usage, objects are allowed to settle and sleep fairly aggressively, while collision geometry is kept as simple as possible: OBBs, spheres, and custom-modeled meshes. During this simulation, the grains are stacked inside of a cylinder, and at certain points the cylinder is shaken both horizontally and vertically, as well as compressed from above by a piston. At these times every actor in the simulation becomes active with multiple contact points. I also noted that as more and more objects are stacked vertically, most objects don't move into a sleep state. I can't remember the option, but I tried various settings; one was mentioned as experimental but better for vertical stacking, and the results didn't appear noticeably different. The tolerances for the original design are very tight so the grains can compact, while also trying to keep interpenetration of objects to a minimum.
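
(If the experimental option mentioned above was the SDK's stabilization pass, which the documentation flags as experimental but helpful for stacks, it is enabled per scene; a sketch, assuming that is the setting in question:)

    // Sketch: enable the (experimental) stabilization pass for deep stacks,
    // and tune sleep behavior per body so settled grains can actually sleep.
    sceneDesc.flags |= PxSceneFlag::eENABLE_STABILIZATION;
    // body->setSleepThreshold(t);          // energy below which a body may sleep
    // body->setStabilizationThreshold(t);  // energy below which stabilization applies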

I'm not completely familiar with the total range of sizes the grains could be; it varies depending on the types of sand grains scanned. The examples I ran while getting the sim up to speed were in the 0.01 mm to 1 mm range. These actual sizes can be scaled at run time; in most cases a scale factor of 10 was used, so the simulated object sizes ran from about 0.1 mm to 10 mm on average (although these objects were also subdivided to stay under the 64-point GPU limit). I looked into the default value for contact offset, and if my notes are right, I believe the default for PhysX was 2.0f when I looked in the debugger trying to decide what to try; the last value I set for contact offset was 0.25f, which is also user-configurable. I made these values user-configurable because it was a bit nebulous trying to decide what values to use while trying to get everything stable and each actor as close as possible. I believe my thinking at the time was that since 2.0f was being used for centimeter scale, and our objects were closer to mm scale, 1/10 the size of 2 would be the range to try, so I ended up settling on 0.25f after some trial and error. In trying to minimize interpenetration of objects, I changed rest offset to a small non-zero number, 0.002f; I'm not sure if this is problematic. At one point I also tried scaling contact offset based on the individual object's size (radius), but I couldn't tell whether I was gaining anything by making the value a function of the object size.

I'm building a new version for the original thread starter that sets heapCapacity back to the default. What you mentioned about the page size made total sense; I had assumed from the name that heapCapacity was just the initial size and that it would grow by another internally set amount. Once I've sent new exe files with the default heapCapacity restored, if the problem still persists, I've suggested to them that we put together a test case for you. I think the only downside is that it can take many hours even running on GPU (although you probably have the latest and greatest 3090 :) ), so maybe it will be quicker for you.

TL;DR version. Original object sizes used during development: 0.01mm to 1.0mm, with some larger and smaller outliers

Collision objects with fairly high poly count: 64 to 2k vertices

Scaling applied to the above object meshes, usually 10x: so simulated object sizes of 0.1mm to 10mm

Contact offset currently used: 0.25f (a guess: since 2.0 was the default value at cm scale, something around 0.2 for mm-scale objects?)

Rest offset currently used: 0.002f

Changing heapCapacity back to the default

If the problem still persists, potentially package up a test case for you to run.

I do have one question: is there any chance that the CUDA PhysX DLL will be open-sourced in the future? That would make tracking down issues like this so much simpler. Probably not, but I figured I would ask.

If that doesn't answer your questions, let me know, or tell me if you see mistakes in the values I've used. Most were just guesses, so it won't hurt my feelings if you think other values will work better.

geocosmite commented 3 years ago

A couple of quick comments:

[Image: gpu2]

kstorey-nvidia commented 3 years ago

Thanks for the details. This is interesting and I think I have some understanding of what is going on.

You are simulating on a tiny scale. I am going to leap to an assumption that you must be using very small time-steps to simulate this case stably.

Have you tried significantly reducing contact offset below the level you currently have it set to? Let's assume that instead of simulating at 60Hz, you are simulating at 6kHz: instead of using a 1-2cm contact offset, you could use a 0.01-0.02cm offset, which would be something in the region of 10-20x smaller than the value you are currently using. This could potentially reduce the number of contact pairs being generated significantly, but should still behave quite stably provided the time-steps used are sized appropriately.

More detailed explanation below

One of the problems contributing to the large memory usage will be the size of the contact offsets relative to the size of the shapes, but there may be some setting adjustments we can make that will improve this.

Contact offset is used to improve stability of piles/stacks. It allows PhysX to generate contacts between shapes that are separated by a small distance. These contacts do not apply any force unless the velocities of the bodies involved would result in the shapes being penetrated in the next time-step. This aims to ensure that shapes come to rest in a stable state rather than jittering between a touching/separated state.

Contact offsets are defined per shape and the sum of the shapes' contact offsets are used to determine the distance at which contacts begin to be generated.

A 2cm offset simulating objects that are roughly 10cm in diameter means that a stack of these objects doesn't usually pick up contacts with any shapes that aren't realistic candidates for collision.

A 2cm offset simulating objects that are roughly 0.1cm in diameter means that a stack of these objects will generate contacts between a body at the bottom of the stack and also many bodies several layers above in the stack. Most of these contacts won't do anything besides consume memory and compute resources to calculate them because they are too separated relative to velocities and time-step to apply any force.

The default values for contact offset are appropriate for roughly 30-60Hz simulations using Earth gravity. A gravity of 9.8 at a 30Hz simulation rate would see the object in question move by slightly over 1cm from rest in a single frame. The default contact offset is 2cm, which works well for 30Hz simulation and is arguably a little larger than it really needs to be for 60Hz simulation, where objects will only move ~2.7mm from rest in a single step. There's more to contact offset than just making sure that objects don't jitter on surfaces, and having a larger offset is generally a good thing for simulation stability.

If we reduce the time-step such that it is 10x smaller, we can also reduce the contact offset by an equivalent amount and still achieve the same level of simulation stability because the relative distance the shapes will move in a single frame will also be reduced. When simulating very small objects, reducing time-steps to get stable behavior will be a necessity. Reducing contact offset is not strictly required, but doing so should reduce the number of pairs generated between the tiny shapes and therefore improve both performance and memory footprint.
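
That scaling rule expressed as code (a sketch; scaledContactOffset is an illustrative helper, with lengths in centimeters to match the defaults above):

    // Sketch: shrink the default contact offset in proportion to the time-step.
    // 60 Hz -> 2 cm (the default); 6 kHz -> 0.02 cm, i.e. 100x smaller.
    float scaledContactOffset(float dt)
    {
        const float defaultDt     = 1.0f / 60.0f;  // baseline rate for the defaults
        const float defaultOffset = 2.0f;          // default contact offset, in cm
        return defaultOffset * (dt / defaultDt);
    }

    // e.g. shape->setContactOffset(scaledContactOffset(1.0f / 6000.0f)); // ~0.02 cm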

TheBubDev commented 3 years ago

Thanks for the additional information. You are correct: we are running small time steps. With the test data I initially started at 100Hz and could observe visible instability; 500Hz seems to be about where it starts to become stable. I was definitely trying to find the largest time step at which the simulated objects remained stable, to maximize performance. Before your comment above I was looking at this as a performance/simulation-stability trade-off, but I will now definitely consider it a performance/stability/memory balance. I have all of these values exposed in configuration files for the simulation that users can adjust. I'll look closer at the numbers we are using, taking into account the information you provided above.
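
(For reference, stepping at rates like 500Hz is typically driven by a fixed-dt accumulator loop; a minimal sketch, with the advance helper and names illustrative:)

    // Sketch: advance the scene at a fixed 500 Hz regardless of render frame rate.
    const float kFixedDt = 1.0f / 500.0f;
    float gAccumulator = 0.0f;

    void advance(physx::PxScene* scene, float frameSeconds)
    {
        gAccumulator += frameSeconds;
        while (gAccumulator >= kFixedDt)
        {
            scene->simulate(kFixedDt);
            scene->fetchResults(true); // block until the sub-step completes
            gAccumulator -= kFixedDt;
        }
    }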

I do have one other random question you might know the answer to, relating to Windows and the GPU memory it shows in Task Manager. Task Manager shows a "Shared GPU memory" value, which for my 2060 Super is 2x my dedicated GPU memory. I would expect shared memory for an integrated GPU, but not for dedicated GPUs. Do PhysX and CUDA use shared memory? I know it would be very slow from a performance standpoint, but I was wondering whether allocations beyond dedicated GPU memory would actually start to use that shared memory, or whether it's just ignored.

Thanks for all of the additional information.

cadop commented 3 months ago

Is this still true, that multiple GPUs can't be used for physics of a single scene?

vreutskyy commented 3 months ago

Hi @cadop. Yes, it's still true. Just in case, the latest version of PhysX is here: https://github.com/NVIDIA-Omniverse/PhysX