NVIDIAGameWorks / PhysX-3.4

NVIDIA PhysX SDK 3.4
https://www.nvidia.com/

Calling PhysX's raycast via OpenMP: multithreading degrades performance #84

Closed: Qinja closed this issue 5 years ago

Qinja commented 5 years ago

I deployed a large scene in PhysX and use raycast for ray detection. After adding OpenMP for multi-core acceleration, overall parallel throughput improves, but the speed of each individual ray detection drops significantly. Below are statistics on the number of threads and the time spent per ray.

i7-4790 (max threads: 8):
1 thread: 0.055 ms per ray
2 threads: 0.057 ms per ray
4 threads: 0.075 ms per ray
8 threads: 0.104 ms per ray

i9-7980XE (max threads: 36):
8 threads: 0.074 ms per ray
36 threads: 0.442 ms per ray

Raycast is a read-only computation over the scene, so why does performance degrade so much in parallel? How can I modify my program to make it better? Thank you.

AlesBorovicka commented 5 years ago

Hi, what PhysX SDK version do you use? Unfortunately, each raycast call also checks for pending query updates, which might lock the tree for write changes. If some updates need to be reflected, they are flushed before each query. We recently updated this code to avoid locking when nothing has changed. The code that you should have is:

void SceneQueryManager::flushUpdates()
{
    PX_PROFILE_ZONE("SceneQuery.flushUpdates", mScene.getContextId());

    if (mPrunerNeedsUpdating)  // <--- this is important

Please check that you have the SDK version with the line above.

Qinja commented 5 years ago

> Unfortunately, each raycast call also checks for pending query updates [...] Please check that you have the SDK version with the line above.

Thank you. My PhysX version is 3.4.2, and I confirmed that the if (mPrunerNeedsUpdating) line is present in the source code. I do not take any locks after setting up my scene, and I never write to the scene again; the scene is not modified during the entire raycast process. I am now trying to add a write lock. =====> Update: according to my test, adding a write lock to the scene after initialization does not help.
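For reference, a minimal sketch of guarding the query with the SDK's scene lock (the PxSceneReadLock RAII helper from PxSceneLock.h; the 1000.0f maximum distance is an arbitrary placeholder). Raycasts only need the read lock, which many threads can hold concurrently, so the lock itself should not serialize the queries:

#include "PxSceneLock.h"

// Take the scene read lock for the duration of one query.
bool RayCastLocked(physx::PxScene* scene, const physx::PxVec3& origin,
                   const physx::PxVec3& unitDir)
{
    physx::PxSceneReadLock scopedLock(*scene);   // released when it goes out of scope
    physx::PxRaycastBuffer hit;                  // receives the closest blocking hit
    return scene->raycast(origin, unitDir, 1000.0f, hit);  // 1000.0f: placeholder max distance
}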

AlesBorovicka commented 5 years ago

If the scene was not modified, flushUpdates should early-exit and there should be no locking. The raycast should then be, as you said, just a read-only operation. Could it be that, with so many threads, lots of context switches happen?

Qinja commented 5 years ago

@Borous I am not sure which context you mean. For the CPU: the entire program does its calculations on the CPU, from a single thread up to the maximum the CPU supports. A single thread is the fastest, and the more threads, the slower it gets; as long as I stay within the maximum thread count, context switching should not make performance drop this much. For PhysX, I only created one PxPhysics, one PxFoundation, etc., so there is no cost from switching between contexts there. But this reminds me that, if there is no better solution, I could create one PhysX scene per thread and let each thread run its raycasts in its own scene, as sketched below.
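A sketch of that idea, assuming the geometry can be duplicated into one PxScene per thread (the scenes vector and the 1000.0f max distance are placeholders for my own setup code):

#include <omp.h>
#include <vector>

// One scene per thread, so no two threads ever query the same pruner tree.
// scenes[i] are assumed to each contain a copy of the same geometry.
void RayCastPerThreadScene(std::vector<physx::PxScene*>& scenes,
                           const physx::PxVec3& origin,
                           const physx::PxVec3& unitDir)
{
#pragma omp parallel num_threads((int)scenes.size())
    {
        physx::PxScene* scene = scenes[omp_get_thread_num()];  // thread-private scene
#pragma omp for
        for (int i = 0; i < 100000; i++)
        {
            physx::PxRaycastBuffer hit;
            scene->raycast(origin, unitDir, 1000.0f, hit);
        }
    }
}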

Qinja commented 5 years ago

I did a simple test, here is my code and the console results.

#include <cstdio>   // printf
#include <ctime>    // clock
#include <omp.h>

void TestMP(int thread_count)
{
    PxVec3 s(1.0f, 1.0f, 1.0f);      // ray origin
    PxVec3 dir(1.0f, 1.0f, 1.0f);    // ray direction
    dir.normalizeFast();

    const clock_t start = clock();
#pragma omp parallel for num_threads(thread_count)
    for (int i = 0; i < 100000; i++)
    {
        scene_mgr->RayCast(s, dir);  // my wrapper around PxScene::raycast
    }
    const clock_t end = clock();

    // Per-ray time scaled by thread_count, to compare against the 1-thread cost
    printf("T%d : %f\n", thread_count, 1.0f * thread_count * (end - start) / 100000);
}

int main()
{
    // ......... prepare the scene .........
    for (int i = 1; i <= 8; i++)
    {
        TestMP(i);
    }
    return 0;
}
T1 : 0.027110
T2 : 0.029180
T3 : 0.032850
T4 : 0.037040
T5 : 0.038450
T6 : 0.043020
T7 : 0.045360
T8 : 0.051200

This test is easy to reproduce. Could you give me some advice after trying it? Thanks a lot.

AlesBorovicka commented 5 years ago

Hmm, there will always be some overhead from the threading, and you do not measure the raycasts directly but an average derived from the total time.

Qinja commented 5 years ago

@Borous I also think that threads have overhead, so I replaced the raycast with other work, such as while (i < 100000) i++;. The results show that the thread overhead is not that large. I also used the CPU profiler built into VS, and its results likewise show that as the number of threads increases, the CPU time spent on a single raycast keeps growing.
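A minimal sketch of that control experiment, in the same file as the test above: identical OpenMP structure, but trivial thread-private work instead of the raycast:

void TestOverhead(int thread_count)
{
    const clock_t start = clock();    // note: MSVC's clock() reports wall time
#pragma omp parallel for num_threads(thread_count)
    for (int i = 0; i < 100000; i++)
    {
        volatile int k = 0;           // volatile keeps the loop from being optimized away
        while (k < 100000) k = k + 1; // stand-in workload with no shared state
    }
    const clock_t end = clock();

    // Same scaled metric as TestMP, for direct comparison
    printf("T%d : %f\n", thread_count, 1.0f * thread_count * (end - start) / 100000);
}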

PierreTerdiman commented 5 years ago

printf("T%d : %f\n", thread_count, 1.0f * thread_count * (end - start) / 100000);

Why do you multiply the time it takes by thread_count?

Qinja commented 5 years ago

@Borous
n threads handle the same total amount of work, so theoretically the time spent should be 1/n of the single-threaded time; multiplying by n converts the measurement back to a single-core equivalent, which is how I compare the efficiency of the multi-threaded runs. The problem I encountered at the beginning was that on the 36-core CPU the time was only about 40% lower than on the 8-core CPU. With the multiplication by thread count, the single-core-equivalent numbers make the multicore efficiency directly comparable.

Qinja commented 5 years ago

For example, suppose a job takes 10 s on a single core but 8 s on a dual core. The time is reduced, but in theory it should reach 5 s; if the measured number were close to that, I would accept thread switching as the explanation. In fact, though, the 36-core run spends roughly triple the CPU resources of the 8-core run yet only reduces the time by 40%, so there must be other reasons. @Borous

PierreTerdiman commented 5 years ago

You lost me a bit.

10s to 5s with 2 threads is a theoretical best case that rarely (if ever) happens in practice, since it ignores any overhead from managing the parallel code.

If I look at your numbers above and I don't multiply the timings by the number of threads, I see:

T1 : 0.027110
T2 : 0.029180 / 2 = 0.01459

Thus the speedup with 2 threads is 0.027110 / 0.01459 = 1.85, which is close enough to the theoretical limit of 2. That sounds ok?

Qinja commented 5 years ago

@Borous Yes, but we should also pay attention to the 8-thread case:

T1 : 0.027110
T8 : 0.051200 / 8 = 0.0064
Speedup: 0.027110 / 0.0064 = 4.23, against a theoretical limit of 8.

And the numbers get even worse with 36 threads.

I understand what you mean. I will do some experiments next to eliminate the overhead of the threads themselves and pin down the real cause of this gap. I suspect that if I use 8 or 36 PxScenes, still driven by OpenMP but with each thread executing raycasts independently in its own scene, the numbers will be much better than they are now. Thank you for your patience. Whatever the cause turns out to be, I will report back as soon as I have results from the next tests. Thank you.
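To make the comparison concrete, here is Pierre's metric applied to my whole table (a small standalone helper; the timings are copied from my console output above):

#include <cstdio>

int main()
{
    // Scaled per-ray timings from TestMP (already multiplied by thread count)
    const double scaled[8] = { 0.027110, 0.029180, 0.032850, 0.037040,
                               0.038450, 0.043020, 0.045360, 0.051200 };
    for (int n = 1; n <= 8; n++)
    {
        const double perRay  = scaled[n - 1] / n;   // actual wall time per ray
        const double speedup = scaled[0] / perRay;  // vs. the single-threaded run
        printf("T%d: speedup %.2f, efficiency %.0f%%\n", n, speedup, 100.0 * speedup / n);
    }
    return 0;
}

At 8 threads the efficiency is only about 53%, and as noted above it drops further on the 36-thread machine.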

PierreTerdiman commented 5 years ago

(Note that I am not @Borous :))

Yes, perfect scalability is difficult, and adding more cores/threads doesn't always give speedups. There are various reasons for that. For example, things like the L2 and L3 caches are often shared by all cores, so effectively the more threads you add, the less cache space each of them gets.
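A rough way to see that effect in isolation, with no PhysX involved: give each thread its own read-only buffer and stream over it. Once the combined working set outgrows the shared L3, the scaled per-thread cost climbs the same way your raycast numbers do. The buffer size and pass count below are arbitrary illustrative choices:

#include <cstdio>
#include <vector>
#include <omp.h>

int main()
{
    const int N = 2 * 1024 * 1024;  // 8 MB of ints per thread (illustrative)
    for (int threads = 1; threads <= 8; threads++)
    {
        // One private read-only buffer per thread
        std::vector<std::vector<int> > buf(threads, std::vector<int>(N, 1));
        long long total = 0;
        const double start = omp_get_wtime();
#pragma omp parallel for num_threads(threads) reduction(+:total)
        for (int t = 0; t < threads; t++)
            for (int pass = 0; pass < 10; pass++)
                for (int i = 0; i < N; i++)
                    total += buf[t][i];  // read-only streaming, loosely like a tree traversal
        const double elapsed = omp_get_wtime() - start;
        // Scale by thread count, mirroring the metric used in this thread
        printf("T%d : %.3f s scaled (checksum %lld)\n", threads, elapsed * threads, total);
    }
    return 0;
}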

Qinja commented 5 years ago

Ok, thank you both. I should calm down; it is time to experiment and find the real reason. If it really is due to the threading itself, I will be more frustrated, because that would mean the performance improvements I can make are even smaller.