PixarAnimationStudios / OpenUSD

Universal Scene Description
http://www.openusd.org

Autodesk: Optimize RenderIndex removal #3025

Open erikaharrison-adsk opened 1 month ago

erikaharrison-adsk commented 1 month ago

Description of Change(s)

There are two parts to this optimization, addressing separate issues in SdfPathTable and Hd_SortedIds.

  1. SdfPathTable: The singly linked list in each node forces a linear search, so deleting items one by one costs O(n^2) time overall. To fix this, we replace it with a doubly linked list (see the first sketch after this list). In addition, storing the value inline in the data structure triggers page faults more frequently as the data size grows, so we store it behind a fixed-size pointer instead.

  2. Hd_SortedIds: The swap-and-pop strategy for deleting single items breaks the sort order in most cases, so subsequent removals fall back to a linear search, or HdRenderIndex::GetRprimIds() has to re-sort the items. To resolve this with minimal impact, we add an index container for fast lookup (see the second sketch after this list) while keeping the current partial-sort strategy for fast insertion and ranged deletion.
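
To make the first point concrete, here is a minimal sketch of the linking change; the struct and function names are illustrative, not the actual SdfPathTable internals:

#include <string>

// Hypothetical node layout, loosely modeled on a path-table entry.
// With only "nextSibling", erasing a child must walk the sibling list
// to find the predecessor: O(children) per erase, O(n^2) for n erases.
// Adding "prevSibling" lets us unlink in O(1).
struct Node {
    std::string key;                 // path component identifying this child
    Node *firstChild = nullptr;
    Node *nextSibling = nullptr;
    Node *prevSibling = nullptr;     // added by the proposed change
};

// O(1) unlink, possible only because every node knows its predecessor.
void UnlinkChild(Node *parent, Node *child) {
    if (child->prevSibling) {
        child->prevSibling->nextSibling = child->nextSibling;
    } else {
        parent->firstChild = child->nextSibling;   // child was the list head
    }
    if (child->nextSibling) {
        child->nextSibling->prevSibling = child->prevSibling;
    }
}

With only a forward link, UnlinkChild would instead have to walk from parent->firstChild to find the predecessor, which is the linear search described above.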
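
For the second point, here is a heavily simplified sketch of the "index container for fast lookup" idea; the class and its members are illustrative, and the actual change keeps Hd_SortedIds' partial-sort strategy, which is not reproduced here:

#include <algorithm>
#include <unordered_map>
#include <vector>
#include <pxr/usd/sdf/path.h>

PXR_NAMESPACE_USING_DIRECTIVE

// Illustrative only: a sorted id list with a side index for O(1) lookup.
class SortedIdsSketch {
public:
    void Insert(const SdfPath &id) {
        _index[id] = _ids.size();
        _ids.push_back(id);          // appended out of order...
        _sorted = false;             // ...so sorting is deferred to GetIds()
    }

    void Remove(const SdfPath &id) {
        auto it = _index.find(id);   // O(1) instead of a linear search
        if (it == _index.end()) return;
        size_t pos = it->second;
        _index.erase(it);
        // Swap-and-pop stays cheap; the index records the moved element's
        // new position so later removals still find it without searching.
        if (pos + 1 != _ids.size()) {
            _ids[pos] = _ids.back();
            _index[_ids[pos]] = pos;
            _sorted = false;
        }
        _ids.pop_back();
    }

    const std::vector<SdfPath> &GetIds() {
        if (!_sorted) {
            std::sort(_ids.begin(), _ids.end());
            for (size_t i = 0; i < _ids.size(); ++i) _index[_ids[i]] = i;
            _sorted = true;
        }
        return _ids;
    }

private:
    std::vector<SdfPath> _ids;
    std::unordered_map<SdfPath, size_t, SdfPath::Hash> _index;
    bool _sorted = true;
};

The point is only that Remove locates an id in O(1) via the side index instead of scanning or binary-searching a possibly out-of-order vector.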

Results

The following is a brief view of what our changes bring to USD. We gain a significant improvement (~3.5x) when the data size goes past 100k: [benchmark chart]

With the change, the difference between running with and without SceneIndex emulation looks reasonable, with a ratio of ~2 and both scaling linearly with the amount of data. [benchmark chart]

As this change is expected to cause no regression, the following are the results even at very small data sizes:

[benchmark charts]

Considering the extra bookkeeping data and the nanosecond time scale, the small differences are within expectations.

Reference

// Benchmark sketch from the analysis; pRenderIndex, pSceneDelegate, tasks,
// taskContext, and isFlattened are assumed to be set up elsewhere.
const int testAmount = 1 << 16;
const std::string prefix = isFlattened ? "/prim_" : "/root/prim_";
SdfPathVector pathList;

// Prepare data set.
for (int i = 0; i < testAmount; ++i)
{
    SdfPath id(prefix + std::to_string(i));
    pRenderIndex->InsertRprim(HdPrimTypeTokens->mesh, pSceneDelegate, id);
    pathList.push_back(id);
}
pRenderIndex->SyncAll(&tasks, &taskContext);

// Timing block
{
    for (int i = 0; i < testAmount; ++i)
        pRenderIndex->RemoveRprim(pathList[i]);
}

pRenderIndex->SyncAll(&tasks, &taskContext);

Corresponding topic in AOUSD:

Fixes Issue(s)

jesschimein commented 1 month ago

Filed as internal issue #USD-9502

jesschimein commented 1 month ago

/AzurePipelines run

azure-pipelines[bot] commented 1 month ago

Azure Pipelines successfully started running 1 pipeline(s).

tcauchois commented 1 month ago

Thanks for the submission! I'm trying to do a close read of these because these changes make a lot of engineering tradeoffs that I'm unsure about, in terms of added memory and indirection, especially with SdfPathTable, since that data structure is used in a lot of ways in a lot of places in our code.

I think your analysis of Hd_SortedIds is correct, though I wonder if just having a "remove list" would be more efficient; e.g. when you remove, queue up to "N" individual removals, and when you get above some threshold or try to read you process the removals by doing the correct memmoves.
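
To make sure we're talking about the same thing, here's a minimal sketch of that deferred-removal idea; the names and the threshold are made up, this isn't existing Hd code:

#include <algorithm>
#include <vector>
#include <pxr/usd/sdf/path.h>

PXR_NAMESPACE_USING_DIRECTIVE

// Illustrative only: batch single removals and apply them in one pass,
// preserving the sorted order of the main vector.
class DeferredRemovalSketch {
public:
    void Remove(const SdfPath &id) {
        _pending.push_back(id);
        if (_pending.size() > _kFlushThreshold) {
            _Flush();
        }
    }

    const std::vector<SdfPath> &GetIds() {
        _Flush();                 // reads always see a consistent, sorted list
        return _ids;
    }

private:
    void _Flush() {
        if (_pending.empty()) return;
        std::sort(_pending.begin(), _pending.end());
        // One pass over _ids: keep everything not in the pending set.
        // This is the "correct memmoves" done all at once.
        auto keep = std::remove_if(_ids.begin(), _ids.end(),
            [this](const SdfPath &p) {
                return std::binary_search(_pending.begin(), _pending.end(), p);
            });
        _ids.erase(keep, _ids.end());
        _pending.clear();
    }

    static constexpr size_t _kFlushThreshold = 1024;  // arbitrary "N"
    std::vector<SdfPath> _ids;       // kept sorted
    std::vector<SdfPath> _pending;   // queued removals
};

Insertion and the interaction with ranged removals are omitted; the point is that the element shuffling happens once per flush rather than once per removed id.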

For SdfPathTable, this datastructure is used in a number of very different places. I'm curious if your analysis was specific to one of the path tables in hydra? e.g. the HdRetainedSceneIndex path table, or the HdSceneIndexAdapterSceneDelegate path table? I'm also wondering how the performance breakdown differed in terms of storing value on the heap vs adding the "parent" pointer to _Entry vs any kind of alignment differences? Storing the value on the heap is something we can (and probably should) do at the callsite. But it sounds like the slowdown that prompted the parent pointer was in SdfPathTable::erase, and not in any traversal in hydra code, correct?

adskWangl commented 1 month ago

Thank you very much for your review! Certainly there is more than one solution to these technical problems, and we are always open to discussion.

I think your analysis of Hd_SortedIds is correct, though I wonder if just having a "remove list" would be more efficient; e.g. when you remove, queue up to "N" individual removals, and when you get above some threshold or try to read you process the removals by doing the correct memmoves.

The reasons we didn't adopt a "remove list" are:

  1. It cannot completely resolve the bottleneck. Since the problem is caused by disordered ids, sorting the ids every time before applying the "remove list" would be expensive and unaffordable. Evaluating and tuning an appropriate acceleration strategy is complicated and use-case dependent, which creates maintenance difficulties. Also, a "remove list" is not free and may itself become another bottleneck.
  2. USD already has similar pending lists in many other places, such as the SceneIndex. Adding another "remove list" with asynchronous processing to a fundamental data structure introduces more complexity.

We also have another idea: replace the current std::vector with a single std::set and offer range access through std::ranges. This could be efficient but would require API changes, so we'd like to get your input on it.
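
Roughly what we have in mind is something like this; it is only a sketch, assuming C++20, and GetSubtreeRange is a hypothetical accessor rather than an existing Hd API:

#include <ranges>
#include <set>
#include <pxr/usd/sdf/path.h>

PXR_NAMESPACE_USING_DIRECTIVE

// Illustrative only: a std::set keeps ids sorted at all times, so single
// insertions and removals are O(log n) and never disturb ordering.
class SetBackedIdsSketch {
public:
    void Insert(const SdfPath &id)  { _ids.insert(id); }
    void Remove(const SdfPath &id)  { _ids.erase(id); }

    // Range access over a subtree without copying into a vector:
    // every path from lower_bound(root) onward that shares the prefix
    // belongs to the subtree.
    auto GetSubtreeRange(const SdfPath &root) const {
        auto first = _ids.lower_bound(root);
        auto last  = first;
        while (last != _ids.end() && last->HasPrefix(root)) ++last;
        return std::ranges::subrange(first, last);
    }

private:
    std::set<SdfPath> _ids;
};

One known trade-off is that std::set only provides bidirectional iterators, not random access.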

For SdfPathTable, this datastructure is used in a number of very different places. I'm curious if your analysis was specific to one of the path tables in hydra? e.g. the HdRetainedSceneIndex path table, or the HdSceneIndexAdapterSceneDelegate path table? I'm also wondering how the performance breakdown differed in terms of storing value on the heap vs adding the "parent" pointer to _Entry vs any kind of alignment differences? Storing the value on the heap is something we can (and probably should) do at the callsite. But it sounds like the slowdown that prompted the parent pointer was in SdfPathTable::erase, and not in any traversal in hydra code, correct?

Sorry if I misled you with the test data. The final analysis was actually done on the data structures themselves once the bottlenecks were located; we wrote our own benchmark cases, similar to existing tests like TestSdfPathTable.

For storing the value in _Entry as a pointer instead of raw data, the main improvement is fewer page faults, plus some small gains when traversing: since only the key matters during traversal, loading the actual data is unnecessary. We also tried creating an _Entry pool, since the entry size is fixed and known at compile time, but unfortunately it made no significant difference.

Another thing I want to highlight is that the RenderIndex has dual data storage: one from the original implementation and one for SceneIndex emulation. Seeing no slowdown in a traversal doesn't always mean SdfPathTable has no problem; it may just mean that the other data structure is being used. Could you please offer more cases for analysis? Specific cases with source code would be really helpful. Thanks!
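
To illustrate the value-as-pointer point, here are two hypothetical entry layouts (not the real SdfPathTable::_Entry): when only keys and links are touched during traversal and erase, keeping the mapped value out-of-line keeps each entry small, so more entries fit per cache line and page.

#include <memory>
#include <string>

// Hypothetical entry layouts, not the actual SdfPathTable internals.

// Value stored inline: the entry's size grows with sizeof(Value),
// so key-only traversal still pulls the value's bytes into cache.
template <class Key, class Value>
struct InlineEntry {
    Key key;
    Value value;
    InlineEntry *next = nullptr;
};

// Value stored behind a fixed-size pointer: the entry stays small no
// matter how large Value is; the value is only touched when dereferenced.
template <class Key, class Value>
struct PointerEntry {
    Key key;
    std::unique_ptr<Value> value;
    PointerEntry *next = nullptr;
};

// Example: for a large mapped value, the pointer-based entry is much smaller.
struct BigValue { char payload[256]; };
static_assert(sizeof(PointerEntry<std::string, BigValue>) <
              sizeof(InlineEntry<std::string, BigValue>),
              "pointer entry is smaller for large values");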

tcauchois commented 2 weeks ago

Just to continue the conversation on Hd_SortedIds: this class is meant to provide a sorted set of paths for use with https://github.com/PixarAnimationStudios/OpenUSD/blob/release/pxr/imaging/hd/primGather.h. The interesting functions there are Subtree() and PredicatedFilter(). Their output (filtered) path sets are then usually used for other operations, for example: all prims -> filter to certain include/exclude paths -> for the filtered prims, generate draw items -> add to a draw batch.

I think for Subtree, a tree structure might be preferable, although we've had a lot of issues with std::set in the past because its API coupled with its balancing invariants just make it really slow for mass edits.

For PredicatedFilter, we dispatch sections of the scene to TBB, so we need some kind of random access iterator, and that's the reason Hd_SortedIds was originally written (to provide a sorted vector). I'm very open to other approaches, though.
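
For context, the dispatch pattern looks roughly like this (a sketch, not the actual HdPrimGather code): TBB's blocked_range splits an index range, which maps directly onto a sorted vector but not onto a node-based container like std::set.

#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>
#include <vector>
#include <pxr/usd/sdf/path.h>

PXR_NAMESPACE_USING_DIRECTIVE

// Illustrative only: filter a sorted id vector in parallel. Each TBB task
// gets an index sub-range, which requires O(1) access by position.
std::vector<char> FilterInParallel(const std::vector<SdfPath> &ids,
                                   const SdfPath &root)
{
    std::vector<char> keep(ids.size(), 0);
    tbb::parallel_for(
        tbb::blocked_range<size_t>(0, ids.size()),
        [&](const tbb::blocked_range<size_t> &r) {
            for (size_t i = r.begin(); i != r.end(); ++i) {
                keep[i] = ids[i].HasPrefix(root) ? 1 : 0;
            }
        });
    return keep;
}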

Our typical remove calls involve unloading a sub-scene, so if we're not tearing down hydra entirely we'll be deleting tens of thousands of prims at a time. They will share a common path prefix, so usually they will be contiguous in all of these data structures.

Other questions: which malloc implementation are you using? And is it possible to share any tests with us, like a remove-performance unit test that has the patterns you care about?

This PR has three pieces: the doubly linked list in SdfPathTable, the pointer-based value storage in SdfPathTable, and the index container in Hd_SortedIds.

Whatever I land will have to be a piece at a time, I think, so that's one of the reasons I'm interested in which pieces had the biggest performance effect.

Thanks!