RobotLocomotion / drake

Model-based design and verification for robotics.
https://drake.mit.edu

Loading too many objs slows down the simulation significantly #13125

Closed. huihuaTRI closed this issue 4 days ago.

huihuaTRI commented 4 years ago

Problem statement: Loading too many obj mesh files for collision geometry will slow down the simulation significantly.

Users rely on an automatic convex decomposition process to generate convex collision mesh files for an object, since Drake supports convex mesh objs. This process makes it possible to programmatically create new items quickly with very detailed collision geometry. However, for an object with a non-trivial contour, it can produce hundreds of objs.

SeanCurtis-TRI commented 4 years ago

Thanks for posting the issue. :)

SeanCurtis-TRI commented 4 years ago

Recap of original problem characterization

Some details from the original (unavailable) slack message (emphasis mine):

I loaded a [redacted] model into the simulation. It's a simple model that has three bodies and two joints. The non-trivial part is that it has about 400 collision mesh files (objs). These objs are generated from an auto convex decomposition process that breaks the [redacted] model into small pieces. The problem is: after loading the [redacted] model, the simulation slows down significantly to a 0.22 real time factor. Without the [model], the simulation can run at a 2.0 real time factor.

I used valgrind to profile the [redacted] simulation. It turns out that 57.3% of the total time is spent in void fcl::detail::supportConvex<double>().

It is worth noting that fcl::detail::supportConvex is a function in FCL that is part of the generic iterative intersection algorithm (Minkowski portal refinement) for contact between two convex shapes. Given a direction vector, it returns the point on a shape that is farthest in that direction. FCL's supportConvex is known to do a linear search over all vertices of the Convex shape type.
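For intuition, here is a minimal sketch (not FCL's actual implementation) of what such a linear-search support function looks like; the function name and use of Eigen here are illustrative only:

```c++
#include <vector>

#include <Eigen/Dense>

// Illustrative sketch only (not FCL's code): a linear-search support function
// scans every vertex and returns the one farthest along the query direction.
// Each call costs O(V), and MPR/GJK invoke it repeatedly per contact pair.
Eigen::Vector3d SupportLinear(const std::vector<Eigen::Vector3d>& vertices,
                              const Eigen::Vector3d& dir) {
  Eigen::Vector3d best = vertices.front();
  double best_dot = best.dot(dir);
  for (const Eigen::Vector3d& v : vertices) {
    const double d = v.dot(dir);
    if (d > best_dot) {
      best_dot = d;
      best = v;
    }
  }
  return best;
}
```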

The simulation used a discrete MBP with a timestep of 2.5 ms.

For the simulation to run with a real time factor of 2X, the full computation (geometry + dynamics) needed to execute completely in no more than 1.25 ms. This is a generous ceiling; due to other overheads in the simulation, the true budget is probably smaller. But this number will be significant (see below). On the other hand, with that fixed time step, a real time factor of 0.22 means that it takes roughly 2.5 / 0.22 ≈ 11.4 ms to get from one discrete update to the next.

Initial hypotheses and results

Hypothesis: high resolution meshes causing excessive cost in supportConvex

Given the above information, I made the assumption that the objs were high resolution (frequently a feature of automatic generation). Thus, repeated linear searches for a support vertex would scale linearly with the number of vertices. Analyzing the meshes provided the following distribution:

| Vertex count | Number of meshes |
|---:|---:|
| 10 | 1 |
| 18 | 2 |
| 120 | 340 |
| 138 | 1 |
| 208 | 2 |
| 338 | 1 |
| 1442 | 1 |
| 16007 | 49 |
| 17449 | 4 |

Table 1: Taxonomy of convex meshes included with the [redacted] model based on mesh complexity.

Roughly 1/8 of the meshes have more than 16,000 vertices, and nearly all of the rest have at least 100. This doesn't prove the hypothesis, but it does suggest it's viable.

Garbage meshes

It turns out that the convex decomposition produced garbage data. When this particular analysis was performed, it was assumed that the meshes were meaningful; only later, while trying to reproduce the reported results, were the details of the meshes investigated. The following was observed:

This is very much a case of horrible input. It does raise the question: can Drake handle garbage input like this better?

Implement a straightforward optimization to FCL's supportConvex for Convex shapes

I modified the Convex class to include knowledge of edge adjacency; given a vertex, it is known which vertices are connected to it by an edge. The supportConvex method then becomes a matter of walking edges in the specified direction. For small meshes (few vertices), this is slightly more expensive; for large meshes it is much cheaper.
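As a rough illustration of the idea (a sketch under assumed data structures, not the actual FCL patch), the support query becomes a hill climb over vertex adjacency:

```c++
#include <vector>

#include <Eigen/Dense>

// Illustrative data structure: vertices plus per-vertex edge adjacency.
struct AdjacentConvexMesh {
  std::vector<Eigen::Vector3d> vertices;
  // neighbors[i] lists the indices of vertices sharing an edge with vertex i.
  std::vector<std::vector<int>> neighbors;
};

// Sketch of "edge-walking": starting from some vertex, repeatedly hop to the
// neighboring vertex with the largest projection onto `dir` until no neighbor
// improves. On a large, watertight convex mesh this visits far fewer than V
// vertices per query.
int SupportEdgeWalk(const AdjacentConvexMesh& mesh, const Eigen::Vector3d& dir,
                    int start = 0) {
  int best = start;
  double best_dot = mesh.vertices[best].dot(dir);
  bool improved = true;
  while (improved) {
    improved = false;
    for (int n : mesh.neighbors[best]) {
      const double d = mesh.vertices[n].dot(dir);
      if (d > best_dot) {
        best_dot = d;
        best = n;
        improved = true;
      }
    }
  }
  return best;
}
```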

I set up a simple experiment with multiple contacts between tessellated spheres of graduated mesh complexity to gauge the impact of the "edge-walking" scheme. The figure below shows the results.

Figure 1: Impact of edge-walking over linear search based on mesh complexity.

As expected, with < 32 vertices, the overhead of walking the edges slightly penalizes the algorithm. However, as the vertex count increases, the overhead vanishes and edge walking provides a distinct benefit; at 1,000 vertices it takes half the time. While the data doesn't extend to meshes with 10,000 vertices, the trends don't suggest that this optimization alone can make up for the roughly 9X loss of performance observed above (a real time factor drop from 2.0 to 0.22).

The two horizontal lines provide reference baselines. Sphere primitive (solid black line) is the cost of replacing the sphere meshes with actual spheres. Computing contact between two spheres is trivial and this represents the cheapest possible query. Sphere GJK (dashed black line) uses mathematical spheres in the general convexity algorithm (GJK). The supportConvex method for Sphere is O(1) and the difference between Sphere primitive and Sphere GJK more or less captures the cost of the GJK algorithm (i.e., it represents a floor for how fast the general convexity algorithm can run).
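For comparison, a sphere's support function is constant-time, which is why the Sphere GJK baseline isolates the cost of the GJK iteration itself. A minimal sketch (illustrative names, not FCL's API):

```c++
#include <Eigen/Dense>

// The support point of a sphere in direction `dir` is just the center pushed
// out by the radius along the normalized direction: an O(1) computation.
Eigen::Vector3d SupportSphere(const Eigen::Vector3d& center, double radius,
                              const Eigen::Vector3d& dir) {
  return center + radius * dir.normalized();
}
```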

Reproduce performance issues

I created a simplified scenario approximating the one reported. The model with the 400 convex meshes was loaded and then a single Box was placed in shallow contact with the model (in the real scenario, three boxes were in near-contact with the model). I decorated the code with high-resolution timers (see the sketch after Table 2) and assessed the cost of evaluating the point-pair contact query, performing 50 collision queries on the scenario in each of three configurations:

| Scenario | PoseUpdate (s) | Filtered (s) | False Positive (s) | True Positive (s) | Broadphase (s) | Total (s) |
|---|---:|---:|---:|---:|---:|---:|
| Convex meshes, FCL master | 0.004957 | 0.1243 | 0.5371 | 0.1706 | 0.07759 | 0.9145 |
| Convex meshes, FCL faster | 0.005738 | 0.1294 | 0.5807 | 0.03612 | 0.08532 | 0.8374 |
| Box collision | 0.003004 | 0.1189 | 0.1503 | 0.0005785 | 0.06969 | 0.3425 |

Table 2: The cost (in seconds) of performing 50 collision queries in the test scenario, broken down by its constituent components.
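For reference, the timing decoration amounts to wrapping each phase with a steady-clock stopwatch; the sketch below shows the general pattern (the phase names mirror the table columns, but the instrumentation points are illustrative, not Drake's actual code):

```c++
#include <chrono>

using Clock = std::chrono::steady_clock;

// Elapsed seconds since `start`, using a monotonic high-resolution clock.
double SecondsSince(const Clock::time_point& start) {
  return std::chrono::duration<double>(Clock::now() - start).count();
}

// Usage pattern around one phase of the query (repeated for broadphase,
// filtering, and the narrowphase true/false-positive buckets):
//   const auto t0 = Clock::now();
//   UpdateGeometryPoses();          // hypothetical phase being measured
//   pose_update_s += SecondsSince(t0);
```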

The constituent components of the query are broken down as follows:

The three rows above show absolute times. The pie charts below show percentages (one pie chart for each row).

Observations

Figure 2: Percentages of compute time for convex meshes with the linear search supportVertex function.

Figure 3: Percentages of compute time for convex meshes with the edge-walking supportVertex function.

Figure 4: Percentages of compute time for box collision objects.

Possible tasks

sherm1 commented 4 years ago

Wow -- awesome study and report Sean, thanks. Is it fair to conclude that for this to run in real time we have to fix the crappy input meshes?

huihuaTRI commented 4 years ago

Thanks for the detailed report. It's amazing and I learned a lot from it.

I am surprised that there are so many collision pairs to work with in the Filtered and False Positive stages. Do those collision pairs come from checking the whole fridge, or just the face next to the counter? More generally, my question is whether the Filtered stage will filter out the vertices that are far away from the other object.

SeanCurtis-TRI commented 4 years ago

The filtered and false positive counts aren't as surprising as you might think.

For 50 queries, we tallied 1,570,700 pairs. For a single query, that is 31,414 pairs. If we assume the broadphase got trashed by all of the overlapping, redundant geometries and we did a simple O(N^2) pairing of the collision geometries, we would only need about 250 geometries to reach that number (for N = 250, there are N(N + 1)/2 = 31,375 pairings).

SeanCurtis-TRI commented 4 years ago

Addendum to the test case.

After adding convex mesh validation code to FCL (a prerequisite for valid edge walking), it was found that at least some of the meshes used in the evaluation weren't watertight; there were cracks in the mesh. This would have two potential effects:

  1. Cracks in what should otherwise be a watertight topology can, in the best case, make the edge walk longer. The optimal path would ordinarily traverse the mesh surface across where a crack lies, but it can't because there is no adjacency across the crack; therefore, it takes a longer path around the crack.
  2. In the worst case, cracks can lead to wrong answers. Because the edge walking is, essentially, a gradient descent, a crack can create a local minimum that traps the walk, producing the wrong answer.

Both of those factors may contribute to a reduction in the performance of the edge walking optimization in this specific scenario.
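For context, one simple watertightness test (a sketch of the kind of validation involved, not the code that was added to FCL) is to require every undirected edge of the triangle mesh to be shared by exactly two faces; a crack shows up as an edge with any other count:

```c++
#include <array>
#include <map>
#include <utility>
#include <vector>

// Sketch only: in a closed, manifold triangle mesh, each undirected edge must
// appear in exactly two triangles. Any other count indicates a crack or other
// defect that would break edge-walking adjacency.
bool IsWatertight(const std::vector<std::array<int, 3>>& triangles) {
  std::map<std::pair<int, int>, int> edge_count;
  for (const auto& tri : triangles) {
    for (int i = 0; i < 3; ++i) {
      int a = tri[i];
      int b = tri[(i + 1) % 3];
      if (a > b) std::swap(a, b);
      ++edge_count[{a, b}];
    }
  }
  for (const auto& entry : edge_count) {
    if (entry.second != 2) return false;
  }
  return true;
}
```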

jwnimmer-tri commented 7 months ago

I wonder what here might be actionable @SeanCurtis-TRI, especially now that we're doing convex hulls?

Is this ticket more about some specific kinds of benchmarks we could add and eventually speed up, or is it about helping flag for users that their input files are pants-on-fire crazy and they are holding it wrong?