dgovil opened 3 weeks ago
So thanks to Damien, we discovered a few things:
The performance issues don't seem to happen on Linux. I need to test that further, but I'm guessing there's some UB or GCC-specific behaviour somewhere in here.
The robin_map hashing code includes an int16 that overflows after 32,767 entries, causing an infinite loop when it wraps around. Switching that to an int64 prevents the overflow for any sample count that USD could otherwise express. We're unsure whether that change has any side effects; if we think it's fine, I can submit a PR for it.
It doesn't solve the performance issue, of course; it just punts it to a much larger sample count. I'm not sure I follow why the samples go through the hashing algorithm at all, other than that time samples happen to be stored in a map. But wouldn't a map already be deduplicated prior to write? Otherwise the USDA file would be invalid too.
Here's a few more images after the change, now going up to 100k samples
Another odd behaviour, though less consequential: writing a single sample is always slower than writing 100. Even 2 samples take half the time of 1 sample for text, and 75% of the time for crate.
Filed as internal issue #USD-10028
I'm looking at this one today and it doesn't repro for me on Linux with Clang -- I modified the script to run up to 1 million samples, and this log/log plot shows good linear scaling with both formats (y-axis is seconds). This is with the 1.3.0 robin_map code, so maybe that fixed it, or maybe this is a mac-specific thing. I'll see about repro'ing on mac next.
I think it might be an arm-specific thing (so it might not repro on an Intel Mac), but perhaps the robin_map change fixed it as well. Just wanted to mention that so you can test the repro on an arm machine.
Thanks Dhruv -- I have an M1 laptop that I'll try with.
Ah okay, this isn't mac specific, but `libc++` specific. My Linux Clang build was using `libstdc++` from gcc. The hash function for `double` in `libc++` just copies the floating-point bit pattern, and the `robin_map` just masks off bits to go from a hash to an index, so the hash table that the crate implementation uses during export gets ~100% collisions, leading to the behavior you saw. I'll fix this up.
Oh very interesting. That's a great catch.
Just changing the hasher to `TfHash` for that map fixes it. With that, the results look nice and linear as expected. Thanks for the excellent repro script and analysis!
Excellent. Thanks again for the quick diagnosis and fix. I'll close this issue out once the commit lands so I can link to it in here.
Description of Issue
We're seeing a reproducible issue across USD 24.2-24.8 (we didn't test earlier versions) where writing crate files becomes dramatically slower as you add more samples.
We've reproduced this fairly reliably on multiple machines and builds of USD. The ratios are always roughly the same and it gives up at about the same count.
Additionally, it completely pegs a single core at 100%. There are roughly 48 threads active at this time on my system (I have 24 cores, so it seems to spawn double the core count?), but most (all but one?) seem to be inactive. It also doesn't use much memory during this time, roughly 77MB.
The chart here shows how much slower than USDA crate becomes as the sample count grows. Granted, my sample sizes start jumping after a bit, but the ratio shows somewhat linear growth until about 35k samples, where the write just won't complete at all. Bear in mind this is a multiplier on the USDA write speed, which is roughly O(n); that makes crate something like O(n^2).
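One way to see how near-total collisions would produce that quadratic shape (a sketch, assuming open addressing where every key lands in the same probe chain): the $i$-th insertion has to walk past the $i$ entries already in the chain, so inserting $n$ samples costs

```latex
\sum_{i=0}^{n-1} i = \frac{n(n-1)}{2} = O(n^2)
```

which would appear as a steadily growing multiplier on top of the otherwise roughly $O(n)$ USDA write.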
Some observations:
Steps to Reproduce
System Information (OS, Hardware)
Package Versions
USD 24.8