PixarAnimationStudios / OpenUSD

Universal Scene Description
http://www.openusd.org

Crate write out gets exponentially slower with high time sample count #3249

Open dgovil opened 3 weeks ago

dgovil commented 3 weeks ago

Description of Issue

We're seeing a reproducible issue across USD 24.2-24.8 (didn't test earlier) where writing crate files becomes dramatically (superlinearly) slower as you add more time samples.

We've reproduced this fairly reliably on multiple machines and builds of USD. The ratios are always roughly the same and it gives up at about the same count.

Additionally, it completely pegs a single core at 100%. It has roughly 48 threads active at this time on my system (I have 24 cores, so it seems to double the core count?) but most (all but one?) seem to be inactive during this. It also doesn't seem to use much memory during this time, roughly 77 MB.

The chart here shows how much slower crate becomes than USDA as the sample count grows. Granted, my sample sizes start jumping after a bit, but it shows somewhat linear growth until about 35k samples, where it just won't complete at all. Bear in mind this is a multiplier on the USDA document write speed, which is roughly O(n). That makes crate something like O(n^2).

[image: chart of crate-vs-USDA export time ratio by sample count]

Some observations:

  1. This doesn't seem to change depending on the data type of the attribute. Only the number of samples seems to change it.
  2. Timesamples are stored with their keys as a DoubleVector, so in theory there should be no compression feeding into this either. Unless I'm mistaken, that should be faster than USDA because it should just be able to dump things linearly as it writes out, without needing to format them... so it's odd that it's slower even at a low sample count.

Steps to Reproduce

  1. Run this script
  2. Get a few coffees
from pxr import Usd, Sdf
import tempfile, os, time

stage = Usd.Stage.CreateInMemory()
prim = stage.DefinePrim("/test")
attr = prim.CreateAttribute("foo", Sdf.ValueTypeNames.Float)

count = (1, 100, 1_000, 10_000, 20_000, 30_000, 35_000, 40_000)

for c in count:
    print(f"Running with {c} samples")

    # Samples accumulate on the same attribute across iterations, so the
    # stage holds exactly c time samples when it is exported below.
    for i in range(c):
        attr.Set(float(i), i)

    print("Done creating stage...now trying to write it out")
    with tempfile.TemporaryDirectory() as tmpdir:
        usdc = os.path.join(tmpdir, "foo.usdc")
        usda = os.path.join(tmpdir, "foo.usda")

        start = time.time()
        stage.Export(usda)
        usda_elapsed = time.time() - start
        print(f"\tExporting Text took {usda_elapsed}s")

        start = time.time()
        stage.Export(usdc)
        crate_elapsed = time.time() - start
        print(f"\tExporting Crate took {crate_elapsed}s")

        print(f"Crate took {crate_elapsed/usda_elapsed}x longer than Text with {c} samples")

System Information (OS, Hardware)

Package Versions

USD 24.8

dgovil commented 2 weeks ago

So thanks to Damien, we discovered a few things:

  1. The performance issues don't seem to happen on Linux. I need to test that one further, but I'm guessing there's some UB or GCC-specific behaviour somewhere in here.

  2. The robin_map hash table includes an int16 that overflows after 32,767 entries, causing an infinite loop when it wraps around. Switching it to an int64 prevents the overflow, at least up to any sample count that USD could otherwise express. We're unsure if there are any side effects of that. If we think it's fine, I can submit a PR for that change.
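The wrap-around failure mode in item 2 can be sketched in a few lines of Python, using ctypes to emulate a C++ int16_t (the variable name and loop shape are hypothetical, not robin_map's actual code):

```python
import ctypes

# Hypothetical sketch: a probe/entry counter kept in a signed 16-bit
# integer silently wraps from 32767 to -32768 instead of growing, so any
# loop that advances it until it exceeds a bound can never terminate.
dist = ctypes.c_int16(32767)   # at the int16 maximum after 32,767 entries
dist.value += 1                # overflow: wraps around
print(dist.value)              # -32768
```

With an int64 counter the same increment just produces 32768, which is why widening the type pushes the failure out past any sample count USD can represent.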

It doesn't solve the performance issue, of course; it just punts it down to a much larger number. I'm not sure I follow why the samples go through the hashing algorithm, though, other than because time samples happen to be stored in a map? But I'd think a map would already be deduplicated prior to write, or the USDA file would be invalid too?

Here's a few more images after the change, now going up to 100k samples

[images: timing charts after the change, now up to 100k samples]

Another odd behaviour, though not as consequential: writing a single sample is always slower than writing 100. Even 2 samples take half the time of 1 sample for text, and 75% of the time for crate.

jesschimein commented 2 weeks ago

Filed as internal issue #USD-10028

gitamohr commented 4 days ago

I'm looking at this one today and it doesn't repro for me on Linux with Clang -- I modified the script to run up to 1 million samples, and this log/log plot shows good linear scaling with both formats (y-axis is seconds). This is with the 1.3.0 robin_map code, so maybe that fixed it, or maybe this is a mac-specific thing. I'll see about repro'ing on mac next.

[image: log/log plot of export time vs. sample count, linear for both formats]

dgovil commented 4 days ago

I think it might be an ARM-specific thing (so it might not repro on an Intel Mac), but perhaps the robin_map change fixed it as well. Just wanted to mention that so you can test the repro on an ARM machine.

gitamohr commented 4 days ago

Thanks Dhruv -- I have an M1 laptop that I'll try with.

gitamohr commented 4 days ago

Ah okay, this isn't mac-specific but libc++-specific. My Linux Clang build was using libstdc++ from GCC. The hash function for double in libc++ just copies the floating-point bit pattern, and robin_map just masks off low bits to go from a hash to a bucket index, so the hash table that the crate implementation uses during export gets ~100% collisions, leading to the behavior you saw. I'll fix this up.
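To see why an identity-style hash degenerates here: for small integral doubles (like frame numbers) the low mantissa bits are all zero, so masking the bit pattern down to a bucket index sends every key to the same slot. A minimal Python sketch (the 256-bucket capacity is an arbitrary assumption for illustration):

```python
import struct

def bit_pattern_hash(x: float) -> int:
    # Emulates a libc++-style hash that just reinterprets the
    # double's 64-bit pattern as the hash value.
    return struct.unpack('<Q', struct.pack('<d', x))[0]

mask = (1 << 8) - 1  # robin_map-style: index = hash & (capacity - 1)
buckets = {bit_pattern_hash(float(t)) & mask for t in range(10_000)}
print(len(buckets))  # 1 -- every integral time sample collides
```

Integral doubles below 2^44 have zero low mantissa bits, so `hash & mask` is 0 for all of them and the table degrades into one long probe chain.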

dgovil commented 4 days ago

Oh very interesting. That's a great catch.

gitamohr commented 3 days ago

Just changing the hasher to TfHash for that map fixes it. With that, the results look nice and linear as expected. Thanks for the excellent repro script and analysis!

[image: timing chart after the TfHash change, linear scaling restored]
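The effect of the fix can be illustrated with a generic 64-bit finalizer (a Murmur3-style stand-in, not TfHash's actual algorithm): mixing the bit pattern before masking spreads the keys across the buckets instead of piling them into one.

```python
import struct

def mix64(h: int) -> int:
    # Murmur3-style finalizer: every input bit affects every output bit.
    h ^= h >> 33
    h = (h * 0xff51afd7ed558ccd) & 0xFFFFFFFFFFFFFFFF
    h ^= h >> 33
    h = (h * 0xc4ceb9fe1a85ec53) & 0xFFFFFFFFFFFFFFFF
    h ^= h >> 33
    return h

def bits(x: float) -> int:
    # The raw 64-bit pattern of the double, as an identity hash would use.
    return struct.unpack('<Q', struct.pack('<d', x))[0]

mask = (1 << 8) - 1  # hypothetical 256-bucket table
raw   = {bits(float(t)) & mask         for t in range(10_000)}
mixed = {mix64(bits(float(t))) & mask  for t in range(10_000)}
print(len(raw), len(mixed))  # raw collapses to 1 bucket; mixed fills ~all 256
```

Any hasher with good avalanche behavior (TfHash included) gives the mixed distribution, which restores O(1) expected probes and the linear export times in the chart above.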

dgovil commented 3 days ago

Excellent. Thanks again for the quick diagnosis and fix. I'll close this issue out once the commit lands so I can link to it in here.