Idea to use BigUint64Array

egonelbre commented 9 months ago

While exploring the project I noticed that the sort was taking a significant amount of time.

https://github.com/antimatter15/splat/blob/d85a1efa260f042ba76314b35af65a1896467242/main.js#L307

I was thinking about how to avoid the separate typed arrays for sorting and realized that it might be possible to pack the content into a single BigUint64Array or BigInt64Array such that the native sort would yield the correct resolve.

So something along the lines of:

const depth = new BigUint64Array(vertexCount);
for (let j = 0; j < vertexCount; j++) {
    let z = viewProj[2] * f_buffer[8 * j + 0] +
                viewProj[6] * f_buffer[8 * j + 1] +
                viewProj[10] * f_buffer[8 * j + 2];
    depth[j] = BigInt(z * scaleToU32 + offsetToU32) << 32n | BigInt(j);
}
depth.sort();

Sorting a BigUint64Array vs the two arrays was about 3x faster on my computer.

However, I'm not quite sure what the appropriate scaling constants would be, but I assume one exists. Or maybe some other transformation that would allow this to work.

Unfortunately I don't have time to experiment with this yourself, but this approach could be an easy perf win compared to pushing sorting to GPU or alternate approaches.

antimatter15 commented 9 months ago

That's a really interesting idea! I wonder if we reduce the precision of depth to 7 bits we can do it with 32 bit ints and whether or not that would be usable

antimatter15 commented 9 months ago

I had been thinking something along the lines of using something like timsort so that the sorting process can leverage existing partially sorted regions, or to pre sort along the x/y/Z axes and somehow having some kind of "merge" like operator that can assemble a sorted list more quickly

egonelbre commented 9 months ago

I do agree that using pre-sorted regions or even just preserving the previously sorted array should speed things up. I'm not sure whether JS has this property, but many usual sorting algorithms work faster on a mostly sorted array.

antimatter15 commented 9 months ago

It depends on the implementation, I think chrome switched to timsort which is adaptive in that way, but in my experience it is still somewhat slow even when sorting a fully sorted list.

JayFoxRox commented 9 months ago

but many usual sorting algorithms work faster on a mostly sorted array.

Yes. I was also working on my own implementation and this was much faster.

or to pre sort along the x/y/Z axes

Yes, also see https://iquilezles.org/articles/volumesort/ (Personally I was also messing around with R-Trees and only sorting within the tree but storing the sort order in some levels)

depth[j] = BigInt(z * scaleToU32 + offsetToU32) << 32n | BigInt(j);

You shouldn't even have to do the scale and offset. IEEE floats are sorted even if interpreted as integer (unless negative, which isn't an issue here). So you can access your floats as a Uint32Array and just shift that exact value in the upper 32 bits.

For WebGL2, you can also do the transformation (for depth calculation) in a transform-feedback shader. Alternatively you can also render into a float texture (for WebGL1 with extensions), but this probably only benefits you if you also do the sorting and indexing on the GPU (otherwise the GPU → CPU → GPU round-trip is probably horribly slow).

antimatter15 commented 9 months ago

Wow that looks great- seems like we could just allocate a BigInt64Array and then initialize a Float32Array with a shared buffer to write the depths (though I don't think we can ignore negatives, because things behind the camera need to still be represented in case you move backwards- though I think it should be fine to just add a fixed number like 1000 and clip numbers that are below zero)

Just calling .sort() and not passing in a lambda seems really important- seems like the function call overhead is very substantial. Seems like sorting an already mostly sorted list can go from hundreds of milliseconds to under a millisecond

And for subsequent calls we can load the same buffer as a UInt32Array and use the odd indices to fetch the appropriate positions and then set the even ones to the float values

Also seems like it would be possible per the inigo technique to check which hectohectant the current view is in and skip the sort altogether if that hasn't changed enough to warrant a re-sort at all.

On Tue, Sep 12, 2023 at 3:50 PM Jannik Vogel @.***> wrote:

but many usual sorting algorithms work faster on a mostly sorted array.

Yes. I was also working on my own implementation and this was much faster.

or to pre sort along the x/y/Z axes

Yes, also see https://iquilezles.org/articles/volumesort/

depth[j] = BigInt(z * scaleToU32 + offsetToU32) << 32n | BigInt(j);

You shouldn't even have to do the scale and offset. IEEE floats are sorted even if interpreted as integer (unless negative, which isn't an issue here). So you can access your floats as a Uint32Array and just shift that exact value in the upper 32 bits.

For WebGL2, you can also do the transformation (for depth calculation) in a transform-feedback shader. Alternatively you can also render into a float texture (for WebGL1 with extensions), but this probably only benefits you if you also do the sorting and indexing on the GPU (otherwise the GPU → CPU → GPU round-trip is probably horribly slow).

— Reply to this email directly, view it on GitHub https://github.com/antimatter15/splat/issues/5#issuecomment-1716622469, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAHKZVMVCUREF5NJT5L6NTX2DRJVANCNFSM6AAAAAA4U2K42M . You are receiving this because you commented.Message ID: @.***>

JayFoxRox commented 9 months ago

(though I don't think we can ignore negatives, because things behind the camera need to still be represented in case you move backwards- though I think it should be fine to just add a fixed number like 1000 and clip numbers that are below zero)

It's adding a bit more work again, but you can do some bit twiddling to avoid this: https://stackoverflow.com/a/43305015

per the inigo technique to check which hectohectant the current view is in and skip the sort altogether if that hasn't changed enough to warrant a re-sort at all.

Note that this technique expects all elements on a grid - we don't have that here. It will work, but you'll lose some correctness. Because most scenes are in 2D (limited verticality), my idea was to sort it in 2D in about 2-8 directions for a rough estimation.. and then sort within those for final quality (also to support verticallity if necessary).

However, that was before I noticed that sorting probably isn't an issue, because sorting already sorted lists can be very fast. So you might not even need the precompution.

antimatter15 commented 9 months ago

I implemented these changes in e7bc5f9

On my computer beforehand the sorting process was taking about 370ms.

Switching to the BigInt64Array made it about 100ms.

Curiously enough it seems like keeping the mostly sorted array between sorts has a negligible effect on performance.

antimatter15 / splat

Idea to use BigUint64Array #5

You shouldn't even have to do the scale and offset. IEEE floats are sorted even if interpreted as integer (unless negative, which isn't an issue here). So you can access your floats as a Uint32Array and just shift that exact value in the upper 32 bits.