Optimize chunk transform streaming

Ralith commented 4 years ago

Currently, every frame, for every chunk, we invoke vkCmdUpdateBuffer with the transform from that chunk to the local node. In a valley, this can add up to hundreds of kilobytes per frame. This is a bit of an abuse of vkCmdUpdateBuffer and may explain the large CPU time spent preparing to render chunks. There are a number of improvements to be made:

[x] Use a ~staging~ mapped buffer ~and transfer command~. This should mitigate driver overhead, and may improve performance substantially all on its own.
[ ] Because the underlying honeycomb is regular, we can drastically reduce the amount of bandwidth used. Store a precomputed table of transforms to the origin node from the chunks surrounding the origin node, out to the maximum view distance, and maintain a buffer of indices mapping the neighborhood of the player to the analogous chunks surrounding the origin. This index buffer is 1/32 the size of the current transform buffer. It would need to be rewritten in full every time the player moves between nodes, but small incremental writes could be used otherwise. This also saves us from doing a bunch of matrix multiplication as we traverse the graph, which might significantly improve traversal performance (currently 2-4 ms/frame).
[x] As of #53, chunk transform information (of whatever nature) can be passed through an instance buffer rather than looked up in a storage buffer, simplifying and perhaps slightly optimizing the vertex shader.
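
A rough sketch of the table-plus-indices idea from the second item, to make the 1/32 figure concrete. This is not code from the repository: `TransformTable`, `IndexBuffer`, and the helper functions are hypothetical names, and the 16-bit index type is an assumption chosen because a 2-byte index against a 64-byte 4x4 `f32` matrix gives exactly that ratio. The "GPU" buffer is just a `Vec` here; the methods return how many bytes a real upload would touch.

```rust
use std::mem::size_of;

/// A 4x4 matrix of f32: the 64 bytes currently streamed per chunk per frame.
pub type Transform = [[f32; 4]; 4];

/// Hypothetical index type. u16 is an assumption that yields the 1/32 ratio.
pub type TableIndex = u16;

/// Computed once and uploaded once: transforms to the origin node from every
/// chunk within the maximum view distance of the origin node.
pub struct TransformTable {
    pub transforms: Vec<Transform>,
}

/// CPU-side stand-in for the per-chunk index buffer the GPU would read.
pub struct IndexBuffer {
    pub indices: Vec<TableIndex>,
}

impl IndexBuffer {
    pub fn new(slots: usize) -> Self {
        Self { indices: vec![0; slots] }
    }

    /// Full rewrite, needed when the player moves to a different node.
    /// Returns the number of bytes a real upload would write.
    pub fn rewrite(&mut self, new_indices: &[TableIndex]) -> usize {
        self.indices.clear();
        self.indices.extend_from_slice(new_indices);
        new_indices.len() * size_of::<TableIndex>()
    }

    /// Incremental write of a single slot. Returns the bytes written.
    pub fn set(&mut self, slot: usize, index: TableIndex) -> usize {
        self.indices[slot] = index;
        size_of::<TableIndex>()
    }

    /// Resolve a slot to its transform, as the vertex shader would.
    pub fn resolve<'a>(&self, table: &'a TransformTable, slot: usize) -> &'a Transform {
        &table.transforms[self.indices[slot] as usize]
    }
}

/// Bytes uploaded per frame when streaming one full matrix per chunk.
pub fn matrix_stream_bytes(chunks: usize) -> usize {
    chunks * size_of::<Transform>()
}

/// Bytes needed for a full rewrite of the index buffer.
pub fn index_stream_bytes(chunks: usize) -> usize {
    chunks * size_of::<TableIndex>()
}
```

With an illustrative 4096 visible chunks (not a measured number), `matrix_stream_bytes(4096)` is 262144 bytes per frame, while a full index rewrite via `index_stream_bytes(4096)` is 8192 bytes and only happens on a node change.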

Ralith commented 4 years ago

Partially fixed by #63. CPU use during graph traversal remains significant, but performance is much improved overall.

Ralith commented 4 years ago

The precomputed transform table could also potentially form a foundation for removing per-chunk draw calls, in favor of a multi-draw-indirect with compute frustum culling.
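
For illustration, a CPU-side sketch of what the culling pass might emit. This is only an assumption about one possible shape, not a design decision: `ChunkSlot` and `cull_to_indirect` are hypothetical names, the draws are assumed to be non-indexed, and the visibility test is passed in as a closure so the sketch makes no claim about how frustum tests would work in hyperbolic space. The struct mirrors the field order of Vulkan's `VkDrawIndirectCommand`. Culled chunks keep their slot but get `instance_count = 0`, which avoids compaction, and `first_instance` carries the slot so the shader can still find the chunk's transform.

```rust
/// Same field order and layout as Vulkan's VkDrawIndirectCommand.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct DrawIndirectCommand {
    pub vertex_count: u32,
    pub instance_count: u32,
    pub first_vertex: u32,
    pub first_instance: u32,
}

/// Hypothetical per-chunk geometry range within a shared vertex buffer.
pub struct ChunkSlot {
    pub vertex_count: u32,
    pub first_vertex: u32,
}

/// CPU stand-in for a compute culling pass: one command per chunk slot,
/// with instance_count zeroed for chunks that fail the visibility test.
pub fn cull_to_indirect(
    chunks: &[ChunkSlot],
    is_visible: impl Fn(usize) -> bool,
) -> Vec<DrawIndirectCommand> {
    chunks
        .iter()
        .enumerate()
        .map(|(slot, chunk)| DrawIndirectCommand {
            vertex_count: chunk.vertex_count,
            // 1 if visible, 0 if culled; a zero-instance draw renders nothing.
            instance_count: u32::from(is_visible(slot)),
            first_vertex: chunk.first_vertex,
            // The slot doubles as the instance index used to look up the transform.
            first_instance: slot as u32,
        })
        .collect()
}
```

The resulting array is what a single vkCmdDrawIndirect call would consume in place of one draw call per chunk.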

Ralith / hypermine

Optimize chunk transform streaming #55