Reworked this to be more robust and deterministic. The basic idea is to select all of the heaps that are DEVICE_LOCAL and of sufficient size, and use those for allocating "storage" (i.e. device) memory, which is used for Buffers and compute operations. "Mapping" memory is then selected in a similar way, from heaps that are not DEVICE_LOCAL, sorted so that we prefer CPU_CACHED and require CPU_VISIBLE and COHERENT. The trick is that some drivers, at least the Microsoft Basic Render Driver (used for testing on GitHub Actions), have just one heap, which is DEVICE_LOCAL. So when there is no non-DEVICE_LOCAL heap, we don't require mapping memory to be non-DEVICE_LOCAL. Mapping memory is used for staging buffers for both writes and reads.
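A minimal sketch of that selection logic, using hypothetical stand-in types rather than the crate's actual MemoryProperties (the real implementation works on the driver-reported memory types and heaps):

```rust
// Hypothetical stand-in for the flags/heap info the driver reports.
#[derive(Clone, Copy, Debug)]
struct MemoryTypeInfo {
    heap_size: u64,
    device_local: bool,
    cpu_visible: bool,
    coherent: bool,
    cpu_cached: bool,
}

/// "Storage" (device) memory: DEVICE_LOCAL and backed by a heap of sufficient size.
fn select_storage(types: &[MemoryTypeInfo], min_heap_size: u64) -> Vec<usize> {
    types
        .iter()
        .enumerate()
        .filter(|(_, t)| t.device_local && t.heap_size >= min_heap_size)
        .map(|(i, _)| i)
        .collect()
}

/// "Mapping" (staging) memory: require CPU_VISIBLE and COHERENT, prefer CPU_CACHED,
/// and prefer non-DEVICE_LOCAL. If every type is DEVICE_LOCAL (e.g. Microsoft Basic
/// Render Driver), drop the non-DEVICE_LOCAL requirement.
fn select_mapping(types: &[MemoryTypeInfo], min_heap_size: u64) -> Vec<usize> {
    let has_non_device_local = types.iter().any(|t| !t.device_local);
    let mut candidates: Vec<usize> = types
        .iter()
        .enumerate()
        .filter(|(_, t)| {
            t.cpu_visible
                && t.coherent
                && t.heap_size >= min_heap_size
                && (!has_non_device_local || !t.device_local)
        })
        .map(|(i, _)| i)
        .collect();
    // Sort so that CPU_CACHED types come first.
    candidates.sort_by_key(|&i| !types[i].cpu_cached);
    candidates
}

fn main() {
    // Single DEVICE_LOCAL type, as on the Microsoft Basic Render Driver:
    // mapping memory falls back to the DEVICE_LOCAL type.
    let basic_render = [MemoryTypeInfo {
        heap_size: 1 << 30,
        device_local: true,
        cpu_visible: true,
        coherent: true,
        cpu_cached: true,
    }];
    assert_eq!(select_storage(&basic_render, 1 << 20), vec![0]);
    assert_eq!(select_mapping(&basic_render, 1 << 20), vec![0]);
}
```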
In theory, for configurations with memory that is both DEVICE_LOCAL and CPU_VISIBLE, the host can write to and read from that memory directly (so long as the GPU isn't using it), which saves a copy. However, this is very complex, and in fact the storage allocator aliases Buffers so that temporaries can be reused, which isn't possible for writes and reads, which have to stay fully allocated for a given frame. The allocation scheme is different too, because of the different usage pattern, so it would be difficult to take advantage of being able to, say, write directly into a buffer from the host, use it in a compute shader, and read back the results without staging buffers. In general the assumption is that this overhead is small when the device has to do a significant amount of work with the data, but for inference on mobile it may be a consideration.
I added tests to verify expected behavior on each of my dev platforms, and I think this should work on anything that is actually supported. Unfortunately I can't find the relevant info to create additional tests without extracting the memory config manually via cargo test allocator_config_diagnostic -- --ignored --nocapture, which prints out the MemoryProperties; it's then simple to construct a unit test for those properties.
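For illustration, a unit test built from that diagnostic output might look like the following, reusing the MemoryTypeInfo stand-in from the sketch above (the actual tests are written against the crate's MemoryProperties, and the numbers here are made up, not from a real device):

```rust
#[test]
fn allocator_config_discrete_gpu() {
    // Values transcribed from the diagnostic's printed memory config
    // (illustrative only).
    let types = [
        // Device-local VRAM heap.
        MemoryTypeInfo { heap_size: 8 << 30, device_local: true, cpu_visible: false, coherent: false, cpu_cached: false },
        // Host-visible system memory heap.
        MemoryTypeInfo { heap_size: 16 << 30, device_local: false, cpu_visible: true, coherent: true, cpu_cached: true },
    ];
    assert_eq!(select_storage(&types, 1 << 20), vec![0]);
    assert_eq!(select_mapping(&types, 1 << 20), vec![1]);
}
```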
Fixes #46