Try / Tempest

API abstraction layer for 3D graphics, UI and sound. Written in C++17 with Vulkan, DX12 and Metal support.
MIT License

automatic PipelineBarriers investigation #27

Open Try opened 2 years ago

Try commented 2 years ago

This ticket tracks ideas/solutions for pipeline barrier generation. It is written mostly from the Vulkan perspective, yet DirectX 12 also has to be taken into consideration.

Strategy:

All image resources are assumed to be in a default 'ready-to-read' state.

The destination stage is ALL_COMMANDS (except for depth) for now. Compute: all storage resources are tracked individually; if the current pipeline has an unordered access, a new set of barriers has to be issued.

Code sample:

auto cmd  = device.commandBuffer();
{
  auto enc = cmd.startEncoding(device);
  // Assume 'ready-to-read' state
  enc.dispatch(...);
  // tex1: ALL -> COLOR
  enc.setFramebuffer({{tex1,Vec4(0,0,1,1),Tempest::Preserve}});
  enc.draw(...);
  // tex1: COLOR -> ALL 
  // tex2: ALL -> COLOR
  enc.setFramebuffer({{tex2,Tempest::Discard,Tempest::Preserve}});
  enc.draw(...);
  // tex2: COLOR -> ALL 
}
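
The ALL -> COLOR and COLOR -> ALL comments above can be derived mechanically from the sequence of framebuffer changes. Below is a minimal sketch of that bookkeeping, not the engine's actual implementation: `ImageTracker`, `Transition`, `ImgState` and the integer image ids are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical coarse image states, mirroring the ALL/COLOR comments above.
enum class ImgState : uint8_t { All, Color };

struct Transition {
  uint32_t image;  // hypothetical resource id
  ImgState from, to;
};

// Remembers which images are bound as color attachments and returns the
// transitions implied by each setFramebuffer-like call.
class ImageTracker {
public:
  // Ends the previous pass (COLOR -> ALL), then starts the new one (ALL -> COLOR).
  std::vector<Transition> setFramebuffer(const std::vector<uint32_t>& attachments) {
    std::vector<Transition> out = endPass();
    for (uint32_t id : attachments) {
      out.push_back({id, ImgState::All, ImgState::Color});
      active_.push_back(id);
    }
    return out;
  }

  // End of pass or end of encoding: attachments return to 'ready-to-read'.
  std::vector<Transition> endPass() {
    std::vector<Transition> out;
    for (uint32_t id : active_)
      out.push_back({id, ImgState::Color, ImgState::All});
    active_.clear();
    return out;
  }

private:
  std::vector<uint32_t> active_;  // attachments of the current pass
};
```

Calling `setFramebuffer({1})`, then `setFramebuffer({2})`, then `endPass()` reproduces the three comment groups in the sample: tex1 ALL -> COLOR; tex1 COLOR -> ALL plus tex2 ALL -> COLOR; tex2 COLOR -> ALL.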

Points of interest

Optimize:

Problems:

  1. Readable depth doesn't "fit" into this paradigm.
  2. COLOR -> ALL pipeline bubble, for shadow maps (and most other rendering scenarios)
  3. [DX12] UAV barriers are not allowed inside a render pass.

Api-limitations

7.9. Host Write Ordering Guarantees

When batches of command buffers are submitted to a queue via a queue submission command, it defines a memory dependency with prior host operations, and execution of command buffers submitted to the queue.

This makes things easier on the resource upload/uniform buffer side, yet a command buffer still has to account for any commands submitted before it.

7.6. Pipeline Barriers

If vkCmdPipelineBarrier2KHR is recorded within a render pass instance, the synchronization scopes are limited to operations within the same subpass.

This may cause trouble if barriers are delayed.

7.6.1. Subpass Self-dependency

vkCmdPipelineBarrier or vkCmdPipelineBarrier2KHR must not be called within a render pass instance started with vkCmdBeginRenderingKHR.

Since VK_KHR_dynamic_rendering is the go-to extension, barriers must not be issued inside a render pass. This limitation basically blocks any split-barrier or partial-barrier approach.
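
One way to live with this restriction is to queue barriers and flush them only at points where recording them is legal. The sketch below is a simulation under that assumption, without any Vulkan calls: `Recorder`, `requestBarrier` and the string log are hypothetical and only show the resulting command order.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical recorder: stands in for a real command buffer and only logs
// the order in which commands would be recorded.
class Recorder {
public:
  // Queue a barrier instead of recording it immediately.
  void requestBarrier(const std::string& what) { pending_.push_back(what); }

  void beginRendering() {
    assert(!inPass_);
    flush();  // last point where a barrier may be recorded before the pass
    log_.push_back("beginRendering");
    inPass_ = true;
  }

  void draw() {
    assert(inPass_);
    log_.push_back("draw");
  }

  void endRendering() {
    assert(inPass_);
    log_.push_back("endRendering");
    inPass_ = false;
    flush();  // barriers requested during the pass can only land here
  }

  void dispatch() {
    assert(!inPass_);
    flush();
    log_.push_back("dispatch");
  }

  const std::vector<std::string>& log() const { return log_; }

private:
  void flush() {
    for (const std::string& b : pending_) log_.push_back("barrier:" + b);
    pending_.clear();
  }

  std::vector<std::string> pending_;
  std::vector<std::string> log_;
  bool inPass_ = false;
};
```

Note that a barrier requested between two draws of the same pass ends up after `endRendering`, so it cannot protect the second draw. Every dependency a pass needs therefore has to be known before the pass begins.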

Try commented 2 years ago

Compute cases:

auto cmd  = device.commandBuffer();
{
  auto enc = cmd.startEncoding(device);
  // ? -> write
  enc.dispatch(&buf); // WAR barrier
  // write -> read
  enc.dispatch(const &buf); // RAW barrier
  // read -> read
  enc.dispatch(const &buf); // no barrier
}

auto cmd  = device.commandBuffer();
{
  auto enc = cmd.startEncoding(device);
  // ? -> read
  enc.dispatch(const &buf); // no barrier
}

auto cmd  = device.commandBuffer();
{
  auto enc = cmd.startEncoding(device);
  // ? -> write
  enc.dispatch(&buf); // WAR barrier
  // write -> read, since we don't know what followup submission is going to be
}

auto cmd  = device.commandBuffer();
{
  auto enc = cmd.startEncoding(device);
  // ? -> write
  enc.dispatch(&buf);
  // usage of buf in rendering?
  enc.setFramebuffer(...);
  enc.draw(&buf);
}
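
The first three cases can be condensed into a small per-buffer state machine. This is a minimal sketch, assuming (as stated in the PS below) that a buffer without history was at most read by previous submissions; `BufferTracker`, `Access`, `Hazard` and the integer buffer ids are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

enum class Access : uint8_t { Read, Write };
enum class Hazard : uint8_t { NoBarrier, RAW, WAR, WAW };

// Classifies the dependency between the previous and the next access.
constexpr Hazard classify(Access prev, Access next) {
  if (prev == Access::Write)
    return next == Access::Read ? Hazard::RAW : Hazard::WAW;
  // prev is a read: only a following write is a hazard
  return next == Access::Write ? Hazard::WAR : Hazard::NoBarrier;
}

class BufferTracker {
public:
  // Returns the barrier kind required before this access.
  Hazard use(uint32_t buf, Access next) {
    auto it = last_.find(buf);
    // Assumption: unknown history ('?') is treated as a read from an
    // earlier submission.
    const Access prev = (it == last_.end()) ? Access::Read : it->second;
    last_[buf] = next;
    return classify(prev, next);
  }

  // End of command buffer: buffers left in a written state need a trailing
  // write -> read barrier, since the follow-up submission is unknown.
  std::vector<uint32_t> finish() {
    std::vector<uint32_t> out;
    for (const auto& kv : last_)
      if (kv.second == Access::Write) out.push_back(kv.first);
    last_.clear();
    return out;
  }

private:
  std::map<uint32_t, Access> last_;  // last access per buffer id
};
```

The fourth case (a buffer written by compute and then used in rendering) is not covered by this sketch, because it also depends on which pipeline stages are involved.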

Transfer

Transferring here is very much engine magic, so barriers can be explicit. Magic functions: device.vbo, device.ibo, device.ssbo, device.loadTexture, device.readPixels, device.readBytes, Buffer::update.

Command buffer functions: copy(const Attachment& src, uint32_t mip, StorageBuffer& dest, size_t offset); void generateMipmaps(Attachment& tex);

PS: At this point the presumption is that, before command buffer execution, buf had either no access or read-only access from previous submissions.

Try commented 2 years ago

Corner case with compute pre-pass:

auto cmd  = device.commandBuffer();
{
  auto enc = cmd.startEncoding(device);
  // ALL -> COMP 
  enc.dispatch(&buf);
  // COMP -> ALL
  enc.setFramebuffer(...);
  enc.draw(&buf); 
}

Issuing an ALL barrier is fine for DX12 (it's just a UAV barrier), but not for Vulkan, especially not on TBR hardware. As a side idea:

  1. Skip WAR barriers at command recording, but store them for later
  2. Add a separate tracker at device level, to issue a fine-grained barrier: [HISTORY] -> COMP
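
The two steps could be sketched as follows. This is only one possible reading of the idea, not code from the engine: `CmdSummary`, `DeviceTracker` and the integer resource ids are hypothetical. The command buffer stores only its first and last access per resource; the comparison against history is postponed until submit.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <vector>

enum class Access : uint8_t { Read, Write };

// Per-command-buffer summary: first and last access of every resource.
struct CmdSummary {
  struct Use { Access first, last; };
  std::map<uint32_t, Use> uses;

  void touch(uint32_t res, Access a) {
    auto it = uses.find(res);
    if (it == uses.end())
      uses.emplace(res, Use{a, a});  // barrier against history is deferred
    else
      it->second.last = a;
  }
};

// Device-level history, consulted at submit time.
class DeviceTracker {
public:
  // Returns the resources that need a [HISTORY] -> first-use barrier
  // before `cmd` runs, then updates the history.
  std::vector<uint32_t> submit(const CmdSummary& cmd) {
    std::vector<uint32_t> barriers;
    for (const auto& kv : cmd.uses) {
      auto it = history_.find(kv.first);
      const bool known     = it != history_.end();
      const bool prevWrite = known && it->second == Access::Write;
      const bool war       = known && kv.second.first == Access::Write;
      // Read after read, or a resource without history, needs no barrier.
      if (prevWrite || war) barriers.push_back(kv.first);
      history_[kv.first] = kv.second.last;
    }
    return barriers;
  }

private:
  std::map<uint32_t, Access> history_;  // last access per resource id
};
```

The sketch assumes that hazards inside a single command buffer are already handled at recording time, so only the first use has to be checked against history.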
Try commented 2 years ago

New idea for dependency tracking/barrier emission.

For buffers and storage images a layout transition is not needed, so coarse-grained tracking is possible. For each 'important' shader stage the engine now tracks a bitmask of read resources and a bitmask of written resources. This allows testing whether a barrier is needed with a few bit operations, at read/write granularity.

Currently (commit 5cf3ca3) the engine distinguishes compute vs graphics stages, producing nice comp-comp barriers for dispatch and comp-graphics barriers at the start of a render pass.
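
A minimal sketch of such a bitmask test is shown below. It assumes that every resource is assigned one bit of a 64-bit mask and that an emitted barrier covers all earlier accesses of its source stage; `HazardMask`, `StageAccess` and the two-stage split are hypothetical and not taken from the engine source.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stage split: compute vs graphics.
enum Stage : uint8_t { Compute = 0, Graphics = 1, StageCount = 2 };

// One bit per resource (assumption: at most 64 tracked resources).
struct StageAccess {
  uint64_t read  = 0;
  uint64_t write = 0;
};

class HazardMask {
public:
  // Returns a bitmask of source stages that `dst` has to wait for before it
  // performs the given reads/writes, then records the new access.
  uint32_t use(Stage dst, uint64_t reads, uint64_t writes) {
    uint32_t srcStages = 0;
    for (int s = 0; s < StageCount; ++s) {
      const bool rawOrWaw = (acc_[s].write & (reads | writes)) != 0;
      const bool war      = (acc_[s].read & writes) != 0;
      if (rawOrWaw || war) {
        srcStages |= 1u << s;
        // Assumption: the barrier is coarse and covers everything the
        // source stage did so far, so its masks can be reset.
        acc_[s] = StageAccess{};
      }
    }
    acc_[dst].read  |= reads;
    acc_[dst].write |= writes;
    return srcStages;
  }

private:
  StageAccess acc_[StageCount];
};
```

With this layout, a dispatch that reads what an earlier dispatch wrote reports the Compute bit (a comp-comp barrier), and a draw that reads a compute-written resource reports the Compute bit for the Graphics destination (a comp-graphics barrier).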