Further optimize certain copies

Now that there is only a single DxvkContext, we can be a bit more aggressive about pulling copies into the init command buffer, which reduces the number of barriers and/or render passes in some cases, ideally without discarding the destination resource.

Previously, we would only do this for full buffer uploads; the new code also covers image uploads and actual GPU->GPU buffer copies.
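To illustrate the general idea, here is a minimal, self-contained sketch. It is not the code from this PR: `Resource`, `CommandList`, `Target`, and the `lastUseId` tracking scheme are all hypothetical stand-ins, and the sketch assumes that a copy may be hoisted only when neither resource has been touched in the current command list yet.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only; these are not DXVK types.
struct Resource {
  uint64_t lastUseId = 0;  // ID of the last command list that touched this resource
};

enum class Target { Init, Exec };

struct CommandList {
  uint64_t id = 1;
  std::vector<std::string> init;  // executes before everything recorded in exec
  std::vector<std::string> exec;
  bool renderPassActive = false;
  uint32_t renderPassBreaks = 0;

  bool usedInThisList(const Resource& r) const { return r.lastUseId == id; }
  void track(Resource& r) { r.lastUseId = id; }

  void draw(Resource& target) {
    if (!renderPassActive) {
      exec.push_back("begin render pass");
      renderPassActive = true;
    }
    exec.push_back("draw");
    track(target);
  }

  Target copy(Resource& dst, Resource& src) {
    // Assumption: hoisting is safe only if nothing recorded so far in this
    // command list has read or written either resource.
    if (!usedInThisList(dst) && !usedInThisList(src)) {
      init.push_back("copy");
      // Conservatively mark both as used so that later copies involving
      // them take the regular path and stay ordered after this one.
      track(dst);
      track(src);
      return Target::Init;
    }

    // Regular path: the copy interrupts the render pass and needs a barrier.
    if (renderPassActive) {
      exec.push_back("end render pass");
      renderPassActive = false;
      renderPassBreaks += 1;
    }
    exec.push_back("barrier");
    exec.push_back("copy");
    track(dst);
    track(src);
    return Target::Exec;
  }
};
```

In this sketch, a copy between two untouched resources lands in `init` and leaves an active render pass alone, while a copy into a resource that was already drawn to falls back to `exec` and breaks the render pass.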

An alternative implementation could be based on barrier tracking and be more granular, but since that would be much more expensive, I don't think it's warranted, especially since the issue we're trying to solve isn't all that common to begin with.

There's probably more we can do here (moving image ClearUAV perhaps, or at least the related barriers), but that needs further investigation.
