I'll just copy-paste Jayce's explanation because I don't fully understand it.
The underlying problem was that all of the CUDA operations are done asynchronously. Normally Bifrost forces a sync when the data crosses a memory space boundary. However, when using a CUDA-based ring, you end up with data that has been "written" but isn't actually there until the memcpy completes. The reserve on the ring doesn't know about this, so when that reserve is released the next block starts to read. That read can start before the copy finishes.
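For anyone else trying to picture this, here is a toy model of the race as I read the explanation. It is not Bifrost or CUDA code: a Python thread stands in for the asynchronous memcpy, a plain list stands in for the reserved ring span, and `async_copy`, `write_then_read`, and `sync_before_release` are made-up names for illustration only.

```python
import threading
import time


def async_copy(dst, src, delay=0.05):
    """Start a background copy and return immediately, like an async memcpy."""
    def worker():
        time.sleep(delay)  # stand-in for the time the copy takes
        dst[:] = src       # the data only lands in the span here
    t = threading.Thread(target=worker)
    t.start()
    return t


def write_then_read(sync_before_release):
    ring_span = [0, 0, 0, 0]  # hypothetical reserved span in the ring
    pending = async_copy(ring_span, [1, 2, 3, 4])
    if sync_before_release:
        pending.join()  # wait for the copy first, playing the role of a sync
    # The reserve is released at this point, so the next block starts reading.
    seen = list(ring_span)
    pending.join()  # let the worker finish in both cases
    return seen


stale = write_then_read(sync_before_release=False)
fresh = write_then_read(sync_before_release=True)
print("without sync:", stale)
print("with sync:   ", fresh)
```

Without the wait, the reader snapshots the span while the copy is still in flight. With the wait before the release, it sees the copied data. In the real code the wait would presumably be a CUDA stream or device synchronize before the reserve is released, but that part is my assumption.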