bchess closed this 7 months ago.
@bchess How does this affect tensors that are already on the CPU?
Not quite sure what you're referring to. This function is only called for CUDA tensors, so it has no effect on tensors that are already on the CPU.
Similar to plaid mode, this reuses a single pinned host buffer for the GPU-to-CPU data transfers during serialization.
On main, serializing gpt-j-6B in fp16 to NVMe took 8.375 s; on this branch, it takes 4.796 s.
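A minimal sketch of the reusable pinned-buffer idea, assuming PyTorch. This is not tensorizer's code: `serialize_tensors` is a hypothetical name, it writes only raw tensor bytes (no headers or metadata), and it leaves out everything specific to plaid mode. The staging buffer is allocated once, sized for the largest tensor, and every device-to-host copy lands in a slice of it before being written out. When CUDA is unavailable, the buffer is left unpinned so the sketch still runs on CPU-only machines.

```python
import io

import torch


def serialize_tensors(tensors, out):
    """Hypothetical helper: write each tensor's raw bytes to `out`,
    staging every copy through one reusable host buffer."""
    use_cuda = torch.cuda.is_available()
    # Allocate the staging buffer once, sized for the largest tensor.
    # Pin it only when CUDA is present; pinning needs a CUDA runtime.
    max_bytes = max(t.numel() * t.element_size() for t in tensors)
    staging = torch.empty(max_bytes, dtype=torch.uint8, pin_memory=use_cuda)

    for t in tensors:
        nbytes = t.numel() * t.element_size()
        # Reinterpret the (contiguous, flattened) tensor as raw bytes.
        src = t.contiguous().view(-1).view(torch.uint8)
        dst = staging[:nbytes]
        # Copy device -> host into the reused buffer; with a pinned
        # destination the copy can run asynchronously.
        dst.copy_(src, non_blocking=use_cuda)
        if use_cuda:
            # Wait for the async copy before reading the buffer, and before
            # the next iteration overwrites it.
            torch.cuda.synchronize()
        # Write straight from the staging buffer without an extra copy.
        out.write(dst.numpy().data)


buf = io.BytesIO()
serialize_tensors(
    [torch.arange(6, dtype=torch.float16), torch.tensor([1, 2, 3], dtype=torch.int32)],
    buf,
)
```

The point of the single buffer is that pinned allocations are expensive, so paying for one up front instead of one per tensor avoids repeated page-locking. The trade-off is a synchronization per tensor, because the buffer cannot be refilled until its previous contents have been written out.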