coreweave / tensorizer

Module, Model, and Tensor Serialization/Deserialization
MIT License

perf(serialization) Reuse a pinned buffer when copying from gpu #98

Closed bchess closed 7 months ago

bchess commented 7 months ago

Similar to plaid mode, reuse a single pinned buffer for the GPU-to-CPU data transfer during serialization.

On main, serializing gpt-j-6B fp16 to NVMe took 8.375s. On this branch, it takes 4.796s.
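
For readers unfamiliar with the technique, here is a minimal sketch of reusing one pinned host buffer for device-to-host copies in PyTorch. It is not the code from this PR: the class name `PinnedStagingBuffer` and its `stage` method are hypothetical, and the sketch assumes the buffer is sized for the largest tensor up front. When CUDA is unavailable it falls back to ordinary pageable memory so the example still runs.

```python
import torch


class PinnedStagingBuffer:
    """Hypothetical helper: one reusable host buffer for GPU-to-CPU copies."""

    def __init__(self, nbytes: int):
        # Pinned (page-locked) memory requires CUDA; otherwise use pageable memory.
        self._buf = torch.empty(
            nbytes, dtype=torch.uint8, pin_memory=torch.cuda.is_available()
        )

    def stage(self, tensor: torch.Tensor) -> torch.Tensor:
        """Copy `tensor` into the shared buffer and return a CPU view of it.

        The returned view aliases the buffer, so it is only valid until the
        next call to stage().
        """
        nbytes = tensor.numel() * tensor.element_size()
        if nbytes > self._buf.numel():
            raise ValueError("tensor is larger than the staging buffer")
        # Reinterpret the leading bytes of the buffer with the source dtype and shape.
        view = self._buf[:nbytes].view(tensor.dtype).view(tensor.shape)
        view.copy_(tensor)  # synchronous device-to-host copy into the reused buffer
        return view


device = "cuda" if torch.cuda.is_available() else "cpu"
tensors = [
    torch.arange(6, dtype=torch.float16, device=device).reshape(2, 3),
    torch.ones(4, dtype=torch.float32, device=device),
]
staging = PinnedStagingBuffer(max(t.numel() * t.element_size() for t in tensors))
for t in tensors:
    host = staging.stage(t)
    # A real serializer would write `host` out here, before the next stage() call.
    assert torch.equal(host, t.cpu())
```

The design choice being illustrated is that the page-locked allocation happens once rather than once per tensor; each tensor's bytes must be consumed before the buffer is overwritten by the next copy.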

wbrown commented 7 months ago

@bchess How does this affect tensors that are already on the CPU?

bchess commented 7 months ago

> @bchess How does this affect tensors that are already on the CPU?

Not quite sure what you're referring to. This function is only called for CUDA tensors, so it has no effect on tensors that are already on the CPU.