Here is how we do the copy, if you are interested: https://github.com/FMInference/FlexGen/blob/0342e2a0e93593b2c11f84be0e9f5d5bcb73e598/flexgen/pytorch_backend.py#L790-L797
We do not use any fancy methods beyond pin_memory and asynchronous copy, but we do implement a nice and general interface for tensors that can be stored on CPU/GPU/disk: https://github.com/FMInference/FlexGen/blob/0342e2a0e93593b2c11f84be0e9f5d5bcb73e598/flexgen/pytorch_backend.py#L54-L59
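For readers who want the general pattern rather than FlexGen's exact code, here is a minimal sketch of pinned memory plus an asynchronous copy on a side stream; the function name and stream setup are just illustrative, not FlexGen's API:

```python
import torch

# Side stream so the device-to-host copy can overlap with compute.
copy_stream = torch.cuda.Stream()

def async_gpu_to_cpu(src_gpu: torch.Tensor) -> torch.Tensor:
    # Pinned (page-locked) CPU memory is required for the copy to be
    # truly asynchronous with non_blocking=True.
    dst_cpu = torch.empty(src_gpu.shape, dtype=src_gpu.dtype,
                          device="cpu", pin_memory=True)
    with torch.cuda.stream(copy_stream):
        dst_cpu.copy_(src_gpu, non_blocking=True)
    # Caller must synchronize copy_stream before reading dst_cpu.
    return dst_cpu
```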
@Ying1123
Excuse me, I want to know whether the data on the GPU is released when using general_copy (assuming dst is on the CPU). I think using general_copy() may cause a GPU memory leak, because I don't see any code releasing the source data on the GPU.
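For context, in plain PyTorch (independent of FlexGen's code) a device-to-host copy does not free the GPU source by itself; the GPU allocation is returned to the caching allocator only once the last reference to the source tensor is dropped. A minimal sketch:

```python
import torch

x_gpu = torch.randn(1024, 1024, device="cuda")  # allocated on the GPU
x_cpu = x_gpu.to("cpu")                          # copy; GPU tensor still allocated

del x_gpu                    # drop the last reference; allocator can reuse the memory
torch.cuda.empty_cache()     # optionally release cached blocks back to the driver
```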
Really cool work. I am trying to optimize CPU/GPU transfer of attention cache tensors for a large language model that I run on multiple GPUs. I also don't need to use disk and don't need to keep parts of the same tensor on different devices, so I don't think I can use FlexGen out of the box; I am just trying to understand whether your code is much faster at copying tensors between CPU and GPU than basic .cpu() and .cuda(). If so, is there a place in the codebase with faster CPU/GPU copying utilities? Or do you have a general strategy (like always pin_memory() and then use non_blocking=True, or try to reuse pre-allocated buffers)? A sketch of what I mean is below.