Here is how we do the copy, if you are interested: https://github.com/FMInference/FlexGen/blob/0342e2a0e93593b2c11f84be0e9f5d5bcb73e598/flexgen/pytorch_backend.py#L790-L797
We do not use any fancy methods beyond pin_memory and asynchronous copy, but we do implement a nice and general interface for tensors that can be stored on CPU/GPU/disk: https://github.com/FMInference/FlexGen/blob/0342e2a0e93593b2c11f84be0e9f5d5bcb73e598/flexgen/pytorch_backend.py#L54-L59
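For readers who want the general pattern rather than FlexGen's exact code, here is a minimal sketch of pinned memory plus an asynchronous copy on a side stream; the function name and stream setup are just illustrative, not FlexGen's API:

```python
import torch

# Side stream so the device-to-host copy can overlap with compute.
copy_stream = torch.cuda.Stream()

def async_gpu_to_cpu(src_gpu: torch.Tensor) -> torch.Tensor:
    # Pinned (page-locked) CPU memory is required for the copy to be
    # truly asynchronous with non_blocking=True.
    dst_cpu = torch.empty(src_gpu.shape, dtype=src_gpu.dtype,
                          device="cpu", pin_memory=True)
    with torch.cuda.stream(copy_stream):
        dst_cpu.copy_(src_gpu, non_blocking=True)
    # Caller must synchronize copy_stream before reading dst_cpu.
    return dst_cpu
```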
@Ying1123
Excuse me, I want to know whether the data on the GPU is released when using general_copy (assuming dst is on the CPU). I think using general_copy() may cause a GPU memory leak, because I don't see any code releasing the source data on the GPU.
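For context, in plain PyTorch (independent of FlexGen's code) a device-to-host copy does not free the GPU source by itself; the GPU allocation is returned to the caching allocator only once the last reference to the source tensor is dropped. A minimal sketch:

```python
import torch

x_gpu = torch.randn(1024, 1024, device="cuda")  # allocated on the GPU
x_cpu = x_gpu.to("cpu")                          # copy; GPU tensor still allocated

del x_gpu                    # drop the last reference; allocator can reuse the memory
torch.cuda.empty_cache()     # optionally release cached blocks back to the driver
```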
Really cool work. I am trying to optimize CPU/GPU transfer of attention cache tensors for a large language model that I run on multiple GPUs. I also don't need to use disk and don't need to keep parts of the same tensor on different devices, so I don't think I can use FlexGen out of the box; I am just trying to understand whether your code is much faster at copying tensors between CPU and GPU than basic .cpu() and .cuda(). If so, is there a place in the codebase with faster CPU/GPU copying utilities? Or do you have a general strategy (like always pin_memory() and then use non_blocking=True, or try to reuse pre-allocated buffers)? A sketch of what I mean is below.