IBM / text-generation-inference

IBM development fork of https://github.com/huggingface/text-generation-inference
Apache License 2.0

use fastsafetensors #99

Open takeshi-yoshimura opened 3 months ago

takeshi-yoshimura commented 3 months ago

Motivation

Fastsafetensors (https://github.com/foundation-model-stack/fastsafetensors/) significantly reduces the model loading time of .safetensors files on NVMe SSDs by parallelizing file copies for maximum throughput and offloading tensor sharding and type conversions to GPUs. Additionally, Fastsafetensors supports GPU Direct Storage (GDS) to further optimize cold startups from NVMe SSDs.

For instance, in a comparison between Fastsafetensors and the current loader, the cold startup time of bigscience/bloom with 4 NVMe SSDs and 8 A100 GPUs was reduced from 95 seconds to 28.4 seconds. Similarly, the hot startup time (loading files from the page cache) of llama-2-13b-chat was reduced from 8.7 seconds to 4.7 seconds with a single GPU.

Modifications

Fastsafetensors is provided as a PyPI package. It includes a compatible Weight class for loading weights (fastsafetensors.connectors.tgis_weights.Weight). I modified tgis_native.py to swap in this Weight class based on environment variables, since I did not want to make large code changes to add fastsafetensors configuration options. Currently, the class implements get_tensor, get_sharded, and the other methods, except for the quantization paths.
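
As a rough illustration of the environment-variable switch, the loader selection in tgis_native.py amounts to something like the sketch below. This is a minimal sketch, not the exact PR code: the variable name USE_FASTSAFETENSORS and the fallback import path are assumptions.

```python
import os

# Minimal sketch of the loader switch (illustrative only; the actual variable
# name and import paths in the PR may differ).
if os.getenv("USE_FASTSAFETENSORS", "0") == "1":
    # Weight class shipped with the fastsafetensors PyPI package
    from fastsafetensors.connectors.tgis_weights import Weight as Weights
else:
    # existing TGIS weight loader
    from text_generation_server.utils.weights import Weights

# Call sites keep constructing and using `Weights` exactly as before.
```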

I added two tuning knobs for the number of copy threads and the bounce buffer size. They should be tuned according to the number of CPUs available in a NUMA node and the size of the CPU cache, which matters especially for hot startups, i.e., when files are already in DRAM. My experience on Ice Lake servers suggests that the bounce buffer size should match the L2 cache size (160 MB in this case).
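
For illustration, the two knobs could be read along these lines. The names FST_NUM_THREADS and FST_BOUNCE_BUFFER_MB are placeholders, not necessarily the variables used in the PR; FST_NOGDS is mentioned later in this thread.

```python
import os

# Placeholder names for illustration; the PR's actual variable names may differ.
num_copy_threads = int(os.getenv("FST_NUM_THREADS", "16"))        # copy threads per rank
bounce_buffer_mb = int(os.getenv("FST_BOUNCE_BUFFER_MB", "160"))  # ~L2 cache size on the Ice Lake test servers
use_gds = os.getenv("FST_NOGDS", "0") != "1"                      # FST_NOGDS=1 disables GPU Direct Storage
```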

The package depends on libcufile.so and libnuma.so, so we need to modify the Dockerfile to include them in the final image.

For tensor-parallel inference, Fastsafetensors first loads the multiple files in a round-robin fashion across the GPUs. It then uses torch.distributed methods to obtain or shard weights from the rank that holds them. As a result, every Weight method, including get_tensor, may load tensors from a remote rank. This requires that both the sender rank and the getter rank invoke the call, unlike before. Most cases do not require modifying existing classes, but TensorParallelRowLinear has an if-else branch that calls get_tensor only on rank 0. To address this, I changed that get_tensor call to push_tensor, which specifies a sender rank and a receiver rank (while the other ranks do nothing).
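
To illustrate the push_tensor idea, here is a minimal sketch of the sender/receiver pattern, assuming an already-initialized torch.distributed process group. This is not the fastsafetensors implementation, and the signature is invented for illustration.

```python
import torch
import torch.distributed as dist

def push_tensor(local_tensor, shape, dtype, src_rank, dst_rank):
    """Illustration of the pattern only, not the fastsafetensors code.

    The rank that holds the tensor sends it, the destination rank receives it,
    and every other rank returns immediately. Unlike the old rank-0-only
    get_tensor branch, both endpoints must reach this call, otherwise the
    point-to-point operation blocks. Shape and dtype are assumed to be known
    on every rank (e.g. from the safetensors header).
    """
    rank = dist.get_rank()
    if src_rank == dst_rank:
        return local_tensor if rank == src_rank else None
    if rank == src_rank:
        dist.send(local_tensor.contiguous(), dst=dst_rank)
        return local_tensor
    if rank == dst_rank:
        buf = torch.empty(shape, dtype=dtype, device="cuda")
        dist.recv(buf, src=src_rank)
        return buf
    return None  # uninvolved ranks neither send nor receive
```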

Result

As described in the Motivation section, we observed speedups in both cold and hot startups of TGIS with various models. So far, I have tested with falcon-40b, bloom, and llama.

However, GDS limits the optimizations available via PYTORCH_CUDA_ALLOC_CONF. Currently, we cannot set it to "expandable_segments:True", because the cuMemMap-based allocation it relies on is not available with peer-to-peer transfers.

Additionally, using tensor parallelism with Fastsafetensors can cause deadlocks during memory estimation. In my experiments, I always set ESTIMATE_MEMORY=off to avoid this issue. The cause is that Fastsafetensors can leave different amounts of GPU memory consumed on different ranks, which confuses the out-of-memory detection logic: some ranks may keep running forward() while others do not, so get_tensor and the other methods are no longer called in the same order on every rank. The memory estimation seems to assume that every GPU consumes memory evenly, which is not always the case with Fastsafetensors.

Related Issues

NA

njhill commented 3 months ago

Thanks @takeshi-yoshimura for the great work!

I have a few initial thoughts/questions:

I feel it would be better for the fastsafetensors library to be fully independent and not contain any inference server specific code, i.e. include the code from https://github.com/foundation-model-stack/fastsafetensors/blob/main/fastsafetensors/connectors/tgis_weights.py as a file in this PR.

Currently, the class implements get_tensor, get_sharded, and the other methods, except for the quantization paths.

Does this mean it won't work for quantized models? Is this particularly difficult or just something that you haven't implemented yet?

However, GDS limits the optimizations available via PYTORCH_CUDA_ALLOC_CONF. Currently, we cannot set it to "expandable_segments:True", because the cuMemMap-based allocation it relies on is not available with peer-to-peer transfers.

This optimization is only beneficial in some cases (for example, I don't think it's helpful when using paged attention). But in the cases where it is effective, we probably wouldn't want to use the faster loading, because runtime performance is more important.

Additionally, using tensor parallelism with Fastsafetensors can cause deadlocks during memory estimation. In my experiments, I always set ESTIMATE_MEMORY=off to avoid this issue.

I think this is more of a problem and we'd need to understand and resolve it. Is the GPU memory imbalance you mentioned present just during the model loading phase, or does it remain the case at runtime? (I don't really understand whether it's the latter.)

I added two tuning knobs for the number of copy threads and the bounce buffer size. They should be tuned according to the number of CPUs available in a NUMA node and the size of the CPU cache

Could these be auto-configured? Ideally this would not be something that users have to tune on a case-by-case basis.

takeshi-yoshimura commented 3 months ago

Thanks @njhill.

I feel it would be better for the fastsafetensors library to be fully independent and not contain any inference server specific code, i.e. include the code from https://github.com/foundation-model-stack/fastsafetensors/blob/main/fastsafetensors/connectors/tgis_weights.py as a file in this PR.

I have just pushed the change, including an upgrade of fastsafetensors to 0.1.1 for minor bug fixes. I noticed that the Weight class included here has diverged slightly from the one that fastsafetensors ships.

Does this mean it won't work for quantized models? Is this particularly difficult or just something that you haven't implemented yet?

I just haven't implemented it yet, and I believe there are no technical difficulties. The Weight class currently raises a NotImplementedError when get_multi_weights_col or get_multi_weights_row is called with quantization parameters.
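
In other words, the current behavior amounts to something like the following hypothetical sketch; the exact method signatures in the connector may differ.

```python
class Weight:
    # Sketch of the behavior described above, not the actual connector code;
    # the parameter names are assumptions.
    def get_multi_weights_col(self, prefixes, quantize, dim):
        if quantize:
            raise NotImplementedError("quantized loading is not supported by the fastsafetensors loader yet")
        # ... non-quantized column-parallel loading would go here ...

    def get_multi_weights_row(self, prefix, quantize):
        if quantize:
            raise NotImplementedError("quantized loading is not supported by the fastsafetensors loader yet")
        # ... non-quantized row-parallel loading would go here ...
```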

But in the cases where it is effective, we probably wouldn't want to use the faster loading, because runtime performance is more important.

In that case, users can simply turn off GDS by setting FST_NOGDS=1. Fastsafetensors still provides optimized file I/O and GPU offloading without GDS.

I think this is more of a problem and we'd need to understand and resolve it. Is the GPU memory imbalance you mentioned present just during the model loading phase, or does it remain the case at runtime? (I don't really understand whether it's the latter.)

If I remember correctly, the failure happened after the model loading phase, at https://github.com/IBM/text-generation-inference/blob/main/server/text_generation_server/utils/memory_characterizer.py#L360. fastsafetensors returns unused GPU memory to torch's CUDACachingAllocator, so it should not cause imbalance or memory leaks. But it may cause fragmentation and raise an OOM at model loading with TP.

I encountered the problem only when I tried bloom with TP=8; I could not reproduce the issue with llama-13b and falcon-40b at TP=2. I will investigate the memory trace later to confirm whether this happens with the current TGIS and fastsafetensors.

Could these be auto-configured? Ideally this would not be something that users have to tune on a case-by-case basis.

Yes, we can provide an auto-tuning configuration via the environment variable FST_CONFIG=auto, which sets these values according to the number of CPUs and the cache sizes. I will add the feature later.
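
For illustration, such an auto-configuration might look roughly like this. This is a sketch only; the helper name and heuristics are assumptions, and the real implementation would likely also read cache sizes, e.g. from sysfs.

```python
import os

def auto_configure():
    # Sketch of what FST_CONFIG=auto could do (assumption, not the shipped code):
    # derive the copy-thread count from the CPUs visible to the process and pick
    # a bounce-buffer size on the order of the shared CPU cache.
    cpus = len(os.sched_getaffinity(0)) if hasattr(os, "sched_getaffinity") else os.cpu_count()
    num_copy_threads = max(1, cpus // 2)  # leave headroom for the serving process
    bounce_buffer_mb = 160                # e.g. the ~160 MB cache size observed on the Ice Lake test servers
    return num_copy_threads, bounce_buffer_mb

if os.getenv("FST_CONFIG", "") == "auto":
    num_copy_threads, bounce_buffer_mb = auto_configure()
```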

takeshi-yoshimura commented 3 months ago

I have added an auto-configuration. Thanks.