foundation-model-stack / fastsafetensors

High-performance safetensors model loader
Apache License 2.0

Fastsafetensors w/ GDS performance is much slower than safetensors #3

Closed BabyChouSr closed 1 week ago

BabyChouSr commented 3 months ago

Setting: I'm currently trying to load the Hugging Face Zephyr model, but fastsafetensors model loading is much slower than safetensors.

Please download the Hugging Face safetensors files first, then copy the resulting file paths into ZEPHYR_MODEL_PATHS.

huggingface-cli download HuggingFaceH4/zephyr-7b-beta --include "*.safetensors"
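
(Not part of the original report: if the snapshot hash differs on your machine, a glob over the cache directory can collect the shard paths instead of hardcoding them. This sketch assumes the same cache location used in the script below.)

import glob

ZEPHYR_MODEL_PATHS = sorted(glob.glob(
    "/mnt/local_storage/data/cache/huggingface/hub/"
    "models--HuggingFaceH4--zephyr-7b-beta/snapshots/*/model-*-of-00008.safetensors"
))
assert len(ZEPHYR_MODEL_PATHS) == 8, "expected 8 shards"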

Reproduction Script:

import time

import torch
from safetensors.torch import safe_open
from fastsafetensors.loader import fastsafe_open

ZEPHYR_MODEL_PATHS = [
    f"/mnt/local_storage/data/cache/huggingface/hub/models--HuggingFaceH4--zephyr-7b-beta/snapshots/b70e0c9a2d9e14bd1e812d3c398e5f313e93b473/model-0000{i}-of-00008.safetensors" for i in range(1, 9)
]

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
state_dict = {}
cloned_state_dict = {}
start = time.perf_counter()
with fastsafe_open(ZEPHYR_MODEL_PATHS, device="cuda:0", nogds=False, debug_log=True) as f:
    for k in f.get_keys():
        state_dict[k] = f.get_tensor(k)

    end = time.perf_counter()

    f.bufs.close()  # release the GDS buffers here, otherwise the safetensors run below hits OOM
print(f"fastsafe_open: {end - start} seconds")

start = time.perf_counter()
state_dict = {}
for model_path in ZEPHYR_MODEL_PATHS:
    with safe_open(model_path, framework="pt", device=0) as f:
        for k in f.keys():
            state_dict[k] = f.get_tensor(k)
end = time.perf_counter()
print(f"safe_open: {end - start} seconds")

Results:

fastsafe_open: 14.059521514000153 seconds
safe_open: 2.213277554999877 seconds

takeshi-yoshimura commented 3 months ago

Thank you for reporting the issue! To be honest, I have not tried the model with fastsafetensors, but I will analyze it and discuss what we should do later (hopefully this week).

BabyChouSr commented 3 months ago

Thank you for the prompt response and your great work!

takeshi-yoshimura commented 3 months ago

@BabyChouSr One thing that should be noted here is that safe_open's get_tensor just does mmap, while fastsafe_open actually executes the I/O. Please take a look at memory usage around the page cache, or do some calculations on the loaded tensors.
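
(A rough illustration of that suggestion, my own sketch reusing the state_dict from the reproduction script above: the reduction forces every tensor to actually be read, so lazily mmapped data would have to be paged in here.)

import torch

# Sum over every loaded tensor; with device=0 this runs on the GPU, and with
# mmapped CPU tensors it forces the pages to be read from disk.
total = sum(t.float().sum() for t in state_dict.values())
torch.cuda.synchronize()
print(total.item(), all(t.is_cuda for t in state_dict.values()))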

BabyChouSr commented 3 months ago

I performed a couple of matrix multiplications and there was not much performance difference. In terms of memory usage, here are some stats using the tensorizer.utils functions from the CoreWeave repo:

Fast safe open
Memory usage before:  CPU: (maxrss: 665MiB F: 8,089MiB) GPU: (U: 192MiB F: 22,299MiB T: 22,491MiB) TORCH: (R: 0MiB/0MiB, A: 0MiB/0MiB)
Memory usage after:  CPU: (maxrss: 924MiB F: 7,842MiB) GPU: (U: 15,062MiB F: 7,429MiB T: 22,491MiB) TORCH: (R: 14,852MiB/14,852MiB, A: 8MiB/14,836MiB)

Safe open:
Memory usage before:  CPU: (maxrss: 666MiB F: 8,170MiB) GPU: (U: 192MiB F: 22,299MiB T: 22,491MiB) TORCH: (R: 0MiB/0MiB, A: 0MiB/0MiB)
Memory usage after:  CPU: (maxrss: 2,560MiB F: 8,173MiB) GPU: (U: 14,134MiB F: 8,357MiB T: 22,491MiB) TORCH: (R: 13,942MiB/13,942MiB, A: 13,812MiB/13,812MiB)
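
(For reference, a sketch of how these lines can be produced, assuming tensorizer's get_mem_usage() helper from the CoreWeave tensorizer package, which returns a formatted CPU/GPU/torch memory string:)

from tensorizer.utils import get_mem_usage

print("Memory usage before: ", get_mem_usage())
# ... run either the fastsafe_open or the safe_open loop from the script above ...
print("Memory usage after:  ", get_mem_usage())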

It seems like safe_open does move the tensors to the GPU?

takeshi-yoshimura commented 3 months ago

@BabyChouSr With my A100 GPU, fastsafetensors outperformed safetensors with your script (I used GDS compat mode, though).

fastsafe_open: 2.685559306293726 seconds
safe_open: 3.209021440707147 seconds

Like you said, yes, safe_open actually loaded the tensors to the GPU. I overlooked the device=0 parameter for safe_open.

Please check your storage's maximum throughput (with fio or similar) and make sure to drop the page cache before your tests for an apples-to-apples comparison. Also, the PCI topology may affect GDS performance: peer-to-peer DMA can degrade performance significantly when the two devices sit under different PCI roots.
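
(A rough sketch of that measurement, my own illustration rather than part of the thread; it assumes Linux and root privileges, drops the page cache, and times a cold sequential read of one shard to estimate raw storage throughput:)

import os, time

path = ZEPHYR_MODEL_PATHS[0]  # reuse a shard path from the reproduction script

os.sync()
with open("/proc/sys/vm/drop_caches", "w") as f:
    f.write("3\n")  # 3 = drop page cache, dentries, and inodes

size = os.path.getsize(path)
start = time.perf_counter()
with open(path, "rb", buffering=0) as f:
    while f.read(64 << 20):  # read in 64 MiB chunks
        pass
elapsed = time.perf_counter() - start
print(f"{size / elapsed / 1e9:.2f} GB/s cold sequential read")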

Another possibility is that you are using a different GPU with slow device-to-device cudaMemcpy. For these files, fastsafetensors had to fix up misaligned headers for some of them, which causes a memcpy inside GPU memory.
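
(To sanity-check that point, a small timing of an intra-GPU copy, again my own sketch and not from the thread, shows the device-to-device copy bandwidth torch sees on the GPU:)

import time
import torch

x = torch.empty(1 << 30, dtype=torch.uint8, device="cuda:0")  # 1 GiB buffer
torch.cuda.synchronize()
start = time.perf_counter()
y = x.clone()  # device-to-device copy within the same GPU
torch.cuda.synchronize()
elapsed = time.perf_counter() - start
print(f"{x.numel() / elapsed / 1e9:.1f} GB/s DtoD copy")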

takeshi-yoshimura commented 1 week ago

This issue was closed because it has been inactive.