Closed BabyChouSr closed 1 week ago
Thank you for reporting the issue! To be honest, I have not tried the model with fastsafetensors, but I will analyze it and discuss what we should do later (hopefully this week).
Thank you for the prompt response and your great work!
@BabyChouSr One thing that should be noted here is that safe_open's get_tensor just does mmap, while fastsafe_open actually executes I/O. Please take a look at memory usage around the page cache, or do some calculation on the loaded tensors.
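To illustrate why a plain mmap can look fast in a benchmark, here is a minimal sketch using only the Python standard library. It is not safetensors or fastsafetensors code; `touch_every_page` is a hypothetical helper name. The point is that creating a mapping is cheap, and the file data is typically only read once the mapped pages are actually accessed.

```python
import mmap


def touch_every_page(path):
    """Map a non-empty file read-only and read one byte from each page.

    Hypothetical helper for illustration only. Returns the sum of the
    sampled bytes so the reads cannot be skipped.
    """
    with open(path, "rb") as f:
        # Creating the mapping typically does not read the file contents.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Indexing the mapping is what triggers page faults, and disk
            # I/O if the pages are not already in the page cache.
            return sum(mapped[i] for i in range(0, len(mapped), mmap.PAGESIZE))
```

Timing only the mapping step would therefore understate the real load cost, which is why doing some computation on the loaded tensors gives a fairer comparison.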
I performed a couple of matrix multiplications and there was not much performance difference. In terms of memory usage, here are some stats using the tensorizer.utils function from the CoreWeave repo:
Fast safe open:
Memory usage before: CPU: (maxrss: 665MiB F: 8,089MiB) GPU: (U: 192MiB F: 22,299MiB T: 22,491MiB) TORCH: (R: 0MiB/0MiB, A: 0MiB/0MiB)
Memory usage after: CPU: (maxrss: 924MiB F: 7,842MiB) GPU: (U: 15,062MiB F: 7,429MiB T: 22,491MiB) TORCH: (R: 14,852MiB/14,852MiB, A: 8MiB/14,836MiB)
Safe open:
Memory usage before: CPU: (maxrss: 666MiB F: 8,170MiB) GPU: (U: 192MiB F: 22,299MiB T: 22,491MiB) TORCH: (R: 0MiB/0MiB, A: 0MiB/0MiB)
Memory usage after: CPU: (maxrss: 2,560MiB F: 8,173MiB) GPU: (U: 14,134MiB F: 8,357MiB T: 22,491MiB) TORCH: (R: 13,942MiB/13,942MiB, A: 13,812MiB/13,812MiB)
It seems like safe_open does move the tensors to the GPU?
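For readability, the before/after differences in the stats above can be worked out with a few lines of Python. This is only arithmetic on the numbers already posted, in MiB. It assumes the second "Fast safe open" line is the "after" measurement, and `usage_deltas` is a hypothetical helper name.

```python
# (before, after) pairs in MiB, copied from the stats posted above.
STATS = {
    "fastsafe_open": {"cpu_maxrss": (665, 924), "gpu_used": (192, 15062)},
    "safe_open": {"cpu_maxrss": (666, 2560), "gpu_used": (192, 14134)},
}


def usage_deltas(stats):
    """Return after-minus-before for every metric of every loader."""
    return {
        loader: {metric: after - before for metric, (before, after) in metrics.items()}
        for loader, metrics in stats.items()
    }


for loader, delta in usage_deltas(STATS).items():
    print(loader, delta)
```

Both loaders grow GPU usage by roughly 14 GiB, which is consistent with both of them placing the tensors on the GPU, while the CPU maxrss growth is much larger for safe_open.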
@BabyChouSr With my A100 GPU, fastsafetensors outperformed safe_open with your script (I used GDS compat mode, though).
fastsafe_open: 2.685559306293726 seconds
safe_open: 3.209021440707147 seconds
Like you said, yes, safe_open actually loaded tensors to a GPU. I overlooked the parameter device=0 for safe_open.
Please check your storage's maximum throughput (with fio or something similar) and make sure to drop the page cache before your tests for an apples-to-apples comparison. Also, the PCI topology may affect GDS performance. Peer-to-peer DMA can degrade performance considerably when the two devices are on different PCI roots.
Another possibility is that you are using a different GPU with slow device-to-device (DtoD) cudaMemcpy. For some files, fastsafetensors fixes up misaligned headers, which causes a memcpy inside GPU memory.
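As a rough sketch of the cold-cache measurement suggested above, the helpers below drop the Linux page cache and time a single call. Both `drop_page_cache` and `timed` are hypothetical names, not part of fastsafetensors or safetensors. Writing to `/proc/sys/vm/drop_caches` is Linux-specific and needs root, so the helper reports failure instead of raising.

```python
import os
import time


def drop_page_cache(path="/proc/sys/vm/drop_caches"):
    """Try to drop the Linux page cache; return True on success.

    Assumes Linux and root privileges. Returns False if the write fails.
    """
    try:
        os.sync()  # flush dirty pages first so they can be dropped
        with open(path, "w") as f:
            f.write("3\n")  # 3 = page cache plus dentries and inodes
        return True
    except OSError:
        return False


def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

A fair comparison would call `drop_page_cache()` before each loader run, check that it returned True, and then wrap each loader in `timed(...)`, so that neither run benefits from file data cached by the other.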
This issue was closed because it has been inactive.
Setting: I'm currently trying to load the Hugging Face Zephyr model, but model loading with fastsafetensors is much slower.
Please first download the Hugging Face file and then copy the filename into the ZEPHYR_MODEL_PATHS location.
Reproduction Script:
Results: