tonylek opened this issue 3 days ago
Could you please share your network hardware, PCIe or NVLink? What's the bandwidth? Tensor parallelism introduces extra allreduce operations; if the communication overhead is large, it can even be a negative optimization.
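To put rough numbers on that overhead (a back-of-envelope sketch; the hidden size and layer count are assumptions taken from starcoder2-3b's config.json and worth double-checking):

```bash
# Approximate allreduce volume per generated token for Megatron-style TP=2
# (2 allreduces per decoder layer: attention out-proj + MLP down-proj).
# Assumed starcoder2-3b config: hidden_size=3072, 30 layers, fp16 activations, batch 1.
HIDDEN=3072; LAYERS=30; BYTES_PER_ELEM=2
PER_TOKEN=$(( HIDDEN * BYTES_PER_ELEM * 2 * LAYERS ))
echo "~$(( PER_TOKEN / 1024 )) KiB allreduced per generated token"
# The volume is tiny, so over PCIe the cost is dominated by the latency of
# ~60 small allreduce ops per token, paid again for every output token.
```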
This is the GPU: https://d1.awsstatic.com/product-marketing/ec2/NVIDIA_AWS_A10G_DataSheet_FINAL_02_17_2022.pdf
From what I understand it is PCIe with 64 GB/s. Do you think this is the reason? Is there anything I can do?
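If it helps, one way to confirm the interconnect between the two GPUs directly on the instance (on A10G/g5 instances this normally shows a PCIe path and no active NVLink):

```bash
# Show how the GPUs are connected (PCIe bridge/host vs NVLink)
nvidia-smi topo -m
# Per-GPU NVLink status; empty/inactive output means no NVLink
nvidia-smi nvlink --status
```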
I also had to pass `--use_custom_all_reduce disable`, because I got errors if I didn't.
Hi, I'm running on AWS A10G GPUs and I'm trying to benchmark different setups.
I tried sharding the model across 2 GPUs to make it faster, but I'm getting the same latency. Does this make sense? The model I'm using is starcoder2-3b: https://huggingface.co/bigcode/starcoder2-3b
My conversion scripts are the following:
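For reference, a TensorRT-LLM fp16, TP=2 conversion and engine build along these lines would look roughly like the sketch below; the example path, directories and flag set are assumptions and vary across TensorRT-LLM versions, so treat it as a sketch rather than the exact script used here.

```bash
# Sketch: convert a local HF checkout of bigcode/starcoder2-3b with
# tensor parallelism 2, then build one engine per rank.
python3 examples/gpt/convert_checkpoint.py \
    --model_dir ./starcoder2-3b \
    --output_dir ./starcoder2-3b-ckpt-tp2 \
    --dtype float16 \
    --tp_size 2

# Build step; --use_custom_all_reduce disable is the workaround mentioned earlier in the thread.
trtllm-build \
    --checkpoint_dir ./starcoder2-3b-ckpt-tp2 \
    --output_dir ./starcoder2-3b-engine-tp2 \
    --gemm_plugin float16 \
    --gpt_attention_plugin float16 \
    --max_batch_size 1 \
    --max_input_len 1600 \
    --max_output_len 200 \
    --use_custom_all_reduce disable
```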
The requests I'm testing have 1600 input tokens and 200 output tokens. The latency I'm getting is 568 ms on a single GPU and 549 ms on 2 GPUs.
When quantizing the model, I actually get better latency when not sharding. This is my conversion script for quantization to int8:
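Again only as a rough sketch of what an int8 weight-only conversion can look like (flag names depend on the TensorRT-LLM version; the engine build step is the same as in the TP=2 sketch above, just without sharding):

```bash
# Sketch: int8 weight-only quantization of the same checkpoint, single GPU (no --tp_size).
python3 examples/gpt/convert_checkpoint.py \
    --model_dir ./starcoder2-3b \
    --output_dir ./starcoder2-3b-ckpt-int8 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8
```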
1 GPU - 416 ms, 2 GPUs - 463 ms
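For completeness, the TP=2 engine has to be launched with one process per GPU; a minimal way to exercise it and get a crude timing, assuming TensorRT-LLM's examples/run.py and placeholder paths:

```bash
# Sanity-check run of the 2-GPU engine (one MPI rank per GPU).
# This times engine load plus a single request, so it is only a rough check,
# not a substitute for a proper latency benchmark.
time mpirun -n 2 --allow-run-as-root \
  python3 examples/run.py \
    --engine_dir ./starcoder2-3b-engine-tp2 \
    --tokenizer_dir ./starcoder2-3b \
    --max_output_len 200 \
    --input_text "def fibonacci(n):"
```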