bigscience-workshop / Megatron-DeepSpeed

Ongoing research training transformer language models at scale, including: BERT & GPT-2

Slower inference results for BLOOM fp16 on identical hardware #348

Open sarthaklangde opened 2 years ago

sarthaklangde commented 2 years ago

Hey,

Thank you for the scripts for loading checkpoints and running benchmarks. I am running into a strange issue: ds_inference fp16 throughput is noticeably slower than the published results, while the int8 benchmark results are almost identical.

Environment: GCP a2-ultragpu-8g with 8x A100 80GB, 1.3 TB memory, 96 vCPUs, Debian 11

For fp16 and batch size 1, I measure 67 msecs/token, whereas the published benchmarks indicate 44 msecs/token should be achievable. The same gap appears at higher batch sizes too.

For int8, however, the results exactly match the ones in the published benchmarks (both for 8x80GB and 4x80GB).
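For reference, here is a minimal sketch of how a msecs/token number like the above can be measured. The `model` and `tokenizer` objects and the generate arguments are placeholders, not the repo's benchmark script:

```python
# Minimal sketch of a msecs/token measurement for batch size 1.
# `model` (e.g. a DeepSpeed-inference-wrapped BLOOM) and `tokenizer` are assumed
# to be loaded already; this is a placeholder, not the repo's benchmark script.
import time
import torch

def msecs_per_token(model, tokenizer, prompt, new_tokens=100):
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    model.generate(**inputs, max_new_tokens=5)      # warm-up pass
    torch.cuda.synchronize()
    start = time.perf_counter()
    model.generate(**inputs, max_new_tokens=new_tokens, min_new_tokens=new_tokens)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return elapsed * 1000 / new_tokens              # msecs per generated token
```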

What have I tried until now?

  1. Different CUDA versions (11.0, 11.4, 11.6, 11.7), PyTorch versions, DeepSpeed versions (0.7.0, 0.7.2, 0.7.3)
  2. Reinstalling environment from scratch on a new server

Any idea what I might be doing wrong? Or is everyone else seeing similar throughput?

mayank31398 commented 2 years ago

@sarthaklangde I have the same issue. I believe this might be due to the internal PCIe tree topology being different. @stas00 FYI

mayank31398 commented 2 years ago

I don't believe it has anything to do with your environment.

stas00 commented 2 years ago

My tests were run on the JeanZay HPC, so it's possible their servers are simply beefier hardware-wise.

It is interesting that you both report the same speed with int8.

@RezaYazdaniAminabadi, do you by chance have any insight into why this might be the case? What throughput do you get on your Azure nodes for bs=1, so that we have another point of comparison?

There are two hardware versions of the A100: SXM and PCIe. Do we know whether these A100s are all SXM and not PCIe, by chance? The latter are slower.
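One quick way to check (a sketch, assuming torch with CUDA and the nvidia-smi binary are available on the node): the CUDA device name distinguishes the SXM and PCIe variants, and `nvidia-smi topo -m` shows whether the GPUs talk over NVLink/NVSwitch (NV# entries) or PCIe paths (PIX/PXB/PHB/NODE/SYS):

```python
# Sketch: check the A100 form factor (SXM vs PCIe) and the GPU interconnect.
# Assumes torch with CUDA and the nvidia-smi binary on PATH.
import subprocess
import torch

for i in range(torch.cuda.device_count()):
    # e.g. "NVIDIA A100-SXM4-80GB" vs "NVIDIA A100-PCIE-80GB"
    print(i, torch.cuda.get_device_name(i))

# Link matrix: NV# means NVLink/NVSwitch, PIX/PXB/PHB/NODE/SYS means a PCIe path.
print(subprocess.run(["nvidia-smi", "topo", "-m"],
                     capture_output=True, text=True).stdout)
```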

StochasticRomanAgeev commented 2 years ago

Could this be due to slow communication between GPUs?

After profiling, it turns out that communication takes up 66% of the time, handled by ncclKernel_AllReduce_RING_LL_Sum_half(ncclWorkElem). Is it correct that with 8 GPUs this all-to-all communication is expected?
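To isolate the interconnect from the model, I ran something like the following standalone all-reduce micro-benchmark (a rough sketch, assuming the NCCL backend and a torchrun launch; the tensor size and iteration count are arbitrary choices):

```python
# Rough sketch: time NCCL all-reduce across all 8 GPUs to estimate interconnect
# bandwidth independently of the model. Launch with e.g.:
#   torchrun --nproc_per_node=8 allreduce_bench.py
import os
import time
import torch
import torch.distributed as dist

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# 256 MB fp16 tensor; the size is an arbitrary choice for a bandwidth estimate.
x = torch.zeros(128 * 1024 * 1024, dtype=torch.float16, device="cuda")

for _ in range(5):                                  # warm-up
    dist.all_reduce(x)
torch.cuda.synchronize()

iters = 20
start = time.perf_counter()
for _ in range(iters):
    dist.all_reduce(x)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

if dist.get_rank() == 0:
    gb = x.numel() * x.element_size() / 1e9
    print(f"{iters} all-reduces of {gb:.2f} GB: {elapsed / iters * 1000:.1f} ms each")
```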

Could you please tell us what environment variables you are using, as the variables from here do not help?
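For debugging, the generic NCCL knobs can also be turned on before launch; a sketch (these are standard NCCL variables, not settings taken from this repo's scripts):

```python
# Sketch: generic NCCL debugging knobs, set before torch.distributed/DeepSpeed
# initializes (equivalently exported in the shell before launching).
import os

os.environ["NCCL_DEBUG"] = "INFO"               # log which transports NCCL selects
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,GRAPH"  # include topology/graph detection details
# os.environ["NCCL_P2P_DISABLE"] = "1"          # comparison run only: force NCCL off P2P/NVLink
```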

stas00 commented 1 year ago

Unfortunately I no longer have access to JeanZay so I can't retrieve any more data at the moment.

Could this be due to slow communication between GPUs?

That's very possible. If you think about it, this could also be an issue of PCIe vs NVLink (or even NVSwitch) and their generations.