Closed lilux618 closed 2 years ago
Strange... I just re-tested on our A100 40GB PCIe, and I get consistent results. Our system uses Nvidia driver 470.103.01, ECC is enabled on the A100. I also checked on 2 other systems with A100 40GB SXM4 and results are the same. FP16C is very heavy on compute, so you might get thermal throttling. Did you check for sufficient cooling?
I am very confused
I'm confused too. 3 possibilities you can check for:
- do you have the full GPU allocated for the job, or is it split in 2 instances and you only use 1? Yes, I have the full GPU allocated for this job, and it is not split into 2 instances, as shown in my screenshot.
- hardware issue like insufficient cooling (maybe) or bad power delivery (unlikely) I don't think cooling is the problem. I have tested this A100 with the HPL and HPCG benchmarks, and the results are consistent with public results.
- check with older 470 driver, maybe the compiler on 510 was changed and gets confused (unlikely) I haven't tested this.
I have another question: even in your results list, the MLUPS with FP16C is lower than with FP16S. So in which cases should we choose FP16C instead of FP16S?
FP16S is memory compression to the hardware-supported IEEE-754 FP16 format with 1 bit for sign, 5 bits for exponent and 10 bits for mantissa. The conversion is done in hardware, so it only doubles FLOPs/Byte compared to FP32: it halves the transferred Bytes but does not need significantly more FLOPs for the conversion.
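To make the 1-5-10 bit layout concrete, here is a minimal Python sketch that round-trips a value through IEEE-754 half precision and splits the 16 bits into their fields. It only illustrates the storage format; the helper names (`to_fp16s`, `from_fp16s`, `fp16s_fields`) are made up for this example and are not taken from the project's code.

```python
import struct

def to_fp16s(x: float) -> int:
    """Return the 16-bit IEEE-754 half-precision bit pattern of x."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def from_fp16s(bits: int) -> float:
    """Decode a 16-bit IEEE-754 half-precision bit pattern."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]

def fp16s_fields(bits: int):
    """Split the pattern into (sign, exponent, mantissa) = (1, 5, 10) bits."""
    sign = bits >> 15
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x3FF
    return sign, exponent, mantissa

bits = to_fp16s(0.1)
print(hex(bits), fp16s_fields(bits), from_fp16s(bits))
```

The value that comes back differs slightly from 0.1 because only 10 mantissa bits are stored.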
FP16C is a custom floating-point format with 1 bit for sign, 4 bits for exponent and 11 bits for mantissa. This halves the truncation error compared to FP16S, so it's more accurate, though the difference is only visible in edge-case scenarios. But the conversion is not supported in hardware and has to be emulated in software, which increases FLOPs/Byte by a factor of ~8 compared to FP32. Hardware that combines very fast memory with low compute power struggles with that.
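As a rough illustration of why a 1-4-11 format needs software emulation, here is a minimal Python sketch of an encoder and decoder for such a layout. This is not the project's actual conversion routine: the exponent bias (`BIAS = 15`), reserving stored exponent 0 for zero, flushing small values to zero, and clamping large ones are all assumptions made for this example, and the function names are hypothetical.

```python
import math

EXP_BITS = 4    # exponent width from the 1-4-11 layout
MAN_BITS = 11   # mantissa width from the 1-4-11 layout
BIAS = 15       # assumption: hypothetical bias, normal range [2**-14, 2)

def encode_1_4_11(x: float) -> int:
    """Encode x into a 16-bit 1-4-11 pattern (round to nearest)."""
    sign = 1 if math.copysign(1.0, x) < 0 else 0
    a = abs(x)
    if a == 0.0:
        return sign << 15
    m, e = math.frexp(a)              # a = m * 2**e with 0.5 <= m < 1
    e -= 1                            # a = (2*m) * 2**e with 1 <= 2*m < 2
    frac = round((2.0 * m - 1.0) * (1 << MAN_BITS))
    if frac == 1 << MAN_BITS:         # rounding carried into the next binade
        frac, e = 0, e + 1
    stored = e + BIAS
    if stored < 1:                    # assumption: flush tiny values to zero
        return sign << 15
    if stored > (1 << EXP_BITS) - 1:  # assumption: clamp to the largest value
        stored, frac = (1 << EXP_BITS) - 1, (1 << MAN_BITS) - 1
    return (sign << 15) | (stored << MAN_BITS) | frac

def decode_1_4_11(bits: int) -> float:
    """Decode a 16-bit 1-4-11 pattern produced by encode_1_4_11."""
    sign = -1.0 if bits >> 15 else 1.0
    stored = (bits >> MAN_BITS) & ((1 << EXP_BITS) - 1)
    frac = bits & ((1 << MAN_BITS) - 1)
    if stored == 0:
        return sign * 0.0
    return sign * (1.0 + frac / (1 << MAN_BITS)) * 2.0 ** (stored - BIAS)

print(decode_1_4_11(encode_1_4_11(0.1)))
```

With 11 mantissa bits the spacing between neighbouring values near 1.0 is 2^-11, half of the 2^-10 spacing of the 10-bit IEEE format, which is where the halved truncation error comes from. The extra integer and floating-point steps per value hint at why the emulated conversion costs more compute than the hardware path.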
Bottom line, use
For more details, see this paper.
I ran the executable on an A100 40GB and got the following performance: it gives only 6105 MLUPS with FP16C but 14563 MLUPS with FP16S. This is not consistent with the results in the table in the README.