ProjectPhysX / FluidX3D

The fastest and most memory efficient lattice Boltzmann CFD software, running on all GPUs via OpenCL. Free for non-commercial use.
https://youtube.com/@ProjectPhysX

The performance of FP16C is not as high as mentioned in the README. #2

Closed · lilux618 closed this issue 2 years ago

lilux618 commented 2 years ago

I ran the exe on an A100 40G and got the performance shown below: it gives only 6105 MLUPS with FP16C but 14563 MLUPS with FP16S. This is not consistent with the results in the table in the README.

[screenshot: benchmark output, 2022-09-01]
ProjectPhysX commented 2 years ago

Strange... I just re-tested on our A100 40GB PCIe, and I get consistent results. Our system uses Nvidia driver 470.103.01, ECC is enabled on the A100. I also checked on 2 other systems with A100 40GB SXM4 and results are the same. FP16C is very heavy on compute, so you might get thermal throttling. Did you check for sufficient cooling?

[screenshot: benchmark output on the A100]

lilux618 commented 2 years ago
[two screenshots]

I am very confused

ProjectPhysX commented 2 years ago

I'm confused too. 3 possibilities you can check for:

  • do you have the full GPU allocated for the job, or is it split in 2 instances and you only use 1?
  • hardware issue like insufficient cooling (maybe) or bad power delivery (unlikely)
  • check with older 470 driver, maybe the compiler on 510 was changed and gets confused (unlikely)

lilux618 commented 2 years ago

  • "do you have the full GPU allocated for the job, or is it split in 2 instances and you only use 1?" Yes, I have the full GPU allocated for this job, and it is not split into 2 instances, as the screenshot above shows.
  • "hardware issue like insufficient cooling (maybe) or bad power delivery (unlikely)" I don't think cooling is the problem. I have tested this A100 with the HPL and HPCG benchmarks, and the results are consistent with publicly reported numbers.
  • "check with older 470 driver, maybe the compiler on 510 was changed and gets confused (unlikely)" I haven't tested this yet.
lilux618 commented 2 years ago

I have another question: even in your results list, the MLUPS with FP16C is lower than with FP16S. So in which cases should we choose FP16C instead of FP16S?

ProjectPhysX commented 2 years ago

FP16S is memory compression to the hardware-supported IEEE-754 FP16 format with 1 bit for sign, 5 bits for exponent and 10 bits for mantissa. The conversion is done in hardware, so it only doubles FLOPs/Byte compared to FP32: it halves the transferred bytes but does not need significantly more FLOPs for the conversion.
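
To make the FLOPs/Byte argument concrete, here is a minimal C++ sketch (my own simplification, not FluidX3D's actual code) of what an FP32-to-IEEE-FP16 conversion does at the bit level, truncation only, with rounding, subnormals, Inf and NaN ignored. On the GPU this entire sequence is a single hardware instruction (e.g. vload_half/vstore_half in OpenCL), which is why FP16S adds almost no compute cost on top of halving the memory traffic:

```c++
#include <cstdint>
#include <cstring>

// Minimal FP32 <-> IEEE-754 FP16 (1 sign, 5 exponent, 10 mantissa bits) conversion.
// Simplified sketch: truncates the mantissa, no rounding/subnormal/Inf/NaN handling.
uint16_t fp32_to_fp16s(const float x) {
	uint32_t b; std::memcpy(&b, &x, 4);                          // reinterpret float bits
	const uint32_t sign = (b>>16) & 0x8000u;                     // move sign to bit 15
	const int32_t  exp  = (int32_t)((b>>23) & 0xFFu) - 127 + 15; // re-bias exponent (8 -> 5 bits)
	const uint32_t man  = (b>>13) & 0x3FFu;                      // keep top 10 of 23 mantissa bits
	if(exp<= 0) return (uint16_t)sign;                           // flush small values to +/-0
	if(exp>=31) return (uint16_t)(sign|0x7BFFu);                 // clamp large values to max finite
	return (uint16_t)(sign | ((uint32_t)exp<<10) | man);
}
float fp16s_to_fp32(const uint16_t h) {
	const uint32_t sign = ((uint32_t)h & 0x8000u)<<16;
	const uint32_t exp  = ((uint32_t)h>>10) & 0x1Fu;
	const uint32_t man  = (uint32_t)h & 0x3FFu;
	const uint32_t b = (exp==0u) ? sign : sign | ((exp-15u+127u)<<23) | (man<<13);
	float x; std::memcpy(&x, &b, 4);
	return x;
}
```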

FP16C is a custom floating-point format with 1 bit for sign, 4 bits for exponent and 11 bits for mantissa. This halves the truncation error compared to FP16S, so it is more accurate, although the difference is only visible in edge-case scenarios. But the conversion is not supported in hardware and has to be emulated in software, which increases FLOPs/Byte by a factor of ~8 compared to FP32. Hardware with very fast memory but low compute power struggles with that.
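
And for comparison, a sketch of the software-emulated packing for a 1-4-11 bit layout like FP16C. The exponent bias of 7 and the truncation-only behavior are my assumptions for illustration; the real implementation additionally handles rounding and subnormals, which is where most of the extra per-value arithmetic comes from. Unlike the FP16S case above, every shift, mask and add here runs as an actual instruction on the GPU:

```c++
#include <cstdint>
#include <cstring>

// Illustrative FP32 -> custom 16-bit (1 sign, 4 exponent, 11 mantissa bits) conversion.
// Assumed exponent bias of 7; rounding and subnormal handling omitted for brevity.
uint16_t fp32_to_fp16c(const float x) {
	uint32_t b; std::memcpy(&b, &x, 4);
	const uint32_t sign = (b>>16) & 0x8000u;                     // sign -> bit 15
	const int32_t  exp  = (int32_t)((b>>23) & 0xFFu) - 127 + 7;  // re-bias exponent (8 -> 4 bits)
	const uint32_t man  = (b>>12) & 0x7FFu;                      // keep top 11 of 23 mantissa bits
	if(exp<= 0) return (uint16_t)sign;                           // flush small values to +/-0
	if(exp>=15) return (uint16_t)(sign|0x7FFFu);                 // clamp to largest representable
	return (uint16_t)(sign | ((uint32_t)exp<<11) | man);
}
// The unpacking direction mirrors this bit layout. None of it maps to a single hardware
// instruction, so both pack and unpack cost real shift/mask/add operations per value.
```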

Bottom line: use FP16S by default for best performance, and use FP16C only when you need the extra accuracy in such edge cases and the GPU has enough spare compute power for the software conversion.
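
In case it helps with trying both: assuming the current repository layout, the storage format is selected at compile time in src/defines.hpp by enabling exactly one of the two memory-compression defines, roughly like below (check the file in your version for the exact names and comments):

```c++
// src/defines.hpp (sketch, not the verbatim file)
//#define FP16S // store the LBM data as IEEE-754 FP16: fastest on most GPUs, hardware conversion
#define FP16C   // store the LBM data in the custom 1-4-11 format: ~half the truncation error, software conversion
```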

For more details, see this paper.