The performance of fp16c is not as high as mentioned in readme.

lilux618 commented 2 years ago

I ran the exe on an A100 40GB and got the performance shown below. It gives only 6105 MLUPS with FP16C but 14563 MLUPS with FP16S. This is not consistent with the results in the table in the README.

[Screenshot]

ProjectPhysX commented 2 years ago

Strange... I just re-tested on our A100 40GB PCIe, and I get consistent results. Our system uses Nvidia driver 470.103.01, ECC is enabled on the A100. I also checked on 2 other systems with A100 40GB SXM4 and results are the same. FP16C is very heavy on compute, so you might get thermal throttling. Did you check for sufficient cooling?

lilux618 commented 2 years ago

I am very confused

ProjectPhysX commented 2 years ago

I'm confused too. 3 possibilities you can check for:

Do you have the full GPU allocated for the job, or is it split into 2 instances of which you only use 1?
A hardware issue such as insufficient cooling (maybe) or bad power delivery (unlikely).
Check with the older 470 driver; maybe the compiler in 510 was changed and gets confused (unlikely).
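The three checks above could be gathered in one query. This is only a sketch: it assumes `nvidia-smi` is on the PATH and that the installed driver accepts the `--query-gpu` field names listed below (they can be confirmed with `nvidia-smi --help-query-gpu`). The helper names `build_command`, `parse_rows` and `query_gpus` are made up for illustration.

```python
import csv
import io
import shutil
import subprocess

# Fields relevant to the checklist: driver version, MIG partitioning,
# temperature, current vs. maximum SM clock, and power draw.
# Assumption: these names are accepted by the installed nvidia-smi.
QUERY_FIELDS = [
    "name",
    "driver_version",
    "mig.mode.current",
    "temperature.gpu",
    "clocks.sm",
    "clocks.max.sm",
    "power.draw",
]


def build_command(fields=QUERY_FIELDS):
    """Build the nvidia-smi command line for a CSV query without header."""
    return ["nvidia-smi", "--query-gpu=" + ",".join(fields), "--format=csv,noheader"]


def parse_rows(text, fields=QUERY_FIELDS):
    """Turn the CSV output into one dict per GPU, keyed by field name."""
    rows = []
    for record in csv.reader(io.StringIO(text), skipinitialspace=True):
        if record:
            rows.append(dict(zip(fields, (value.strip() for value in record))))
    return rows


def query_gpus():
    """Run the query; return an empty list if nvidia-smi is not installed."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(build_command(), capture_output=True, text=True, check=True)
    return parse_rows(result.stdout)
```

Calling `query_gpus()` while the benchmark is running and comparing `clocks.sm` against `clocks.max.sm` gives a rough hint whether the GPU is clocking down under the FP16C load.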

lilux618 commented 2 years ago

Replying to the three possibilities:

"Do you have the full GPU allocated for the job, or is it split in 2 instances and you only use 1?" Yes, I have the full GPU allocated for this job, and it is not split into 2 instances, as the screenshot I posted shows.

"Hardware issue like insufficient cooling (maybe) or bad power delivery (unlikely)." I don't think cooling is a problem. I have tested this A100 with the HPL and HPCG benchmarks, and the results are consistent with public results.

"Check with older 470 driver, maybe the compiler on 510 was changed and gets confused (unlikely)." I haven't tested this.

lilux618 commented 2 years ago

I have another question: even in your results list, the MLUPS with FP16C is lower than with FP16S. So in which cases should we choose FP16C instead of FP16S?

ProjectPhysX commented 2 years ago

FP16S is memory compression to the hardware-supported IEEE-754 FP16 format with 1 bit for sign, 5 bits for exponent and 10 bits for mantissa. The conversion is done in hardware, so it only doubles FLOPs/Byte compared to FP32: it halves the transferred bytes but does not need significantly more FLOPs for the conversion.
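To make the 1/5/10 bit layout concrete, here is a small sketch using Python's `struct` module, whose `"e"` format code is IEEE-754 half precision. It only illustrates the number format; it is not the conversion code used in the solver, and the function names are made up for this example.

```python
import struct


def float_to_fp16_bits(x: float) -> int:
    """Round x to IEEE-754 half precision and return the 16 raw bits."""
    return struct.unpack("<H", struct.pack("<e", x))[0]


def fp16_bits_to_float(bits: int) -> float:
    """Decode 16 raw bits as an IEEE-754 half-precision value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def split_fields(bits: int):
    """Split the 16 bits into (sign, exponent, mantissa) = (1, 5, 10) bits."""
    sign = bits >> 15
    exponent = (bits >> 10) & 0x1F
    mantissa = bits & 0x3FF
    return sign, exponent, mantissa


if __name__ == "__main__":
    x = 0.1
    bits = float_to_fp16_bits(x)
    stored = fp16_bits_to_float(bits)
    print(split_fields(bits), stored, abs(stored - x) / x)
```

For values in the normal range, the relative rounding error of the round trip stays at or below 2^-11, which follows from the 10 stored mantissa bits.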

FP16C is a custom floating-point format with 1 bit for sign, 4 bits for exponent and 11 bits for mantissa. This halves the truncation error compared to FP16S, so it is more accurate, though the difference is only visible in edge-case scenarios. However, the conversion is not supported in hardware and has to be emulated in software, which increases FLOPs/Byte by a factor of ~8 compared to FP32. Hardware that combines very fast memory with low compute power struggles with that.
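The effect of the extra mantissa bit can be checked with a short sketch. It rounds a value to a given number of stored mantissa bits and ignores the exponent range entirely, so it is not the FP16C conversion itself (bias and range handling are left out), only an illustration of why 11 bits halve the rounding error of 10 bits. All function names here are hypothetical.

```python
import math


def round_to_mantissa_bits(x: float, mantissa_bits: int) -> float:
    """Round x to `mantissa_bits` stored mantissa bits (plus the implicit leading 1).

    The exponent range is NOT limited here; only the mantissa is shortened.
    """
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)  # x = m * 2**e with 0.5 <= |m| < 1
    scale = 2 ** (mantissa_bits + 1)  # leading bit + stored bits
    return math.ldexp(round(m * scale) / scale, e)


def relative_error(x: float, mantissa_bits: int) -> float:
    """Relative rounding error of x at the given mantissa width."""
    return abs(round_to_mantissa_bits(x, mantissa_bits) - x) / abs(x)


def worst_relative_error(mantissa_bits: int, samples: int = 10000) -> float:
    """Largest relative error over evenly spaced sample values in [1, 2)."""
    return max(relative_error(1.0 + i / samples, mantissa_bits) for i in range(samples))


if __name__ == "__main__":
    print("10 mantissa bits:", worst_relative_error(10))
    print("11 mantissa bits:", worst_relative_error(11))
```

The worst-case relative error is bounded by 2^-11 for 10 mantissa bits and by 2^-12 for 11 mantissa bits. The price of the extra mantissa bit is one exponent bit, so the representable range is narrower.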

Bottom line, use

FP32 when accuracy is the main constraint
FP16C when both memory and accuracy are the main constraints
FP16S when both memory and compute time are the main constraints
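The reasoning behind these recommendations can be sketched with a simple roofline-style estimate: throughput is limited by whichever of memory bandwidth or compute runs out first. The device numbers and per-update costs below are hypothetical and chosen only for round arithmetic. The FLOPs of FP16C are set to 4x those of FP32 because, with bytes halved, that corresponds to the ~8x FLOPs/Byte mentioned above.

```python
def attainable_mlups(bandwidth_gb_s: float, compute_gflops: float,
                     bytes_per_update: float, flops_per_update: float) -> float:
    """Roofline-style bound: updates/s is the minimum of the memory and compute limits."""
    memory_limit = bandwidth_gb_s * 1e9 / bytes_per_update
    compute_limit = compute_gflops * 1e9 / flops_per_update
    return min(memory_limit, compute_limit) / 1e6  # million lattice updates per second


# Hypothetical per-update costs, NOT measured values.
BYTES_FP32, FLOPS_FP32 = 100.0, 50.0
FORMATS = {
    "FP32": (BYTES_FP32, FLOPS_FP32),
    "FP16S": (BYTES_FP32 / 2, FLOPS_FP32),      # 2x FLOPs/Byte
    "FP16C": (BYTES_FP32 / 2, FLOPS_FP32 * 4),  # ~8x FLOPs/Byte
}


def compare(bandwidth_gb_s: float, compute_gflops: float) -> dict:
    """Estimate MLUPS for each format on a device with the given limits."""
    return {name: attainable_mlups(bandwidth_gb_s, compute_gflops, b, f)
            for name, (b, f) in FORMATS.items()}


if __name__ == "__main__":
    # Hypothetical device with fast memory but little compute.
    print(compare(bandwidth_gb_s=1000.0, compute_gflops=1000.0))
    # Hypothetical device with the same memory but plenty of compute.
    print(compare(bandwidth_gb_s=1000.0, compute_gflops=10000.0))
```

On the first hypothetical device FP16C becomes compute-bound and falls behind even FP32, while on the second all three formats stay memory-bound and both FP16 variants reach the same estimate.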

For more details, see this paper.

ProjectPhysX / FluidX3D

The performance of fp16c is not as high as mentioned in readme. #2