Tencent / TurboTransformers

A fast and user-friendly runtime for transformer inference (BERT, ALBERT, GPT-2, decoders, etc.) on CPU and GPU.

Benchmarks use .half() (FP16)? #156

Open · djstrong opened this issue 4 years ago

djstrong commented 4 years ago

Do the torch versions in the benchmark https://github.com/Tencent/TurboTransformers/blob/master/docs/bert.md use .half() (FP16)?
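
(For context, enabling FP16 in a PyTorch benchmark is typically a one-line cast. A minimal sketch of what `.half()` does, assuming the standard transformers API and a CUDA device; the model name is illustrative, not necessarily the one in the benchmark:)

```python
import torch
from transformers import BertModel, BertTokenizer

# Illustrative model; docs/bert.md may benchmark a different config.
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval().cuda()

# .half() casts all parameters and buffers to FP16.
model = model.half()

inputs = tokenizer("hello world", return_tensors="pt").to("cuda")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.dtype)  # torch.float16
```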

feifeibear commented 4 years ago

No, we use FP32.

djstrong commented 4 years ago

With transformers, FP16 on GPU usually does not change the scores, but inference is 3-4 times faster. I hope to see FP16 benchmarks for TurboTransformers.
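
(A rough way to check that claim with the transformers library. This is a hedged sketch, not the TurboTransformers benchmark: the model, batch shape, and iteration counts are illustrative, and results will vary by GPU:)

```python
import time
import torch
from transformers import BertModel

def bench(model, batch, iters=100):
    # Warm up, then time forward passes with CUDA synchronization
    # so GPU work is actually finished before the clock stops.
    with torch.no_grad():
        for _ in range(10):
            model(batch)
        torch.cuda.synchronize()
        start = time.time()
        for _ in range(iters):
            model(batch)
        torch.cuda.synchronize()
    return (time.time() - start) / iters

batch = torch.randint(0, 30522, (1, 128), device="cuda")  # fake token ids
model = BertModel.from_pretrained("bert-base-uncased").eval().cuda()
print("FP32 s/iter:", bench(model, batch))
# .half() converts the module in place and returns it, so bench FP32 first.
print("FP16 s/iter:", bench(model.half(), batch))
```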

feifeibear commented 4 years ago

Interesting. Feedback from our customers indicates our FP32 version is fast enough. We believe quantization on CPU is in higher demand, so we currently have no plan for GPU FP16. We will do it later.
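
(For reference, CPU quantization of the kind alluded to here is often done in PyTorch via dynamic INT8 quantization of the linear layers. A minimal sketch, assuming the standard torch.quantization API; this is not TurboTransformers' implementation:)

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased").eval()

# Dynamic quantization stores nn.Linear weights as INT8 and quantizes
# activations on the fly; it targets CPU inference only.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

ids = torch.randint(0, 30522, (1, 128))  # fake token ids
with torch.no_grad():
    out = quantized(ids)
print(out.last_hidden_state.shape)  # torch.Size([1, 128, 768])
```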