Open · djstrong opened this issue 4 years ago
Do the torch versions in the benchmark (https://github.com/Tencent/TurboTransformers/blob/master/docs/bert.md) use `.half()` (FP16)?

No, we use FP32.

Using transformers, FP16 on GPU usually does not change the scores, but inference is 3-4 times faster. I hope to see FP16 benchmarks for turbotransformers as well.
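For reference, a minimal sketch of the kind of FP16 inference being described here, using plain PyTorch and transformers (this is not the benchmark's code; the model name and input text are arbitrary examples):

```python
# Minimal FP16 inference sketch with transformers + PyTorch.
# Assumes a CUDA GPU; "bert-base-uncased" is an arbitrary example model,
# not necessarily the one used in the TurboTransformers benchmark.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

# .half() casts the model's floating-point parameters to FP16,
# then move the model to the GPU and switch to eval mode.
model = model.half().cuda().eval()

inputs = tokenizer("Hello, world!", return_tensors="pt")
inputs = {k: v.cuda() for k, v in inputs.items()}  # token ids stay integer

with torch.no_grad():  # inference only, no autograd overhead
    outputs = model(**inputs)

print(outputs.last_hidden_state.dtype)  # torch.float16
```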
Interesting. Feedback from our customers indicates our FP32 version is fast enough. We believe quantization on CPU is in higher demand, so we currently have no plan for GPU FP16. We will do it later.
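For context, "quantization on CPU" commonly means INT8 dynamic quantization of the linear layers. A minimal sketch of that technique with stock PyTorch follows; this illustrates the general approach only and is not TurboTransformers' own implementation:

```python
# Sketch of dynamic INT8 quantization on CPU with stock PyTorch.
# Note: TurboTransformers' own quantization path may differ.
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-uncased").eval()

# Convert nn.Linear weights to INT8; activations are quantized
# dynamically at runtime. This path is CPU-only.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Dummy token ids (30522 is bert-base-uncased's vocabulary size).
input_ids = torch.randint(0, 30522, (1, 32))
with torch.no_grad():
    outputs = quantized(input_ids)
```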