Open selephantjy opened 3 years ago

Hello, we find that the inference time of your models varies a lot on different CPUs. For example, on an Intel(R) Xeon(R) Gold 6266C it is 5 times slower than on an Intel(R) Xeon(R) Gold 6151. Do you know why that is? Is it due to the model's inference optimization, or to calculation-performance optimizations on certain CPUs? We compared the two CPUs and didn't see much difference:

| Model | Frequency | L1d | L1i | L2 | L3 |
|---|---|---|---|---|---|
| Intel(R) Xeon(R) Gold 6151 | 3.00GHz | 32KB | 32KB | 1MB | 25MB |
| Intel(R) Xeon(R) Gold 6266C | 3.00GHz | 32KB | 32KB | 1MB | 30MB |
CPUs can provide quite different support for floating point operations, which Transformer networks rely on heavily. I don't know about your two specific CPUs, but it might be that one of them has an architecture better optimized for float operations.
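If you want to check which float/SIMD instruction sets each CPU actually exposes, a quick Linux-only sketch (the helper name here is just for illustration) is to read `/proc/cpuinfo`:

```python
# List which SIMD/float extensions this CPU reports (Linux only).
def cpu_float_flags(interesting=("avx", "avx2", "avx512f", "fma")):
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                return {flag: flag in flags for flag in interesting}
    return {}

print(cpu_float_flags())
# e.g. {'avx': True, 'avx2': True, 'avx512f': False, 'fma': True}
```

If one machine reports AVX-512 or FMA and the other does not, that alone can explain a large gap for the matrix-multiply-heavy inference of Transformer models.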
What you can try is quantization: https://github.com/UKPLab/sentence-transformers/blob/master/examples/training/distillation/README.md
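Roughly, the dynamic quantization described there looks like this (a minimal sketch using stock PyTorch; the model name is just an example, see the README above for the full recipe):

```python
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model

# Dynamic quantization: Linear-layer weights are stored as int8 and
# dequantized on the fly, which usually speeds up CPU inference.
model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

embeddings = model.encode(["A test sentence"])
```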
Also see this PR, where we observed quite different performance depending on the CPU: https://github.com/UKPLab/sentence-transformers/pull/777
Thank you very much! I tried quantization and it gives a 10% speed-up. We finally found that on the 8-core 6151 machine the code automatically uses multiple threads, while on the 8-core 6266C we need to call torch.set_num_threads(8) manually. After adding this line, the inference time is very similar on the two CPUs.
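For reference, the minimal change that fixed it for us looks like this (the model name is just an example):

```python
import torch
from sentence_transformers import SentenceTransformer

# On the 6266C machine the code did not use multiple threads by default,
# so we pin the intra-op thread count explicitly before inference.
torch.set_num_threads(8)          # one thread per physical core here
print(torch.get_num_threads())    # verify the setting took effect

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model
embeddings = model.encode(["A test sentence"] * 32, batch_size=32)
```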