tianchiguaixia opened this issue 2 years ago

Dear all, when I use the ALBERT model, e.g. model = SentenceTransformer('voidful/albert_chinese_tiny', device="cpu"); embedding = model.encode(["I am good"]), a single encoding request takes only 7-15 ms. But when I increase the concurrency (200 threads within 5 s), each request takes 200-1000 ms. Is there any way to reduce the time of concurrent encoding?
The threads fight for the same CPUs. Encoding is already parallelized internally, so adding another thread decreases the encoding speed of every thread, as they all compete for the same CPUs.
Thank you. I also want to ask: if I add more servers and deploy more encode services, can I reduce the latency under concurrency? And if I only optimize at the code/algorithm level, how can I reduce the time of concurrent encode requests on a single server? I look forward to your suggestions.
As mentioned, the bottleneck is the number of CPUs you have.
When your server has 4 physical (not logical!) CPU cores, all 4 cores are used to encode an input even when you run just 1 worker.
When you run 4 workers instead, all of them compete for those same 4 cores. Effectively, encoding becomes roughly 4 times slower per worker, plus additional overhead from the contention.
I can recommend these blog posts: https://huggingface.co/blog/bert-cpu-scaling-part-1 and https://huggingface.co/blog/bert-cpu-scaling-part-2
So it makes sense to limit each worker process to its own fixed share of the CPU (for example by capping its thread count), rather than letting every worker compete for all cores; see the sketch below.
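For illustration, a minimal sketch of capping one worker's thread pool (torch.get_num_threads and torch.set_num_threads are standard PyTorch calls; the model name is the one used in this thread):

```python
import torch
from sentence_transformers import SentenceTransformer

# PyTorch parallelizes even a single encode() call across roughly
# all available cores by default.
print("intra-op threads:", torch.get_num_threads())

# Cap this worker's thread pool so multiple workers on one machine
# do not oversubscribe the same cores.
torch.set_num_threads(2)

model = SentenceTransformer("voidful/albert_chinese_tiny", device="cpu")
embedding = model.encode(["I am good"])
```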
Thank you. During the stress test, CPU utilization on my single server reached 80%, but memory usage hardly changed. I would like to ask: is it possible to trade higher memory consumption for lower latency under concurrency? Currently only one encode service runs on the server. Could running multiple encode services use more memory and reduce latency?
No, that is not possible. But you could create an embeddings cache where you store the embeddings of inputs that appear frequently.
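A minimal sketch of such a cache (the CachedEncoder wrapper and its names are illustrative, not a sentence-transformers API; a production version would bound the cache size, e.g. with an LRU policy, and guard it with a lock under concurrent requests):

```python
from sentence_transformers import SentenceTransformer

class CachedEncoder:
    """Illustrative wrapper that caches embeddings of frequent inputs."""

    def __init__(self, model_name):
        self.model = SentenceTransformer(model_name, device="cpu")
        self.cache = {}  # sentence -> embedding vector

    def encode(self, sentences):
        # Encode only the sentences we have not seen before.
        missing = [s for s in sentences if s not in self.cache]
        if missing:
            for sentence, emb in zip(missing, self.model.encode(missing)):
                self.cache[sentence] = emb
        return [self.cache[s] for s in sentences]

encoder = CachedEncoder("voidful/albert_chinese_tiny")
# The second call for the same input is served from the cache.
print(encoder.encode(["I am good"])[0][:5])
print(encoder.encode(["I am good"])[0][:5])
```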
Thank you very much for your reply. My server has only one physical CPU with 10 cores and 20 threads, so the multi-server approach from your first answer does not apply. With a single physical CPU, what can reduce the encoding time under concurrency?
10 cores is quite good. You could run 5 workers, each using 2 cores.
Thank you very much. How do I set up 5 workers, each using 2 cores, with the encode method of SentenceTransformer? My current approach is to use nginx to run five services (ports 5001-5005), each with its own encode service, and to load-balance requests across the five ports. With 200 concurrent threads the latency is still very high. I would especially like to know how to set up the "5 workers, each using 2 cores" you mentioned.
Have a look at the linked articles on how to limit processes to specific CPU cores.
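As a rough Linux-only sketch of that idea (os.sched_setaffinity and torch.set_num_threads are real APIs; the specific core IDs are just an example for one of the five workers):

```python
import os
import torch
from sentence_transformers import SentenceTransformer

# Pin this worker process to two dedicated cores (Linux-only API).
# Worker 1 gets cores {0, 1}, worker 2 gets {2, 3}, and so on.
os.sched_setaffinity(0, {0, 1})

# Match PyTorch's intra-op thread pool to the two assigned cores.
torch.set_num_threads(2)

model = SentenceTransformer("voidful/albert_chinese_tiny", device="cpu")
embedding = model.encode(["I am good"])
```

The same pinning can be done from the shell when launching each service, e.g. taskset -c 0,1 for the worker on port 5001 and taskset -c 2,3 for 5002, while nginx keeps balancing across the five ports.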
You're amazing! Following your method, I successfully reduced the latency under concurrency.
Thank you very much. I would also like to ask: 1. If the sentence-transformers model is converted to an ONNX model, will it be faster? 2. How do I convert a sentence-transformers model to ONNX?
I never tested ONNX. See the PRs here; there is an example of the conversion.
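For reference, here is one possible conversion path; this is a sketch under the assumption that the Hugging Face optimum library is used (it is not a built-in sentence-transformers feature), with mean pooling added by hand to mimic the usual pooling layer:

```python
import torch
from transformers import BertTokenizer
from optimum.onnxruntime import ORTModelForFeatureExtraction

model_id = "voidful/albert_chinese_tiny"
# This checkpoint ships a BERT-style vocab, hence BertTokenizer.
tokenizer = BertTokenizer.from_pretrained(model_id)
# export=True converts the PyTorch weights to ONNX and serves them
# through ONNX Runtime.
model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)

inputs = tokenizer(["I am good"], return_tensors="pt", padding=True)
token_embeddings = model(**inputs).last_hidden_state

# Mean pooling over non-padding tokens, mimicking the usual
# sentence-transformers pooling layer.
mask = inputs["attention_mask"].unsqueeze(-1).float()
embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(embedding.shape)
```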