UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
15.23k stars 2.47k forks

concurrent encoding time #1321

Open tianchiguaixia opened 2 years ago

tianchiguaixia commented 2 years ago

Dears, when I use the albert model, e.g. model = SentenceTransformer('voidful/albert_chinese_tiny', device="cpu"); embedding = model.encode(["I am good"]), a single encoding request takes only 7-15 ms. When I increase the concurrency (200 threads over 5 s) and request again, each item takes 200-1000 ms. Is there any way to reduce the encoding time under concurrency?

nreimers commented 2 years ago

The threads fight for the same CPUs. Encoding is already parallelized across cores, so adding another thread decreases the encoding speed of each thread, as they all compete for the same CPUs.

tianchiguaixia commented 2 years ago

Thank you. I also want to ask: if I add more servers and deploy more encode services, can I reduce the latency under concurrency? And if I can only optimize at the code/algorithm level, how can I reduce the time of concurrent encode requests on a single server? I look forward to your suggestions.

nreimers commented 2 years ago

As mentioned, the bottleneck is the number of CPUs you have.

When your server has 4 physical (not logical!) CPUs, all 4 cores are used to encode an input even when you just use 1 worker.

When you now run 4 workers, all 4 workers fight for the same 4 cores. Effectively, encoding time will be about 4 times slower per worker plus additional overhead as all workers fight for the same number of CPUs.

I can recommend these blog posts:
https://huggingface.co/blog/bert-cpu-scaling-part-1
https://huggingface.co/blog/bert-cpu-scaling-part-2

So it makes sense to limit each worker to its own subset of the CPU cores.
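For instance (a hypothetical sketch, not taken from the linked posts): the intra-op thread count of each worker can be capped via environment variables, set before torch / sentence_transformers is imported, so 5 workers on 10 cores each use 2 threads instead of all fighting for all 10:

```python
import os

# Hypothetical sketch: cap BLAS/OpenMP intra-op threads for this worker.
# These variables must be set BEFORE torch / sentence_transformers is
# imported, otherwise the thread pools are already initialized.
os.environ["OMP_NUM_THREADS"] = "2"
os.environ["MKL_NUM_THREADS"] = "2"

# from sentence_transformers import SentenceTransformer  # import afterwards
# model = SentenceTransformer('voidful/albert_chinese_tiny', device="cpu")

print(os.environ["OMP_NUM_THREADS"])  # -> 2
```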

tianchiguaixia commented 2 years ago

Thank you. During the stress test, the CPU usage of my single server reached 80%, but memory usage hardly changed. Is it possible to trade more memory consumption for lower concurrency latency? Currently only one encode service runs on my server; could I run multiple encode services, using more memory, to reduce latency?

nreimers commented 2 years ago

No, that is not possible. But you could create an embedding cache that stores the embeddings of inputs that appear frequently.
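A minimal sketch of such a cache, using functools.lru_cache (the _encode_one stub below stands in for model.encode([text])[0] from SentenceTransformer, which you would call in a real service):

```python
from functools import lru_cache

calls = 0  # counts how often the (expensive) model is actually invoked

def _encode_one(text):
    """Stand-in for model.encode([text])[0] from SentenceTransformer."""
    global calls
    calls += 1
    return [float(len(text))]  # dummy embedding

@lru_cache(maxsize=100_000)
def encode_cached(text):
    # Repeated inputs hit the cache and skip the model entirely.
    return tuple(_encode_one(text))

encode_cached("I am good")
encode_cached("I am good")   # served from the cache, no model call
print(calls)  # -> 1
```

The embedding is returned as a tuple so it is hashable; convert back to a list or array at the call site if needed.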

tianchiguaixia commented 2 years ago

Thank you very much for your reply. Currently my server has only one physical CPU with 10 cores and 20 threads, so the approach from your first answer does not apply. With a single physical CPU, what can reduce the encoding time under concurrency?

nreimers commented 2 years ago

10 cores is quite good. You could run 5 workers, each using 2 cores.

tianchiguaixia commented 2 years ago

Thank you very much. How do I set up "5 workers, each using 2 cores" in the encode method of SentenceTransformer? My current approach is to use nginx to run five services (ports 5001-5005), each port with its own encode service, load-balanced across the five ports. But with 200 concurrent threads the latency is still very high. I would really like to know how to set up the 5 workers with 2 cores each that you mentioned.

nreimers commented 2 years ago

Have a look at the linked articles for how to limit processes to specific CPU cores.
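As a hypothetical sketch of the idea (worker count and core layout assumed from the 10-core / 5-service setup described above), each worker process can compute its own 2-core slice and pin itself to it on Linux with the standard-library call os.sched_setaffinity:

```python
import os

NUM_WORKERS = 5           # hypothetical: the 5 services on ports 5001-5005
CORES = list(range(10))   # the 10 physical cores mentioned above

def cores_for(worker_id, num_workers=NUM_WORKERS, cores=CORES):
    """Return the set of cores worker `worker_id` (0-based) is pinned to."""
    per = len(cores) // num_workers
    return set(cores[worker_id * per:(worker_id + 1) * per])

# In each worker process (Linux only), before loading the model:
# os.sched_setaffinity(0, cores_for(worker_id))

print(cores_for(0), cores_for(4))  # -> {0, 1} {8, 9}
```

The same pinning can be done from the shell with taskset -c when launching each service, so no code change inside the service is required.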

tianchiguaixia commented 2 years ago

You are amazing! Following your method, I successfully reduced the latency under concurrency.

tianchiguaixia commented 2 years ago

Thank you very much. I would also like to ask: 1. If the sentence-transformers model is converted to an ONNX model, will it be faster? 2. How do I convert a sentence-transformers model to ONNX?

nreimers commented 2 years ago

I never tested ONNX. See the PRs here; there is an example of the conversion.