UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0
15.33k stars 2.48k forks

Problems with using start_multi_process_pool() #2955

Open safwaqf opened 1 month ago

safwaqf commented 1 month ago

Why do I encounter a situation where the sentence list does not match the embedding list when I use start_multi_process_pool() to start the process pool and then start Python multithreading? For example:

batchNum: 1 queLen: 100, embLen: 98
batchNum: 2 queLen: 100, embLen: 102
batchNum: 3 queLen: 100, embLen: 102
batchNum: 4 queLen: 100, embLen: 98

Above I print the sentence list length and the embedding list length for four batches. The first batch came back with two embeddings too few, and those two missing embeddings ended up in the second batch's results. Similarly, the third batch came back with two embeddings too many, which appear to have come from the fourth batch.

tomaarsen commented 1 month ago

Hello!

Do you start the Python multithreading yourself? That shouldn't be needed. There's normally just 1 queue, and each process will continuously pop from that shared queue until it's empty. These processes will then also push to 1 shared output queue. This queue is sorted afterwards to ensure that we have the same order as the inputs, but we still have just 1 output queue.
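To make the failure mode concrete, here is a minimal single-process sketch (not the library's actual implementation) of what can go wrong when two concurrent callers share one output queue: a caller pops whatever chunk comes next, including chunks that belong to the other caller.

```python
import queue

# One shared output queue, standing in for the pool's internal results queue.
results = queue.Queue()

# Workers finish chunks in an arbitrary order. Here, chunks from two
# concurrent calls ("A" and "B", 100 sentences each) land interleaved.
for owner, size in [("A", 32), ("B", 32), ("A", 32), ("A", 36), ("B", 32), ("B", 36)]:
    results.put((owner, size))

# Caller A pops until it has at least 100 embeddings, taking whatever is
# next on the queue -- including a chunk that belongs to caller B.
collected, total = [], 0
while total < 100:
    owner, size = results.get()
    collected.append(owner)
    total += size

print(collected, total)  # => ['A', 'B', 'A', 'A'] 132
```

Caller A walks away with one of B's chunks and 132 embeddings for its 100 sentences, while B will come up short, which matches the mismatched queLen/embLen counts above.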

So, the usage is:

from sentence_transformers import SentenceTransformer

def main():
    model = SentenceTransformer("all-mpnet-base-v2")
    sentences = ["The weather is so nice!", "It's so sunny outside.", "He's driving to the movie theater.", "She's going to the cinema."] * 1000

    pool = model.start_multi_process_pool()
    embeddings = model.encode_multi_process(sentences, pool)
    model.stop_multi_process_pool(pool)

    print(embeddings.shape)
    # => (4000, 768)

if __name__ == "__main__":
    main()
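If the surrounding application genuinely needs to call encode_multi_process from several threads, one possible workaround (an untested sketch, not an official API of the library) is to serialize the calls with a lock, so only one caller talks to the shared pool at a time and the output queue only ever holds chunks for a single caller:

```python
import threading

# One lock guarding the shared multi-process pool (hypothetical helper,
# not part of sentence-transformers itself).
encode_lock = threading.Lock()

def encode_batch(model, pool, batch):
    # Only one thread submits to the shared pool at a time, so each
    # caller drains only its own chunks from the output queue.
    with encode_lock:
        return model.encode_multi_process(batch, pool)
```

Each thread would then call encode_batch(model, pool, sentences) instead of calling model.encode_multi_process directly.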

https://sbert.net/docs/package_reference/sentence_transformer/SentenceTransformer.html?highlight=multi_process#sentence_transformers.SentenceTransformer.encode_multi_process