Open safwaqf opened 1 month ago
Hello!
Do you start the Python multithreading yourself? That shouldn't be needed. There's normally just 1 queue, and each process will continuously pop from that shared queue until it's empty. These processes will then also push to 1 shared output queue. This queue is sorted afterwards to ensure that we have the same order as the inputs, but we still have just 1 output queue.
So, the usage is:
from sentence_transformers import SentenceTransformer

def main():
    model = SentenceTransformer("all-mpnet-base-v2")
    sentences = ["The weather is so nice!", "It's so sunny outside.", "He's driving to the movie theater.", "She's going to the cinema."] * 1000

    pool = model.start_multi_process_pool()
    embeddings = model.encode_multi_process(sentences, pool)
    model.stop_multi_process_pool(pool)

    print(embeddings.shape)
    # => (4000, 768)

if __name__ == "__main__":
    main()
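To make the shared-queue flow described above concrete, here is a minimal sketch of that pattern using plain multiprocessing. This is not the sentence-transformers implementation; the names worker, encode_with_pool, and chunk_size are invented for the illustration. One input queue feeds every worker process, all workers push to one output queue, and the results are re-sorted by chunk index at the end so the output order matches the input order.

import multiprocessing as mp

def worker(input_queue: mp.Queue, output_queue: mp.Queue) -> None:
    # Each worker keeps popping chunks from the single shared input queue
    # until it receives the sentinel value.
    while True:
        item = input_queue.get()
        if item is None:
            break
        chunk_id, sentences = item
        # Stand-in for model.encode(...); here we just upper-case the text.
        result = [s.upper() for s in sentences]
        # Every worker pushes its result to the single shared output queue.
        output_queue.put((chunk_id, result))

def encode_with_pool(sentences, num_workers=2, chunk_size=2):
    input_queue, output_queue = mp.Queue(), mp.Queue()
    workers = [mp.Process(target=worker, args=(input_queue, output_queue))
               for _ in range(num_workers)]
    for p in workers:
        p.start()

    # Push chunks tagged with their index, then one sentinel per worker.
    chunks = [sentences[i:i + chunk_size] for i in range(0, len(sentences), chunk_size)]
    for chunk_id, chunk in enumerate(chunks):
        input_queue.put((chunk_id, chunk))
    for _ in workers:
        input_queue.put(None)

    # Collect exactly one result per chunk, then sort by chunk index so the
    # output order matches the inputs again.
    results = [output_queue.get() for _ in chunks]
    results.sort(key=lambda pair: pair[0])

    for p in workers:
        p.join()
    return [item for _, chunk_result in results for item in chunk_result]

if __name__ == "__main__":
    print(encode_with_pool(["a", "b", "c", "d", "e"]))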
Why do I encounter a situation where the sentence list length does not match the embedding list length when I use start_multi_process_pool() to start the process pool and then also start Python multithreading myself? For example:

batchNum: 1, queLen: 100, embLen: 98
batchNum: 2, queLen: 100, embLen: 102
batchNum: 3, queLen: 100, embLen: 102
batchNum: 4, queLen: 100, embLen: 98

The output above shows the sentence list length and the embedding list length for four batches. The first batch came back with 2 embeddings too few, and those 2 missing embeddings showed up in the second batch. Likewise, the third batch came back with 2 extra embeddings, while the fourth batch was 2 short.
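For what it's worth, the mismatch pattern above is what you would expect if several threads submit batches to the same pool and then all read from the same shared output queue: whichever thread pops first gets the result, regardless of which batch it belongs to. The sketch below is a hypothetical illustration of that race using plain queue and threading, not sentence-transformers code; producer, consumer, and the batch/chunk ids are invented for the example.

import queue
import random
import threading
import time

output_queue = queue.Queue()

def producer(batch_id, num_chunks):
    # Simulates workers finishing chunks of one batch at random times and
    # pushing them all onto the single shared output queue.
    for chunk_id in range(num_chunks):
        time.sleep(random.uniform(0, 0.01))
        output_queue.put((batch_id, chunk_id))

def consumer(batch_id, num_chunks, received):
    # Each thread pops num_chunks results, but the queue has no idea which
    # batch a result belongs to, so a thread can easily pick up chunks that
    # were produced for the other thread's batch.
    for _ in range(num_chunks):
        received[batch_id].append(output_queue.get())

received = {1: [], 2: []}
threads = []
for batch_id in (1, 2):
    threads.append(threading.Thread(target=producer, args=(batch_id, 5)))
    threads.append(threading.Thread(target=consumer, args=(batch_id, 5, received)))
for t in threads:
    t.start()
for t in threads:
    t.join()

for batch_id, items in received.items():
    foreign = sum(1 for item in items if item[0] != batch_id)
    print(f"batch {batch_id}: received {len(items)} chunks, {foreign} from the other batch")

If that is what is happening, it would be consistent with the reply above suggesting not to start your own multithreading on top of the pool: with a single call draining the shared output queue, every chunk goes back to the call that submitted it.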