Jacobh2 opened this issue 3 years ago
Sadly I don't know why this happens.
If you use Docker, it might be easier to partition your dataset into multiple chunks and run multiple containers that each work on a different part of your data.
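A rough sketch of that partitioning approach (the file name, environment variables, and model below are only illustrative):

```python
# Sketch: each container encodes one slice of the data, selected via two
# (hypothetical) environment variables CHUNK_INDEX and NUM_CHUNKS.
import os

import numpy as np
from sentence_transformers import SentenceTransformer

chunk_index = int(os.environ.get("CHUNK_INDEX", "0"))  # which slice this container handles
num_chunks = int(os.environ.get("NUM_CHUNKS", "1"))    # how many containers are running in total

with open("sentences.txt", encoding="utf-8") as f:     # placeholder input file
    sentences = [line.strip() for line in f if line.strip()]

my_slice = sentences[chunk_index::num_chunks]          # every num_chunks-th sentence, offset by chunk_index

model = SentenceTransformer("all-MiniLM-L6-v2")        # any sentence-transformers model
embeddings = model.encode(my_slice, batch_size=128, show_progress_bar=True)

np.save(f"embeddings_{chunk_index}.npy", embeddings)   # merge the per-chunk files afterwards
```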
I continued to try to narrow down what was happening and basically re-implemented this lib's multi-process methods using the Python multiprocess package instead of the PyTorch one, and made it work. Didn't really see any speedups though, so I think I'll try a different approach.
Your suggestion is really nice and I might actually do something like that if I need more speed (unfortunately I don't have the ability to use a GPU) 👏
Yes, on a CPU machine you will not see much speed-up, as the operations are already parallelized across the different CPU cores for encoding one batch of sentences. Running a second process will compete for the same CPU time.
The method `SentenceTransformer.start_multi_process_pool()` utilizes `torch.multiprocessing`. Torch has a known issue when running inside Docker, as discussed here: https://github.com/pytorch/pytorch/issues/2244. The proposed solution of adding `--shm-size 8G` to the `docker run` command worked for me.
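For example (the image name and script are placeholders):

```bash
docker run --shm-size 8G my-embedding-image python encode.py
```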
This worked for me, thank you very much. Just add that `--shm-size 8G` before the image name in the `docker run` command. It doesn't matter how much overall memory you have allotted for the code in Docker/EC2; unless we provide a shared memory size, it throws a bus error.
It worked for me as well. However, I'm wondering if this could be caused by the interprocess communication trying to serialize the whole model object (see this line). The model contains the list of modules, which can be huge. If this is the case, then it could be a better idea to defer the loading/initialization of the model object to the subprocesses. What do you think?
I am not an expert in this area, and that could be a reason. But my issue in particular only occurred when I was running this code inside a Docker container; locally the same code worked well. Another interesting thing I observed: Docker containers have 64 MB as their default shared memory limit, which is ridiculously low for anything. So unless we explicitly set the shm size to something higher, we won't be able to use multiprocessing with sentence-transformers inside Docker, and therefore on an EC2 instance.
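You can confirm what the container actually got by looking at the shared-memory mount from inside it:

```bash
# tmpfs mounted at /dev/shm; with Docker defaults this shows 64M
df -h /dev/shm
```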
Exactly, running in Docker is only a matter of passing `--shm-size=8g` to the `docker run` command. For Kubernetes it's a bit more tricky, but it can be done by mounting a memory-backed volume at the `/dev/shm` device. Although this solves the symptoms of the problem, it does not solve the problem itself: too much use of shared memory. I'm currently testing a modification of the SentenceTransformer that passes in the model name instead of the whole model itself when starting the multiprocess pool. I'll keep you posted if it works.
I can confirm that my fix is working. So instead of serializing the whole model over shared memory, I send the model name to the worker and the worker loads a new copy.
To illustrate the idea (not PR ready), I changed this:
```python
p = ctx.Process(target=SentenceTransformer._encode_multi_process_worker, args=(cuda_id, self, input_queue, output_queue), daemon=True)
```
to:
```python
p = ctx.Process(target=SentenceTransformer._encode_multi_process_worker, args=(cuda_id, self._model_name_or_path, input_queue, output_queue), daemon=True)
```
And changed the worker function to:
```python
@staticmethod
def _encode_multi_process_worker(target_device: str, model_name_or_path: str, input_queue: mp.Queue, results_queue: mp.Queue) -> None:
    # Load a fresh copy of the model in the worker instead of receiving it over IPC
    model = SentenceTransformer(model_name_or_path)
    while True:
        try:
            ...
```
It's running in Kubernetes with the default 64 MB shared memory size, using 4 GPUs on a g5.12xlarge, and it is fast!
Any luck on a proper fix for this issue? I'm not running in a Docker container and I still see the `bus error (core dumped)` error. Here's the script I tried:
"""
This example starts multiple processes (1 per GPU), which encode
sentences in parallel. This gives a near linear speed-up
when encoding large text collections.
It also demonstrates how to stream data which is helpful in case you don't
want to wait for an extremely large dataset to download, or if you want to
limit the amount of memory used. More info about dataset streaming:
https://huggingface.co/docs/datasets/stream
"""
from sentence_transformers import SentenceTransformer, LoggingHandler
import logging
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm
logging.basicConfig(
format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()]
)
# Important, you need to shield your code with if __name__. Otherwise, CUDA runs into issues when spawning new processes.
if __name__ == "__main__":
# Set params
data_stream_size = 16384 # Size of the data that is loaded into memory at once
chunk_size = 1024 # Size of the chunks that are sent to each process
encode_batch_size = 128 # Batch size of the model
# Load a large dataset in streaming mode. more info: https://huggingface.co/docs/datasets/stream
dataset = load_dataset("yahoo_answers_topics", split="train", streaming=True)
dataloader = DataLoader(dataset.with_format("torch"), batch_size=data_stream_size)
# Define the model
model = SentenceTransformer("intfloat/e5-small-v2")
# Start the multi-process pool on all available CUDA devices
pool = model.start_multi_process_pool()
for i, batch in enumerate(tqdm(dataloader)):
# Compute the embeddings using the multi-process pool
sentences = batch["best_answer"]
batch_emb = model.encode_multi_process(sentences, pool, chunk_size=chunk_size, batch_size=encode_batch_size)
print("Embeddings computed for 1 batch. Shape:", batch_emb.shape)
# Optional: Stop the processes in the pool
model.stop_multi_process_pool(pool)
Hi!
I wanted to try the multi-processing feature described here and slightly modified one of the examples to run on CPU only:
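A minimal sketch of that kind of CPU-only modification (the model name and worker count below are illustrative, not necessarily what was actually run; the relevant change is passing explicit `target_devices` to `start_multi_process_pool`):

```python
from sentence_transformers import SentenceTransformer

if __name__ == "__main__":
    sentences = ["This is sentence {}".format(i) for i in range(10000)]

    model = SentenceTransformer("all-MiniLM-L6-v2")

    # Spawn CPU worker processes instead of one process per GPU
    pool = model.start_multi_process_pool(target_devices=["cpu"] * 4)

    # Distribute the sentences across the worker processes
    emb = model.encode_multi_process(sentences, pool)
    print("Embeddings computed. Shape:", emb.shape)

    model.stop_multi_process_pool(pool)
```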
I'm running this in a Docker container (`python:3.9.2`) and have installed the following by running:

But when I then run the above code, I get the following:
So it doesn't seem to play ball with the multi-processing pool. I've tested encoding the sentences using the simpler `encode` method and that works as expected. Is this something related to this issue: https://github.com/huggingface/tokenizers/issues/537?
Any help or insights on how to fix this are highly appreciated. My ultimate goal is to encode a lot of sentences in parallel to speed things up!