UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Bus error when using encode_multi_process #844

Open Jacobh2 opened 3 years ago

Jacobh2 commented 3 years ago

Hi!

I wanted to try the multi-processing feature described here and slightly modified one of the examples to run on CPU only:

from sentence_transformers import SentenceTransformer, LoggingHandler
import logging

logging.basicConfig(format='%(asctime)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S',
                    level=logging.INFO,
                    handlers=[LoggingHandler()])

#Important, you need to shield your code with if __name__. Otherwise, CUDA runs into issues when spawning new processes.
if __name__ == '__main__':

    #Create a large list of 100k sentences
    sentences = ["This is sentence {}".format(i) for i in range(100000)]

    #Define the model
    model = SentenceTransformer('stsb-distilbert-base', device='cpu')

    #Start the multi-process pool on two CPU workers
    pool = model.start_multi_process_pool(target_devices=['cpu', 'cpu'])

    #Compute the embeddings using the multi-process pool
    emb = model.encode_multi_process(sentences, pool)
    print("Embeddings computed. Shape:", emb.shape)

    #Optional: Stop the processes in the pool
    model.stop_multi_process_pool(pool)

I'm running this in a Docker container (python:3.9.2) and have installed the following:

certifi==2020.12.5
chardet==4.0.0
click==7.1.2
filelock==3.0.12
idna==2.10
joblib==1.0.1
nltk==3.5
numpy==1.20.2
packaging==20.9
Pillow==8.1.2
pyparsing==2.4.7
regex==2021.3.17
requests==2.25.1
sacremoses==0.0.43
scikit-learn==0.24.1
scipy==1.6.2
sentence-transformers==1.0.3
sentencepiece==0.1.95
six==1.15.0
threadpoolctl==2.1.0
tokenizers==0.10.1
torch==1.8.1+cpu
torchaudio==0.8.1
torchvision==0.9.1+cpu
tqdm==4.59.0
transformers==4.4.2
typing-extensions==3.7.4.3
urllib3==1.26.4

by running

pip install torch==1.8.1+cpu torchvision==0.9.1+cpu torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install sentence-transformers

But when I then run the above code, I get the following:

root@641301ae4d48:/# python test.py
2021-03-30 16:50:22 - Load pretrained SentenceTransformer: stsb-distilbert-base
2021-03-30 16:50:22 - Did not find folder stsb-distilbert-base
2021-03-30 16:50:22 - Search model on server: http://sbert.net/models/stsb-distilbert-base.zip
2021-03-30 16:50:22 - Downloading sentence transformer model from http://sbert.net/models/stsb-distilbert-base.zip and saving it at /root/.cache/torch/sentence_transformers/sbert.net_models_stsb-distilbert-base
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 245M/245M [01:01<00:00, 3.97MB/s]
2021-03-30 16:51:26 - Load SentenceTransformer from folder: /root/.cache/torch/sentence_transformers/sbert.net_models_stsb-distilbert-base
2021-03-30 16:51:28 - Start multi-process pool on devices: cpu, cpu
Bus error
root@641301ae4d48:/# /usr/local/lib/python3.9/multiprocessing/resource_tracker.py:216: UserWarning: resource_tracker: There appear to be 6 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

So it doesn't seem to play ball with the multi-process pool. I've tested encoding the sentences using the simpler encode method, and that works as expected.

Is this something related to this issue: https://github.com/huggingface/tokenizers/issues/537?

Any help or insight into how to fix this is highly appreciated. My ultimate goal is to encode a lot of sentences in parallel to speed things up!

nreimers commented 3 years ago

Sadly I don't know why this happens.

If you use Docker, it might be easier to partition your dataset into multiple chunks and run multiple containers, each working on a different part of your data.
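
For example, something along these lines could split the work across containers (an untested sketch; the WORKER_INDEX / NUM_WORKERS environment variables and the output file name are hypothetical and only show how each container could pick its slice):

import os

import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical setup: each container is started with its own index, e.g.
#   docker run -e WORKER_INDEX=0 -e NUM_WORKERS=4 ... python encode_part.py
worker_index = int(os.environ.get("WORKER_INDEX", "0"))
num_workers = int(os.environ.get("NUM_WORKERS", "1"))

sentences = ["This is sentence {}".format(i) for i in range(100000)]

# Each container only encodes its own slice of the data
my_sentences = sentences[worker_index::num_workers]

model = SentenceTransformer('stsb-distilbert-base', device='cpu')
embeddings = model.encode(my_sentences, batch_size=32, show_progress_bar=True)

# Save the partial result so the slices can be merged afterwards
np.save("embeddings_part_{}.npy".format(worker_index), embeddings)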

Jacobh2 commented 3 years ago

I continued trying to narrow down what was happening and basically re-implemented this lib's multi-process methods using Python's own multiprocessing package instead of the one from PyTorch, and got it to work. I didn't really see any speedup though, so I think I'll try a different approach.
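
A rough sketch of that kind of workaround (not the exact code, just to illustrate the idea; the chunk size and worker count are arbitrary, and each worker loads its own copy of the model):

import multiprocessing as mp

import numpy as np
from sentence_transformers import SentenceTransformer

_model = None

def _init_worker(model_name):
    # Each worker process loads its own copy of the model
    global _model
    _model = SentenceTransformer(model_name, device='cpu')

def _encode_chunk(chunk):
    return _model.encode(chunk, batch_size=32)

if __name__ == '__main__':
    sentences = ["This is sentence {}".format(i) for i in range(100000)]
    chunks = [sentences[i:i + 5000] for i in range(0, len(sentences), 5000)]

    with mp.get_context('spawn').Pool(processes=2, initializer=_init_worker,
                                      initargs=('stsb-distilbert-base',)) as pool:
        results = pool.map(_encode_chunk, chunks)

    print("Embeddings computed. Shape:", np.concatenate(results).shape)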

Your suggestion is really nice and I might actually do something like that if I need more speed (unfortunately I don't have the ability to use a GPU) 👏

nreimers commented 3 years ago

Yes, on a CPU machine you will not see much speed-up, as the operations are already parallelized across the different CPU cores for encoding one batch of sentences. Running a second process will compete for the same CPU time.
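
For illustration (a small sketch; the thread cap of 2 is only an example for the case where you run several CPU worker processes anyway):

import torch

# PyTorch already uses multiple threads for CPU ops within a single process
print("Intra-op threads:", torch.get_num_threads())

# If you run several CPU worker processes anyway, capping the threads per
# process avoids oversubscribing the cores (the value 2 is just an example)
torch.set_num_threads(2)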

okhonko commented 1 year ago

The method SentenceTransformer.start_multi_process_pool() utilizes torch.multiprocessing. Torch has a known issue when running inside Docker, as discussed here: https://github.com/pytorch/pytorch/issues/2244

The proposed solution of adding --shm-size 8G to the docker run command worked for me.

TejasReddyBiophy commented 10 months ago

> The method SentenceTransformer.start_multi_process_pool() utilizes torch.multiprocessing. Torch has a known issue when running inside Docker, as discussed here: pytorch/pytorch#2244
>
> The proposed solution of adding --shm-size 8G to the docker run command worked for me.

This worked for me. Thank you very much. Just add --shm-size 8G before the image name in the docker run command. It doesn't matter how much overall memory you have allotted to the container in Docker / EC2; unless you raise the shared memory size, it throws a bus error.

michellemay commented 10 months ago

It worked for me as well. However, I'm wondering if this could be caused by the inter-process communication trying to serialize the whole model object (see this line). The model contains the list of modules, which can be huge. If that is the case, it might be better to defer the loading/initialization of the model object to the subprocesses. What do you think?

TejasReddyBiophy commented 10 months ago

> It worked for me as well. However, I'm wondering if this could be caused by the inter-process communication trying to serialize the whole model object (see this line). The model contains the list of modules, which can be huge. If that is the case, it might be better to defer the loading/initialization of the model object to the subprocesses. What do you think?

I am not an expert in this area, and that could be a reason. But my issue in particular only appeared when running this code inside a Docker container; locally the same code worked well. Another interesting thing I observed: Docker containers have a default shared memory limit of 64 MB, which is ridiculously low for anything. So unless we explicitly set a higher shm size, we can't use multiprocessing with sentence-transformers inside Docker, and therefore on EC2.
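
A quick way to check that limit from inside the container (a small sketch using only the standard library):

import shutil

# How much shared memory is available inside the container; the Docker
# default of 64 MiB is quickly exhausted by torch.multiprocessing
total, used, free = shutil.disk_usage('/dev/shm')
print('/dev/shm total: {:.0f} MiB, free: {:.0f} MiB'.format(total / 2**20, free / 2**20))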

michellemay commented 10 months ago

Exactly. Running in Docker is only a matter of passing --shm-size=8g to the docker run command. For Kubernetes it's a bit more tricky, but it can be done by mounting a memory-backed volume at /dev/shm. Although this addresses the symptoms of the problem, it does not solve the problem itself: too much use of shared memory. I'm currently testing a modification of SentenceTransformer that passes the model name instead of the whole model when starting the multi-process pool. I'll keep you posted if it works.

michellemay commented 10 months ago

I can confirm that my fix is working. Instead of serializing the whole model over shared memory, I send the model name to the worker and the worker loads a new copy.

To illustrate the idea (not PR ready), I changed this:

p = ctx.Process(target=SentenceTransformer._encode_multi_process_worker, args=(cuda_id, self, input_queue, output_queue), daemon=True)

to:

p = ctx.Process(target=SentenceTransformer._encode_multi_process_worker, args=(cuda_id, self._model_name_or_path, input_queue, output_queue), daemon=True)

And changed the worker function to:

    @staticmethod
    def _encode_multi_process_worker(target_device: str, model_name_or_path: str, input_queue: mp.Queue, results_queue: mp.Queue) -> None:
        # Load a fresh copy of the model inside the worker instead of
        # receiving the serialized model from the parent process
        model = SentenceTransformer(model_name_or_path)

        while True:
            try:
...

It's running in Kubernetes with the default 64 MB shared memory size, using 4 GPUs on a g5.12xlarge, and it is fast!

bhavika commented 6 months ago

Any luck on a proper fix for this issue? I'm not running in a Docker container and I still see a bus error (core dumped). Here's the script I tried:

"""
This example starts multiple processes (1 per GPU), which encode
sentences in parallel. This gives a near linear speed-up
when encoding large text collections.
It also demonstrates how to stream data which is helpful in case you don't
want to wait for an extremely large dataset to download, or if you want to
limit the amount of memory used. More info about dataset streaming:
https://huggingface.co/docs/datasets/stream
"""

from sentence_transformers import SentenceTransformer, LoggingHandler
import logging
from datasets import load_dataset
from torch.utils.data import DataLoader
from tqdm import tqdm

logging.basicConfig(
    format="%(asctime)s - %(message)s", datefmt="%Y-%m-%d %H:%M:%S", level=logging.INFO, handlers=[LoggingHandler()]
)

# Important, you need to shield your code with if __name__. Otherwise, CUDA runs into issues when spawning new processes.
if __name__ == "__main__":
    # Set params
    data_stream_size = 16384  # Size of the data that is loaded into memory at once
    chunk_size = 1024  # Size of the chunks that are sent to each process
    encode_batch_size = 128  # Batch size of the model

    # Load a large dataset in streaming mode. more info: https://huggingface.co/docs/datasets/stream
    dataset = load_dataset("yahoo_answers_topics", split="train", streaming=True)
    dataloader = DataLoader(dataset.with_format("torch"), batch_size=data_stream_size)

    # Define the model
    model = SentenceTransformer("intfloat/e5-small-v2")

    # Start the multi-process pool on all available CUDA devices
    pool = model.start_multi_process_pool()

    for i, batch in enumerate(tqdm(dataloader)):
        # Compute the embeddings using the multi-process pool
        sentences = batch["best_answer"]
        batch_emb = model.encode_multi_process(sentences, pool, chunk_size=chunk_size, batch_size=encode_batch_size)
        print("Embeddings computed for 1 batch. Shape:", batch_emb.shape)

    # Optional: Stop the processes in the pool
    model.stop_multi_process_pool(pool)