UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Multi-GPU inference #2869

Open ibar2711 opened 3 months ago

ibar2711 commented 3 months ago

Suppose I want to use a larger model for computing embeddings, such as SFR-Embedding-2_R by Salesforce. Is there a way to load the model across multiple GPUs? Currently, it seems like only training supports multi-GPU mode, but inference doesn't.

anshuchen commented 2 months ago

Are you asking whether we can distribute the memory load across multiple GPUs? If so, I'm curious about that as well. I can't fit these large models onto one GPU, so I'd like to spread the model across multiple GPUs. I understand this is possible in the transformers library, which sentence-transformers is built on, so maybe there's a way for us to use device_map in sentence-transformers?
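
For reference, here's roughly what I mean in plain transformers (just a sketch, not tested with this model):

# Sketch of device_map in plain transformers: with accelerate installed,
# device_map="auto" spreads the model's layers across the available GPUs.
from transformers import AutoModel, AutoTokenizer

model_name = "Salesforce/SFR-Embedding-2_R"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, device_map="auto")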

ir2718 commented 2 months ago

@ibar2711 @anshuchen

Hi,

There is no completely automatic way to do this, but you can still do it.

First, you should find out the automatic device map for the machine you're using:

from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoModel, AutoConfig
from pprint import pprint

model_name = "Salesforce/SFR-Embedding-2_R"

# Instantiate the model on the meta device (no weights are actually allocated),
# then let accelerate propose a placement of its submodules across the
# available GPUs and the CPU.
with init_empty_weights():
    config = AutoConfig.from_pretrained(model_name)
    model = AutoModel.from_config(config)
device_map = infer_auto_device_map(model)

pprint(device_map)

In my case, this prints out:

OrderedDict([('embed_tokens', 0),
             ('layers.0', 0),
             ('layers.1', 0),
             ('layers.2', 0),
             ('layers.3', 0),
             ('layers.4', 0),
             ('layers.5', 0),
             ('layers.6', 0),
             ('layers.7', 0),
             ('layers.8', 0),
             ('layers.9', 0),
             ('layers.10', 0),
             ('layers.11', 0),
             ('layers.12.self_attn', 0),
             ('layers.12.mlp.gate_proj', 0),
             ('layers.12.mlp.up_proj', 1),
             ('layers.12.mlp.down_proj', 1),
             ('layers.12.mlp.act_fn', 1),
             ('layers.12.input_layernorm', 1),
             ('layers.12.post_attention_layernorm', 1),
             ('layers.13', 1),
             ('layers.14', 1),
             ('layers.15', 1),
             ('layers.16', 1),
             ('layers.17', 1),
             ('layers.18', 1),
             ('layers.19', 1),
             ('layers.20', 1),
             ('layers.21', 1),
             ('layers.22', 1),
             ('layers.23', 1),
             ('layers.24', 1),
             ('layers.25', 1),
             ('layers.26.self_attn', 1),
             ('layers.26.mlp.gate_proj', 1),
             ('layers.26.mlp.up_proj', 'cpu'),
             ('layers.26.mlp.down_proj', 'cpu'),
             ('layers.26.mlp.act_fn', 'cpu'),
             ('layers.26.input_layernorm', 'cpu'),
             ('layers.26.post_attention_layernorm', 'cpu'),
             ('layers.27', 'cpu'),
             ('layers.28', 'cpu'),
             ('layers.29', 'cpu'),
             ('layers.30', 'cpu'),
             ('layers.31', 'cpu'),
             ('norm', 'cpu')])
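
You can also influence how accelerate splits the model by passing a max_memory budget per device (just a sketch; the GiB values below are placeholders for your own hardware):

# Sketch: cap per-device memory so accelerate plans the placement within
# these budgets. Adjust the values to your actual hardware.
device_map = infer_auto_device_map(
    model,
    max_memory={0: "40GiB", 1: "40GiB", "cpu": "100GiB"},
)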

Before the next part, you will need to comment out lines 318 and 541 of SentenceTransformer.py, as these lines will otherwise produce an error, e.g.:

NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.

I'm pretty sure this could be avoided by just checking whether model_args contains a device_map key. I don't know if this is an actual bug or the expected behaviour (@tomaarsen will probably know).
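
Something along these lines inside SentenceTransformer.py is what I have in mind (purely a hypothetical sketch; the names are illustrative, not the actual library internals):

# Hypothetical sketch of the check mentioned above: skip the .to(device) call
# when a device_map was passed, since accelerate has already dispatched the
# submodules. The variable names here are illustrative only.
if "device_map" not in (model_args or {}):
    module.to(device)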

Now, you will probably need to move some of the GPU modules to the CPU, as you could otherwise get device mismatch errors such as:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!

In my case, this works:

from sentence_transformers.models import Transformer, Pooling
from sentence_transformers import SentenceTransformer
from collections import OrderedDict
import torch

model_name = "Salesforce/SFR-Embedding-2_R"

# Manually assign submodules to devices: embed_tokens and layers 0-10 on GPU 0,
# layers 13-24 on GPU 1, layers 11-12 and 25-31 offloaded to the CPU, and the
# final norm back on GPU 0.
device_map = OrderedDict({
    "embed_tokens": 0,
    "layers.0": 0,
    "layers.1": 0,
    "layers.2": 0,
    "layers.3": 0,
    "layers.4": 0,
    "layers.5": 0,
    "layers.6": 0,
    "layers.7": 0,
    "layers.8": 0,
    "layers.9": 0,
    "layers.10": 0,
    "layers.11": "cpu",
    "layers.12.self_attn": "cpu",
    "layers.12.mlp.gate_proj": "cpu",
    "layers.12.mlp.up_proj": "cpu",
    "layers.12.mlp.down_proj": "cpu",
    "layers.12.mlp.act_fn": "cpu",
    "layers.12.input_layernorm": "cpu",
    "layers.12.post_attention_layernorm": "cpu",
    "layers.13": 1,
    "layers.14": 1,
    "layers.15": 1,
    "layers.16": 1,
    "layers.17": 1,
    "layers.18": 1,
    "layers.19": 1,
    "layers.20": 1,
    "layers.21": 1,
    "layers.22": 1,
    "layers.23": 1,
    "layers.24": 1,
    "layers.25": "cpu",
    "layers.26.self_attn": "cpu",
    "layers.26.mlp.gate_proj": "cpu",
    "layers.26.mlp.up_proj": "cpu",
    "layers.26.mlp.down_proj": "cpu",
    "layers.26.mlp.act_fn": "cpu",
    "layers.26.input_layernorm": "cpu",
    "layers.26.post_attention_layernorm": "cpu",
    "layers.27": "cpu",
    "layers.28": "cpu",
    "layers.29": "cpu",
    "layers.30": "cpu",
    "layers.31": "cpu",
    "norm": 0,
})

# Build the modules manually so that the device_map (and the offloading
# options) gets forwarded to transformers when the model is loaded.
model = Transformer(
    model_name_or_path=model_name,
    model_args={
        "device_map": device_map,
        "offload_folder": "offload",
        "offload_state_dict": True,
        "torch_dtype": torch.float16,
    },
)
pool = Pooling(word_embedding_dimension=model.get_word_embedding_dimension())

st = SentenceTransformer(modules=[model, pool])

embs = st.encode(["Some text", "Some other text"])

print(embs.shape)  # (2, 4096)

Also, keep in mind that the pooling I'm using (mean pooling, the default) is not the pooling the authors use. You can easily implement theirs using the code from the Hugging Face model card.
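
If the authors' pooling is last-token pooling (as with many Mistral-based embedding models), recent sentence-transformers versions should let you request it directly from the Pooling module; a sketch, assuming your installed version supports the "lasttoken" mode:

# Sketch: swap the default mean pooling for last-token pooling, assuming the
# installed sentence-transformers version supports pooling_mode="lasttoken".
pool = Pooling(
    word_embedding_dimension=model.get_word_embedding_dimension(),
    pooling_mode="lasttoken",
)
st = SentenceTransformer(modules=[model, pool])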

Hope this helps.

anshuchen commented 2 months ago

Thank you so much for the detailed writeup!