ibar2711 opened this issue 3 months ago
Are you asking about whether we can distribute the memory load across multiple GPUs? If so, I am curious about that as well. I can't fit these large models onto one GPU, so I'd like to spread the model across multiple GPUs. Not sure if that's possible.
I understand that this is possible in the transformers module, which I think sentence-transformers is built on. Maybe there's a way for us to use device_map in sentence-transformers?
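For reference, this is roughly the transformers usage I have in mind (it needs accelerate installed, and "auto" is just one possible value):

from transformers import AutoModel

# In plain transformers, device_map="auto" asks accelerate to spread the
# weights across the available GPUs (and the CPU, if needed).
model = AutoModel.from_pretrained(
    "Salesforce/SFR-Embedding-2_R",
    device_map="auto",
)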
@ibar2711 @anshuchen
Hi,
there is no completely automatic way to do this, but you can still do it.
First, you should find out the automatic device map for the machine you're using:
from accelerate import infer_auto_device_map, init_empty_weights
from transformers import AutoModel, AutoConfig
from pprint import pprint
model_name = "Salesforce/SFR-Embedding-2_R"
with init_empty_weights():
    config = AutoConfig.from_pretrained(model_name)
    model = AutoModel.from_config(config)

device_map = infer_auto_device_map(model)
pprint(device_map)
In my case, this prints out:
OrderedDict([('embed_tokens', 0),
('layers.0', 0),
('layers.1', 0),
('layers.2', 0),
('layers.3', 0),
('layers.4', 0),
('layers.5', 0),
('layers.6', 0),
('layers.7', 0),
('layers.8', 0),
('layers.9', 0),
('layers.10', 0),
('layers.11', 0),
('layers.12.self_attn', 0),
('layers.12.mlp.gate_proj', 0),
('layers.12.mlp.up_proj', 1),
('layers.12.mlp.down_proj', 1),
('layers.12.mlp.act_fn', 1),
('layers.12.input_layernorm', 1),
('layers.12.post_attention_layernorm', 1),
('layers.13', 1),
('layers.14', 1),
('layers.15', 1),
('layers.16', 1),
('layers.17', 1),
('layers.18', 1),
('layers.19', 1),
('layers.20', 1),
('layers.21', 1),
('layers.22', 1),
('layers.23', 1),
('layers.24', 1),
('layers.25', 1),
('layers.26.self_attn', 1),
('layers.26.mlp.gate_proj', 1),
('layers.26.mlp.up_proj', 'cpu'),
('layers.26.mlp.down_proj', 'cpu'),
('layers.26.mlp.act_fn', 'cpu'),
('layers.26.input_layernorm', 'cpu'),
('layers.26.post_attention_layernorm', 'cpu'),
('layers.27', 'cpu'),
('layers.28', 'cpu'),
('layers.29', 'cpu'),
('layers.30', 'cpu'),
('layers.31', 'cpu'),
('norm', 'cpu')])
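If you want to steer how the layers get split, for example to keep more on the GPUs and less on the CPU, infer_auto_device_map also accepts a max_memory mapping. The sizes below are placeholders, not measurements from my machine:

# Optional: cap how much each device may receive; keys are GPU indices plus "cpu".
device_map = infer_auto_device_map(
    model,
    max_memory={0: "20GiB", 1: "20GiB", "cpu": "60GiB"},
)
pprint(device_map)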
Before the next part, you will need to comment out SentenceTransformer.py lines 318 and 541, as these lines will produce an error, e.g.:
NotImplementedError: Cannot copy out of meta tensor; no data! Please use torch.nn.Module.to_empty() instead of torch.nn.Module.to() when moving module from meta to a different device.
I'm pretty sure this could be avoided by just checking whether model_args contains the device_map key (a rough sketch of what I mean is below). I don't know if this is an actual bug or the expected behaviour (@tomaarsen will probably know).
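Something like the following guard around those .to() calls is what I have in mind; the variable names are made up for illustration and are not the actual code at those lines:

# Hypothetical guard (illustrative only): skip the explicit device move when
# the user has already delegated placement to accelerate via a device_map.
if "device_map" not in (model_args or {}):
    module.to(device)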
Now, you will probably need to move some of the GPU modules to the CPU, as you could otherwise run into device mismatch errors like:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0!
In my case, this works:
from sentence_transformers.models import Transformer, Pooling
from sentence_transformers import SentenceTransformer
from collections import OrderedDict
import torch
model_name = "Salesforce/SFR-Embedding-2_R"
device_map = {}
device_map["embed_tokens"] = 0
device_map["layers.0"] = 0
device_map["layers.1"] = 0
device_map["layers.2"] = 0
device_map["layers.3"] = 0
device_map["layers.4"] = 0
device_map["layers.5"] = 0
device_map["layers.6"] = 0
device_map["layers.7"] = 0
device_map["layers.8"] = 0
device_map["layers.9"] = 0
device_map["layers.10"] = 0
device_map["layers.11"] = "cpu"
device_map["layers.12.self_attn"] = "cpu"
device_map["layers.12.mlp.gate_proj"] = "cpu"
device_map["layers.12.mlp.up_proj"] ="cpu"
device_map["layers.12.mlp.down_proj"] = "cpu"
device_map["layers.12.mlp.act_fn"] ="cpu"
device_map["layers.12.input_layernorm"] ="cpu"
device_map["layers.12.post_attention_layernorm"] ="cpu"
device_map["layers.13"] = 1
device_map["layers.14"] = 1
device_map["layers.15"] = 1
device_map["layers.16"] = 1
device_map["layers.17"] = 1
device_map["layers.18"] = 1
device_map["layers.19"] = 1
device_map["layers.20"] = 1
device_map["layers.21"] = 1
device_map["layers.22"] = 1
device_map["layers.23"] = 1
device_map["layers.24"] = 1
device_map["layers.25"] = "cpu"
device_map["layers.26.self_attn"] = "cpu"
device_map["layers.26.mlp.gate_proj"] = "cpu"
device_map["layers.26.mlp.up_proj"] = "cpu"
device_map["layers.26.mlp.down_proj"] = "cpu"
device_map["layers.26.mlp.act_fn"] = "cpu"
device_map["layers.26.input_layernorm"] = "cpu"
device_map["layers.26.post_attention_layernorm"] = "cpu"
device_map["layers.27"] = "cpu"
device_map["layers.28"] = "cpu"
device_map["layers.29"] = "cpu"
device_map["layers.30"] = "cpu"
device_map["layers.31"] = "cpu"
device_map["norm"] = 0
device_map = OrderedDict(device_map)
model = Transformer(
    model_name_or_path=model_name,
    model_args={
        "device_map": device_map,
        "offload_folder": "offload",
        "offload_state_dict": True,
        "torch_dtype": torch.float16,
    },
)
pool = Pooling(word_embedding_dimension=model.get_word_embedding_dimension())
st = SentenceTransformer(modules=[model, pool])
embs = st.encode(["Some text", "Some other text"])
print(embs.shape) # (2, 4096)
Also, keep in mind that the pooling I'm using is not the pooling the authors use. You can easily implement that using the code from the Hugging Face model card (see the sketch below).
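If I remember the model card correctly, the authors use last-token pooling, which the built-in Pooling module can reproduce; double-check against the card before relying on this:

# Assumption: SFR-Embedding-2_R uses last-token pooling (per its model card).
# If so, this is closer to the authors' setup than the mean pooling above.
pool = Pooling(
    word_embedding_dimension=model.get_word_embedding_dimension(),
    pooling_mode="lasttoken",
)
st = SentenceTransformer(modules=[model, pool])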
Hope this helps.
Thank you so much for the detailed writeup!
Suppose I want to employ a larger model for computing embeddings, such as SFR-Embedding-2_R by Salesforce. Is there a way to load the model onto multiple GPUs? Currently, it seems like only training supports multi-GPU mode, but inference doesn't.