Closed a-green-hand-jack closed 4 months ago
I changed the code slightly on my end, as I did not have access to the data2vec module.
import time
import matplotlib.pyplot as plt
import random
import numpy as np
import torch
from llm2vec import LLM2Vec
def measure_encoding_time(model, documents, batch_sizes):
    """Encode all documents once per batch size and record the wall-clock time."""
    times = {}
    for batch_size in batch_sizes:
        start_time = time.time()
        outputs = model.encode(sentences=documents, batch_size=batch_size)
        # Stop the clock right after encoding so the prints below are not timed.
        end_time = time.time()
        encoding_time = end_time - start_time
        times[batch_size] = encoding_time
        print(f"Batch size: {batch_size}, Number of documents: {len(documents)}")
        print(f"Output shape: {outputs.shape}")
        print(f"Batch size: {batch_size}, Encoding time: {encoding_time:.2f} seconds")
    return times
def plot_encoding_time(times):
    """Plot encoding time against batch size on log-log axes."""
    batch_sizes = list(times.keys())
    encoding_times = list(times.values())
    plt.figure(figsize=(10, 5))
    plt.plot(batch_sizes, encoding_times, marker="o")
    plt.xlabel("Batch Size")
    plt.ylabel("Encoding Time (seconds)")
    plt.title("Batch Size vs Encoding Time")
    plt.xscale("log")
    plt.yscale("log")
    plt.grid(True, which="both", ls="--")
    # Save before show(): after the window is closed, the saved figure can be empty.
    plt.savefig("batch_size_vs_encoding_time.png")
    plt.show()
if __name__ == "__main__":
    # Fix the random seeds for reproducibility.
    seed = 42
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    l2v = LLM2Vec.from_pretrained(
        "McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp",
        peft_model_name_or_path="McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp-unsup-simcse",
        device_map="cuda" if torch.cuda.is_available() else "cpu",
        torch_dtype=torch.bfloat16,
    )
    documents = [
"This gene's Ensembl id is ENSG00000000003. This gene's summary page is https://www.ncbi.nlm.nih.gov/gene/7105. The gene's summary is Official Symbol TSPAN6provided by HGNC Official Full Name tetraspanin 6provided by HGNC Primary source HGNC:HGNC:11858 See related Ensembl:ENSG00000000003 MIM:300191; AllianceGenome:HGNC:11858 Gene type protein coding RefSeq status REVIEWED Organism Homo sapiens Lineage Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo Also known as T245; TM4SF6; TSPAN-6 Summary The protein encoded by this gene is a member of the transmembrane 4 superfamily, also known as the tetraspanin family. Most of these members are cell-surface proteins that are characterized by the presence of four hydrophobic domains. The proteins mediate signal transduction events that play a role in the regulation of cell development, activation, growth and motility. The protein encoded by this gene is a cell surface glycoprotein and is highly similar in sequence to the transmembrane 4 superfamily member 2 protein. It functions as a negative regulator of retinoic acid-inducible gene I-like receptor-mediated immune signaling via its interaction with the mitochondrial antiviral signaling-centered signalosome. This gene uses alternative polyadenylation sites, and multiple transcript variants result from alternative splicing. [provided by RefSeq, Jul 2013] Expression Ubiquitous expression in colon (RPKM 15.1), ovary (RPKM 15.0) and 24 other tissues See more Orthologs mouse all"
    ] * 128
    batch_sizes_to_test = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]
    encoding_times = measure_encoding_time(
        model=l2v, documents=documents, batch_sizes=batch_sizes_to_test
    )
    print("Encoding times:", encoding_times)
    plot_encoding_time(encoding_times)
The code was run on a single H100; here is the resulting plot:
Are you sure the model is being loaded onto the GPU? Can you try running the same code that I did, or maybe share the data2vec module?
For CPU, I got this plot
Thank you very much for your reply! I'm sure this calculation is happening on the GPU and not the CPU. I used nvitop to monitor the GPU usage in my operations:
As for the 'data2vec' module you mentioned, only its load_llm2vec_model method is used here, to load l2v and tokenizer. Here is the corresponding code:
import os
from typing import Tuple

import torch
from llm2vec import LLM2Vec
from peft import PeftModel
from transformers import (
    AutoConfig,
    AutoModel,
    AutoTokenizer,
    PreTrainedTokenizer,
    PreTrainedTokenizerFast,
)

# Note: check_path and modify_json are helper functions that are not shown in this thread.


def load_llm2vec_model(
    llm2vec_model_path: str,
    base_llm_model_path: str,
    max_length: int = 2048,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
) -> Tuple[LLM2Vec, PreTrainedTokenizerFast | PreTrainedTokenizer]:
    if not all(check_path(p) for p in [llm2vec_model_path, base_llm_model_path]):
        raise FileNotFoundError(
            "One or more model paths do not exist or are not directories."
        )
    # Point the LLM2Vec config at the locally downloaded base model.
    change_config = {
        "path": os.path.join(llm2vec_model_path, "config.json"),
        "_name_or_path": base_llm_model_path,
        "auto_map": {
            "AutoModel": llm2vec_model_path
            + "--modeling_llama_encoder.LlamaEncoderModel"
        },
    }
    modify_json(json_file_path=change_config["path"], change_dict=change_config)
    # Point the adapter config at the locally downloaded base model as well.
    change_adapter = {
        "path": os.path.join(llm2vec_model_path, "adapter_config.json"),
        "base_model_name_or_path": base_llm_model_path,
    }
    modify_json(json_file_path=change_adapter["path"], change_dict=change_adapter)
    tokenizer = AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path=llm2vec_model_path
    )
    config = AutoConfig.from_pretrained(llm2vec_model_path)
    model = AutoModel.from_pretrained(
        llm2vec_model_path, config=config, local_files_only=True
    )
    # Loading MNTP (Masked Next Token Prediction) model.
    model = PeftModel.from_pretrained(
        model=model,
        model_id=llm2vec_model_path,
    )
    model.to(device)
    # Report the device only after the model has actually been moved.
    print(f"Model moved to device: {device}")
    # Wrapper for encoding and pooling operations
    l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=max_length)
    print("LLM2Vec model loaded successfully.")
    # print(l2v)
    return l2v, tokenizer
The method used here is a bit strange, because I downloaded the LLaMA model in advance to a local folder.
Can you try encoding with different batch sizes using any sentence-transformers model? Similar to LLM2Vec, you have to call the encode function with a list of sentences. This will help us determine whether it is an llm2vec-specific issue or a hardware issue.
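The comparison suggested above could be run with one small timing harness that treats the encoder as a plain callable, so the same loop can time LLM2Vec and a sentence-transformers model. This is only a minimal sketch: time_encode, sync_fn, and fake_encode are hypothetical names introduced here for illustration, not part of llm2vec or sentence-transformers.

```python
import time
from typing import Callable, Dict, List, Optional, Sequence


def time_encode(
    encode_fn: Callable[[List[str], int], object],
    documents: Sequence[str],
    batch_sizes: Sequence[int],
    sync_fn: Optional[Callable[[], None]] = None,
    warmup: bool = True,
) -> Dict[int, float]:
    """Time encode_fn(documents, batch_size) once per batch size.

    encode_fn is any callable taking (documents, batch_size).
    sync_fn is an optional hook (for example torch.cuda.synchronize)
    called before and after each timed run so pending GPU work is
    not attributed to the wrong measurement.
    """
    docs = list(documents)
    if warmup and docs:
        # One untimed call so one-off startup costs stay out of the first measurement.
        encode_fn(docs[:1], 1)
    times: Dict[int, float] = {}
    for batch_size in batch_sizes:
        if sync_fn is not None:
            sync_fn()
        start = time.perf_counter()
        encode_fn(docs, batch_size)
        if sync_fn is not None:
            sync_fn()
        times[batch_size] = time.perf_counter() - start
    return times


def fake_encode(sentences: List[str], batch_size: int) -> List[int]:
    """Stand-in encoder (hypothetical) so the harness can be tried without a model."""
    return [len(s) for s in sentences]


timings = time_encode(fake_encode, ["some text"] * 128, [2, 16, 128])
print(timings)
```

For the real comparison, the callable would wrap the model, for example lambda docs, bs: l2v.encode(docs, batch_size=bs) for LLM2Vec, and the same pattern with a SentenceTransformer instance's encode method. Keeping the harness identical across models means any difference in the curves comes from the model or hardware rather than from the measurement code.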
Thanks for the reminder, this is indeed a good way to test! First, I tried a very small model: sentence-transformers/all-MiniLM-L6-v2 (without using LLM2Vec). Here is the final result:
This result seems consistent with expectations, although there is a strange turning point in the middle.
Next, I tested nreimers/MiniLM-L6-H384-uncased (using LLM2Vec) and got the following results:
Considering the size of these models, it seems the problem may be in the hardware?
@a-green-hand-jack Can you try running Instructor XL? It is a sentence-transformers model similar in size to Sheared LLaMA, so it will be a more suitable comparison.
Closing as it is stale. @a-green-hand-jack - Feel free to re-open if you still need help on this.
Hi, I'm learning how to use llm2vec; it's a really interesting project! However, in my application I found that the relationship between batch size and running time is not what I expected. Generally speaking, the larger the batch size, the shorter the total running time should be, but I have not observed this. Here is the code I used to test:
Here is the resulting image:
The GPU I use is an L20 with 48 GB of video memory. My concern is that the issue may stem from the long input text combined with insufficient GPU performance.
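One detail of the test script in this thread is worth keeping in mind when reading the plots: it encodes 128 documents, so every batch size of 128 or more fits them in a single batch. The sketch below counts the forward passes per batch size, assuming encode walks the list in consecutive chunks of batch_size (an assumption about the implementation, not something confirmed in this thread). num_batches is a hypothetical helper.

```python
import math


def num_batches(num_documents: int, batch_size: int) -> int:
    """Number of chunks needed to cover num_documents in steps of batch_size."""
    return math.ceil(num_documents / batch_size)


num_documents = 128  # the script uses one document repeated 128 times
batch_sizes = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]

batches = {bs: num_batches(num_documents, bs) for bs in batch_sizes}
for bs, n in batches.items():
    print(f"batch_size={bs:>4}: {n} forward pass(es)")
```

Under that assumption, the runs with batch sizes from 128 to 4096 all do the same work, so any timing differences among them would be measurement noise rather than a batch-size effect. Using more documents than the largest batch size would make the upper end of the curve informative.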