Effect of batch size on run time

a-green-hand-jack commented 4 months ago

Hi, I'm learning how to use llm2vec, and it's a really interesting thing to work on! However, in my application the relationship between batch size and run time is not what I expected. Generally speaking, a larger batch size should give a shorter run time, but I have not observed this. Here is the code I used to test:

from data2vec import load_llm2vec_model
import time
import matplotlib.pyplot as plt
import random
import numpy as np
import torch

def measure_encoding_time(model, documents, batch_sizes):
    times = {}
    for batch_size in batch_sizes:
        start_time = time.time()
        outputs = model.encode(sentences=documents, batch_size=batch_size)
        print(f"Batch size: {batch_size}, Number of documents: {len(documents)}")
        print(f"Output shape: {outputs.shape}")
        end_time = time.time()
        encoding_time = end_time - start_time
        times[batch_size] = encoding_time
        print(f"Batch size: {batch_size}, Encoding time: {encoding_time:.2f} seconds")
    return times

def plot_encoding_time(times):
    batch_sizes = list(times.keys())
    encoding_times = list(times.values())

    plt.figure(figsize=(10, 5))
    plt.plot(batch_sizes, encoding_times, marker="o")
    plt.xlabel("Batch Size")
    plt.ylabel("Encoding Time (seconds)")
    plt.title("Batch Size vs Encoding Time")
    plt.xscale("log")
    plt.yscale("log")
    plt.grid(True, which="both", ls="--")
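    # Save before show(): once the window opened by show() is closed,
    # the figure may be cleared and savefig() would write a blank image.
    plt.savefig("batch_size_vs_encoding_time.png")
    plt.show()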

if __name__ == "__main__":
    seed = 42
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

    llm2vec_model_path = "../../pre-train-model/McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp"
    base_llm_model_path = "../../pre-train-model/princeton-nlp/Sheared-LLaMA-1.3B"
    l2v, tokenizer = load_llm2vec_model(
        llm2vec_model_path=llm2vec_model_path,
        base_llm_model_path=base_llm_model_path,
        max_length=2048,
    )
    documents = [
        "This gene's Ensembl id is ENSG00000000003. This gene's summary page is https://www.ncbi.nlm.nih.gov/gene/7105. The gene's summary is Official Symbol TSPAN6provided by HGNC Official Full Name tetraspanin 6provided by HGNC Primary source HGNC:HGNC:11858 See related Ensembl:ENSG00000000003 MIM:300191; AllianceGenome:HGNC:11858 Gene type protein coding RefSeq status REVIEWED Organism Homo sapiens Lineage Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo Also known as T245; TM4SF6; TSPAN-6 Summary The protein encoded by this gene is a member of the transmembrane 4 superfamily, also known as the tetraspanin family. Most of these members are cell-surface proteins that are characterized by the presence of four hydrophobic domains. The proteins mediate signal transduction events that play a role in the regulation of cell development, activation, growth and motility. The protein encoded by this gene is a cell surface glycoprotein and is highly similar in sequence to the transmembrane 4 superfamily member 2 protein. It functions as a negative regulator of retinoic acid-inducible gene I-like receptor-mediated immune signaling via its interaction with the mitochondrial antiviral signaling-centered signalosome. This gene uses alternative polyadenylation sites, and multiple transcript variants result from alternative splicing. [provided by RefSeq, Jul 2013] Expression Ubiquitous expression in colon (RPKM 15.1), ovary (RPKM 15.0) and 24 other tissues See more Orthologs mouse all"
    ] * 128

    batch_sizes_to_test = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]

    encoding_times = measure_encoding_time(
        model=l2v, documents=documents, batch_sizes=batch_sizes_to_test
    )

    print("Encoding times:", encoding_times)

    plot_encoding_time(encoding_times)

Here is the resulting image:

The GPU I use is an L20 with 48 GB of video memory. My concern is that the issue may stem from the long input text combined with insufficient GPU performance.
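A side note on the test setup: the script encodes 128 copies of one document, so any batch size of 128 or more should place everything in a single batch, and the points from 128 to 4096 would be expected to measure the same work. The sketch below only does that arithmetic. It assumes the encoder splits the input into consecutive chunks of `batch_size`, which has not been checked against the llm2vec source, and `num_batches` is a hypothetical helper name.

```python
import math


def num_batches(num_documents: int, batch_size: int) -> int:
    """Forward passes needed if documents are split into chunks of batch_size."""
    return math.ceil(num_documents / batch_size)


num_documents = 128  # the script repeats one document 128 times
batch_sizes = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]

# Map each requested batch size to the number of batches actually run.
batches_per_size = {bs: num_batches(num_documents, bs) for bs in batch_sizes}

for bs, n in batches_per_size.items():
    largest = min(bs, num_documents)  # a batch cannot hold more than all documents
    print(f"batch_size={bs:>4}: {n} batch(es), largest batch holds {largest} documents")
```

If the goal is to compare batch sizes up to 4096, the document list would need at least that many entries.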

vaibhavad commented 4 months ago

I changed the code slightly on my end as I did not have access to data2vec.


import time
import matplotlib.pyplot as plt
import random
import numpy as np
import torch
from llm2vec import LLM2Vec

def measure_encoding_time(model, documents, batch_sizes):
    times = {}
    for batch_size in batch_sizes:
        start_time = time.time()
        outputs = model.encode(sentences=documents, batch_size=batch_size)
        print(f"Batch size: {batch_size}, Number of documents: {len(documents)}")
        print(f"Output shape: {outputs.shape}")
        end_time = time.time()
        encoding_time = end_time - start_time
        times[batch_size] = encoding_time
        print(f"Batch size: {batch_size}, Encoding time: {encoding_time:.2f} seconds")
    return times

def plot_encoding_time(times):
    batch_sizes = list(times.keys())
    encoding_times = list(times.values())

    plt.figure(figsize=(10, 5))
    plt.plot(batch_sizes, encoding_times, marker="o")
    plt.xlabel("Batch Size")
    plt.ylabel("Encoding Time (seconds)")
    plt.title("Batch Size vs Encoding Time")
    plt.xscale("log")
    plt.yscale("log")
    plt.grid(True, which="both", ls="--")
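    # Save before show(): once the window opened by show() is closed,
    # the figure may be cleared and savefig() would write a blank image.
    plt.savefig("batch_size_vs_encoding_time.png")
    plt.show()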

if __name__ == "__main__":
    seed = 42
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

    l2v = LLM2Vec.from_pretrained(
        "McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp",
        peft_model_name_or_path="McGill-NLP/LLM2Vec-Sheared-LLaMA-mntp-unsup-simcse",
        device_map="cuda" if torch.cuda.is_available() else "cpu",
        torch_dtype=torch.bfloat16,
    )
    documents = [
        "This gene's Ensembl id is ENSG00000000003. This gene's summary page is https://www.ncbi.nlm.nih.gov/gene/7105. The gene's summary is Official Symbol TSPAN6provided by HGNC Official Full Name tetraspanin 6provided by HGNC Primary source HGNC:HGNC:11858 See related Ensembl:ENSG00000000003 MIM:300191; AllianceGenome:HGNC:11858 Gene type protein coding RefSeq status REVIEWED Organism Homo sapiens Lineage Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo Also known as T245; TM4SF6; TSPAN-6 Summary The protein encoded by this gene is a member of the transmembrane 4 superfamily, also known as the tetraspanin family. Most of these members are cell-surface proteins that are characterized by the presence of four hydrophobic domains. The proteins mediate signal transduction events that play a role in the regulation of cell development, activation, growth and motility. The protein encoded by this gene is a cell surface glycoprotein and is highly similar in sequence to the transmembrane 4 superfamily member 2 protein. It functions as a negative regulator of retinoic acid-inducible gene I-like receptor-mediated immune signaling via its interaction with the mitochondrial antiviral signaling-centered signalosome. This gene uses alternative polyadenylation sites, and multiple transcript variants result from alternative splicing. [provided by RefSeq, Jul 2013] Expression Ubiquitous expression in colon (RPKM 15.1), ovary (RPKM 15.0) and 24 other tissues See more Orthologs mouse all"
    ] * 128

    batch_sizes_to_test = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096]

    encoding_times = measure_encoding_time(
        model=l2v, documents=documents, batch_sizes=batch_sizes_to_test
    )

    print("Encoding times:", encoding_times)

    plot_encoding_time(encoding_times)

The code was run on a single H100; here is the resulting plot (image: batch_size_vs_encoding_time).

Are you sure the model is being loaded onto the GPU? Can you try running the same code that I did, or maybe share the data2vec module?
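One way to check the device directly is to inspect the model's parameters. The sketch below is duck-typed: `summarize_parameters` is a hypothetical helper that works on anything exposing `.parameters()` whose items have `.device` and `.dtype`, as `torch.nn.Module` parameters do. The `SimpleNamespace` objects are stand-ins so the example runs without loading a model. The dtype is reported alongside the device because the script above loads the model with `torch_dtype=torch.bfloat16`, and a different dtype may also change the timings.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_parameters(module):
    """Count parameters by (device, dtype) for any object exposing .parameters()."""
    counts = Counter()
    for param in module.parameters():
        counts[(str(param.device), str(param.dtype))] += 1
    return dict(counts)


# Stand-in for a real model (hypothetical data, for illustration only).
fake_params = [
    SimpleNamespace(device="cuda:0", dtype="torch.bfloat16"),
    SimpleNamespace(device="cuda:0", dtype="torch.bfloat16"),
    SimpleNamespace(device="cpu", dtype="torch.float32"),
]
fake_model = SimpleNamespace(parameters=lambda: fake_params)

summary = summarize_parameters(fake_model)
for (device, dtype), count in summary.items():
    print(f"{count} parameter tensor(s) on {device} with dtype {dtype}")
```

With a real model this would be called as `summarize_parameters(l2v.model)`, assuming the LLM2Vec wrapper exposes the wrapped model as `.model`. Any entry reporting `cpu` would mean part of the model was not moved.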

vaibhavad commented 4 months ago

For CPU, I got this plot (image: batch_size_vs_encoding_time).

a-green-hand-jack commented 4 months ago

Thank you very much for your reply! I'm sure this calculation is happening on the GPU and not the CPU. I used nvitop to monitor GPU usage during the run:

As for the data2vec module you mentioned, only the load_llm2vec_model function is used here; it loads l2v and the tokenizer. Here is the corresponding code:

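import os
from typing import Tuple

import torch
from llm2vec import LLM2Vec
from peft import PeftModel
from transformers import (
    AutoConfig,
    AutoModel,
    AutoTokenizer,
    PreTrainedTokenizer,
    PreTrainedTokenizerFast,
)

# check_path and modify_json are helpers defined elsewhere (not shown here).

def load_llm2vec_model(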
    llm2vec_model_path: str,
    base_llm_model_path: str,
    max_length: int = 2048,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
) -> Tuple[LLM2Vec, PreTrainedTokenizerFast | PreTrainedTokenizer]:
    if not all(check_path(p) for p in [llm2vec_model_path, base_llm_model_path]):
        raise FileNotFoundError(
            "One or more model paths do not exist or are not directories."
        )

    change_config = {
        "path": os.path.join(llm2vec_model_path, "config.json"),
        "_name_or_path": base_llm_model_path,
        "auto_map": {
            "AutoModel": llm2vec_model_path
            + "--modeling_llama_encoder.LlamaEncoderModel"
        },
    }
    modify_json(json_file_path=change_config["path"], change_dict=change_config)

    change_adapter = {
        "path": os.path.join(llm2vec_model_path, "adapter_config.json"),
        "base_model_name_or_path": base_llm_model_path,
    }
    modify_json(json_file_path=change_adapter["path"], change_dict=change_adapter)

    tokenizer = AutoTokenizer.from_pretrained(
        pretrained_model_name_or_path=llm2vec_model_path
    )
    config = AutoConfig.from_pretrained(llm2vec_model_path)

    model = AutoModel.from_pretrained(
        llm2vec_model_path, config=config, local_files_only=True
    )

    # Loading MNTP (Masked Next Token Prediction) model.
    model = PeftModel.from_pretrained(
        model=model,
        model_id=llm2vec_model_path,
    )
    model.to(device)
    # Report the device only after the move has actually happened.
    print(f"Model moved to device: {device}")
    # Wrapper for encoding and pooling operations
    l2v = LLM2Vec(model, tokenizer, pooling_mode="mean", max_length=max_length)
    print("LLM2Vec model loaded successfully.")
    # print(l2v)

    return l2v, tokenizer

The method used here is a bit strange, because I downloaded the LLaMA model in advance to a local folder.

vaibhavad commented 4 months ago

Can you try encoding with different batch sizes using any sentence-transformers model? As with LLM2Vec, you call the encode function with a list of sentences. This will help us determine whether it is an llm2vec-specific issue or a hardware issue.
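Since both kinds of model are driven through `encode` with a list of sentences and a `batch_size` argument, one harness can time either. The sketch below is one possible version: `benchmark_encode` and `DummyEncoder` are hypothetical names, the warm-up pass and the median over repeats are additions to reduce one-off startup costs and noise, and `sync` is an optional hook (for example `torch.cuda.synchronize`) for encoders that might return before the GPU has finished. The dummy encoder is there only so the example runs without a model.

```python
import statistics
import time


def benchmark_encode(encoder, documents, batch_sizes, repeats=3, warmup=1, sync=None):
    """Median wall-clock seconds of encoder.encode(...) for each batch size.

    sync: optional zero-argument callable run just before reading the clock.
    """
    results = {}
    for batch_size in batch_sizes:
        # Warm-up runs are executed but not timed.
        for _ in range(warmup):
            encoder.encode(documents, batch_size=batch_size)
        samples = []
        for _ in range(repeats):
            if sync is not None:
                sync()
            start = time.perf_counter()
            encoder.encode(documents, batch_size=batch_size)
            if sync is not None:
                sync()
            samples.append(time.perf_counter() - start)
        results[batch_size] = statistics.median(samples)
    return results


class DummyEncoder:
    """Stand-in encoder that records the batch sizes it was called with."""

    def __init__(self):
        self.calls = []

    def encode(self, sentences, batch_size=32):
        self.calls.append(batch_size)
        return [[0.0] for _ in sentences]


dummy = DummyEncoder()
timings = benchmark_encode(dummy, ["a", "b", "c"], batch_sizes=[1, 2])
print(timings)
```

Running the same call with the LLM2Vec model and with a sentence-transformers model on the same documents would give two curves measured the same way.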

a-green-hand-jack commented 4 months ago

> Can you try encoding with different batch sizes using any sentence-transformers model? As with LLM2Vec, you call the encode function with a list of sentences. This will help us determine whether it is an llm2vec-specific issue or a hardware issue.

Thanks for the reminder, this is indeed a good way to test! First, I tried a very small model, sentence-transformers/all-MiniLM-L6-v2 (without using LLM2Vec). Here is the final result:

This result seems consistent with expectations, although there is a strange turning point in the middle.

Next, I tested nreimers/MiniLM-L6-H384-uncased (using LLM2Vec) and got the following results:

Considering the size of these models, it seems the problem may be in the hardware?
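Raw seconds are hard to compare across models of different sizes, so it may help to convert each timing dictionary into documents per second and into speedup relative to the smallest batch size. The sketch below shows one way to do that; `throughput` and `speedup_vs_smallest` are hypothetical helper names, and the numbers in `example_times` are made up purely for illustration, not measurements from this thread.

```python
def throughput(times, num_documents):
    """Documents encoded per second for each batch size."""
    return {bs: num_documents / seconds for bs, seconds in times.items()}


def speedup_vs_smallest(times):
    """Speedup of each batch size relative to the smallest one tested."""
    baseline = times[min(times)]
    return {bs: baseline / seconds for bs, seconds in times.items()}


# Made-up timings in seconds, keyed by batch size (illustrative only).
example_times = {2: 8.0, 4: 4.0, 8: 4.0}

docs_per_second = throughput(example_times, num_documents=128)
speedups = speedup_vs_smallest(example_times)

for bs in example_times:
    print(f"batch_size={bs}: {docs_per_second[bs]:.1f} docs/s, {speedups[bs]:.2f}x")
```

A speedup curve that stays near 1.0 for a large model but rises for a small one would suggest the large model already saturates the GPU at small batch sizes.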

vaibhavad commented 4 months ago

@a-green-hand-jack Can you try running Instructor XL? It is a sentence-transformers model similar in size to Sheared-LLaMA, so it will be a more suitable comparison.

vaibhavad commented 4 months ago

Closing as it is stale. @a-green-hand-jack - Feel free to re-open if you still need help on this.

McGill-NLP / llm2vec

Effect of batch size on run time #47