UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Why does memory increase during training? #2602

Open miloskovacevic68 opened 4 months ago

miloskovacevic68 commented 4 months ago

Hello, I have an (anchor, positive) unlabeled dataset with around 250,000 examples. Here is the code I use to fine-tune the sentence-transformers/multi-qa-mpnet-base-cos-v1 model on a subset of the MS MARCO dataset:

from torch.utils.data import DataLoader
from sentence_transformers import InputExample, SentenceTransformer, losses

def create_anchor_positive_unlabeled_set(marco_csv):
    """Build (anchor, positive) InputExamples from a tab-separated MS MARCO file."""
    training_set = []
    with open(marco_csv) as f:
        for l in f:
            try:
                qaid, q, pos, neg, ans = l.split("\t")
                training_set.append(InputExample(texts=[q, pos]))
            except ValueError:
                # skip malformed lines that do not contain exactly five fields
                pass

    return training_set

def finetune_with_mnr_loss(training_set, model, output_dir, batch_size, epochs, max_seq_len):
    model.max_seq_length = max_seq_len
    train_dataloader = DataLoader(
        training_set,
        shuffle=True,
        batch_size=batch_size
    )

    # In-batch negatives: every other positive in the batch serves as a negative
    train_loss = losses.MultipleNegativesRankingLoss(model=model)
    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=epochs,
        output_path=output_dir,
        show_progress_bar=False,
        use_amp=True
    )

finetune_with_mnr_loss(
    training_set=create_anchor_positive_unlabeled_set(
        "datasets/ms_marco_sr.csv"
    ),
    model=SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-cos-v1", device="cuda"),
    output_dir="models/multi_qa_cos_big",
    batch_size=100,
    epochs=10,
    max_seq_len=256
)

When the training starts, 22 of my 24 GB of VRAM are consumed. The memory consumption increases over the iterations, and at the very end of the first epoch I get an out-of-memory error.

I then tried a DataLoader with an IterableDataset, but the result is the same. Why does the memory increase towards the end of the epoch, and how can I fine-tune this model?

Regards, Milos

ir2718 commented 4 months ago

Hi,

I'm not sure why the memory consumption increases, but you can simply lower the batch size (e.g. to 64), which should lower the memory consumption as well.
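
For example, a minimal sketch of that change against the DataLoader from the original post; the dummy training set here just stands in for the real 250,000 examples:

from torch.utils.data import DataLoader
from sentence_transformers import InputExample

# Tiny stand-in for the real training set from the original post.
training_set = [InputExample(texts=["a query", "a matching passage"])] * 4

# The only change: a smaller batch size, which directly shrinks the per-batch
# tensors that have to fit on the GPU at once.
train_dataloader = DataLoader(training_set, shuffle=True, batch_size=64)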

tomaarsen commented 4 months ago

Hello!

I'm also not quite sure, but I have noticed that the memory usage can sometimes increase. The reason is that during tokenization we pad to the longest sample in the batch, up to the maximum sequence length. So every time you encounter a batch with a text longer than any text in any previous batch, the memory usage goes up: more values for that batch have to be placed on the GPU.

So, if you reach a particularly long text near the end of the training loop, this can result in a memory usage spike.

In short, once a text in one of your batches reaches the maximum sequence length, that batch will be as large as it can possibly be, and that is the maximum memory usage that the training should take.
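
As a quick way to observe this padding behaviour, here is a small sketch (not from the thread); the example texts are made up, and the model is the one being fine-tuned above:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/multi-qa-mpnet-base-cos-v1")
model.max_seq_length = 256

short_batch = ["what is a transformer?", "how does padding work?"]
long_batch = ["what is a transformer?", "a much longer passage " * 100]  # hits the 256-token cap

for name, batch in [("short", short_batch), ("long", long_batch)]:
    features = model.tokenize(batch)
    # Each batch is padded to its longest text, capped at max_seq_length, so
    # input_ids has shape (batch_size, longest_text_in_this_batch) and grows
    # whenever a batch contains a longer text than any batch before it.
    print(name, features["input_ids"].shape)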

miloskovacevic68 commented 4 months ago

> Hi,
>
> I'm not sure why the memory consumption increases, but you can simply lower the batch size (e.g. to 64), which should lower the memory consumption as well.

I would like to keep a larger batch size; the models seem to be better in that case. I'll try CachedMultipleNegativesRankingLoss, which allows for larger batches by using smaller mini-batches.
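
For reference, a minimal sketch of that swap inside the fine-tuning function above; the mini_batch_size value is only an illustrative assumption:

# Drop-in replacement for MultipleNegativesRankingLoss: the loss still sees the
# full batch of in-batch negatives, but activations are computed in smaller
# mini-batches, so a large batch size no longer has to fit in VRAM at once.
train_loss = losses.CachedMultipleNegativesRankingLoss(model=model, mini_batch_size=32)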

Is it possible to train the model on two GPUs?

miloskovacevic68 commented 4 months ago

> Hello!
>
> I'm also not quite sure, but I have noticed that the memory usage can sometimes increase. The reason is that during tokenization we pad to the longest sample in the batch, up to the maximum sequence length. So every time you encounter a batch with a text longer than any text in any previous batch, the memory usage goes up: more values for that batch have to be placed on the GPU.
>
> So, if you reach a particularly long text near the end of the training loop, this can result in a memory usage spike.
>
> In short, once a text in one of your batches reaches the maximum sequence length, that batch will be as large as it can possibly be, and that is the maximum memory usage that the training should take.
>
> * Tom Aarsen

It makes sense. Thanks.

tomaarsen commented 4 months ago

> Is it possible to train the model on two GPUs?

Only via #2449 at this point. This PR will be merged and released as Sentence Transformers v3 soon.