UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

No training and validation loss for MNR and CachedMNR loss function #2827

Open Sc-arsgan opened 1 month ago

Sc-arsgan commented 1 month ago

I am trying to fine-tune a BGE embedding model on my custom dataset. I have used both the MNR and CachedMNR loss functions, but I am not getting any training or validation loss value while training; it prints "No log" instead of a loss value. My training dataset has very few queries, but the data corpus is much larger (mostly comprising hard negatives, more than 90%). Are there any parameters I am missing that would eventually give me loss values? Warmup steps: 14, evaluation steps: 2, epochs: 20.

Also, please suggest a method where I can use the maximum number of hard negatives at each step. I know the number of hard negatives taken into account is based on the batch size, and with CachedMNR I can push the batch size to 3500 (the hard-negative count). But the accuracy metrics in this case are too low; do I need to increase the epochs in this case?

[screenshot: training output showing "No log" for the loss]
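For reference, this is the rough shape of the setup (placeholder model name and paths, not my exact script): CachedMNRL decouples the logical batch size, and therefore the number of in-batch negatives, from GPU memory via its mini_batch_size parameter.

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import CachedMultipleNegativesRankingLoss

# Placeholder model; any BGE-style embedding model could be used here.
model = SentenceTransformer("BAAI/bge-base-en-v1.5")

# CachedMNRL embeds only `mini_batch_size` samples at a time, but computes the
# loss over the full batch, so a very large per-device batch (and thus many
# in-batch negatives) can be used without running out of GPU memory.
loss = CachedMultipleNegativesRankingLoss(model, mini_batch_size=32)

args = SentenceTransformerTrainingArguments(
    output_dir="models/bge-finetuned",   # placeholder path
    per_device_train_batch_size=3500,    # batch size ~ number of in-batch negatives per step
    num_train_epochs=20,
    warmup_steps=14,
)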

bobox2997 commented 1 month ago

> Also, please suggest a method where I can use the maximum number of hard negatives at each step. I know the number of hard negatives taken into account is based on the batch size, and with CachedMNR I can push the batch size to 3500 (the hard-negative count). But the accuracy metrics in this case are too low; do I need to increase the epochs in this case?

Not totally related (I can't help you with the "No log" issue; I can only ask which value you set for logging_steps). About the low accuracy: what learning rate are you using with such a large batch size? Also, less related, but it may be worth trying CachedGISTEmbedLoss instead of CachedMNRL (just in case the low accuracy is the result of some in-batch negative being more similar to the anchor than the positive). Also, are you using the "no duplicates" batch sampler?
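For illustration only, a minimal sketch of what that combination might look like (the guide model and values below are just examples, not specific recommendations):

from sentence_transformers import SentenceTransformer, SentenceTransformerTrainingArguments
from sentence_transformers.losses import CachedGISTEmbedLoss
from sentence_transformers.training_args import BatchSamplers

model = SentenceTransformer("BAAI/bge-base-en-v1.5")   # model being fine-tuned (example)
guide = SentenceTransformer("all-MiniLM-L6-v2")        # small guide model (example)

# The guide model is used to filter out in-batch "negatives" that look more
# similar to the anchor than the positive does.
loss = CachedGISTEmbedLoss(model, guide=guide, mini_batch_size=32)

args = SentenceTransformerTrainingArguments(
    output_dir="models/bge-gist",                # example path
    batch_sampler=BatchSamplers.NO_DUPLICATES,   # the "no duplicates" batch sampler
    logging_steps=5,                             # controls how often the train loss is logged
)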

Sc-arsgan commented 1 month ago

> Not totally related (I can't help you with the "No log" issue; I can only ask which value you set for logging_steps). About the low accuracy: what learning rate are you using with such a large batch size? Also, less related, but it may be worth trying CachedGISTEmbedLoss instead of CachedMNRL (just in case the low accuracy is the result of some in-batch negative being more similar to the anchor than the positive). Also, are you using the "no duplicates" batch sampler?

There was no logging_steps argument that I could see. But on a closer look at the trainer.py script of transformers (which sentence-transformers eventually uses), I found the logging_steps argument being used. I manually added logging_steps=5 to the argument list when initializing SentenceTransformerTrainingArguments in the fit_mixin.py file.

[screenshot: modified SentenceTransformerTrainingArguments with logging_steps=5]

By doing so I am able to get a train loss value, but not a validation loss value.

[screenshot: training output with a train loss but no validation loss]

Regarding the other questions:

  • learning rate: using the default value, 2e-5
  • batch sampler: the default sampler (batch_sampler)

CachedGISTEmbedLoss looks better than the MNR and CachedMNR loss functions; I will definitely try it. Any ideas regarding the batch size for a dataset with few queries and a high number of hard negatives? With a low batch size, only a few hard negatives are considered in a single step.

bobox2997 commented 1 month ago

> There was no logging_steps argument that I could see. But on a closer look at the trainer.py script of transformers (which sentence-transformers eventually uses), I found the logging_steps argument being used.

You can take a look here for an easier overview of all the arguments of sentence-transformers: https://sbert.net/docs/package_reference/sentence_transformer/training_args.html?highlight=sentencetransformertrainingarguments#sentence_transformers.training_args.SentenceTransformerTrainingArguments (I still suggest keeping logging_steps below eval_steps).
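For example, something along these lines (values purely illustrative):

from sentence_transformers import SentenceTransformerTrainingArguments

args = SentenceTransformerTrainingArguments(
    output_dir="models/bge-finetuned",  # example path
    eval_strategy="steps",              # evaluate every `eval_steps` training steps
    eval_steps=10,
    logging_steps=5,                    # keep this at or below eval_steps so a train
                                        # loss is logged before each evaluation
)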

> Regarding the other questions: learning rate: default value, 2e-5; batch sampler: the default sampler (batch_sampler)

Regarding the sampler: usually MNRL (and the cached version) is used with the no-duplicates batch sampler, and there is no downside as far as I know (a duplicate sample in the batch hurts accuracy, since its positive would be used as a negative for the other anchors, including the duplicate one).

Have you tried increasing the learning rate and/or using a cosine or "cosine with restarts" LR scheduler (which can help you manage a higher learning rate)?

Have you tried to decrease batch size?

How many unique queries do you have in your dataset? How is it structured?

What base model are you using?

Edit: Oh, I noticed you are using a multi-dataset approach; I missed that. While using that, I also experienced the "No log" for the validation loss, but I believe it is related to the fact that you get the evaluation loss for each eval dataset you provide. Maybe @tomaarsen can help you better than I can.

Other question: are you using the same loss for all the datasets? Is there a specific reason you chose round robin instead of proportional?

Sc-arsgan commented 1 month ago

> Have you tried to decrease batch size? How many unique queries do you have in your dataset? How is it structured?

Yes, I tried batch size 10 and went up to 32/64. There is no point going beyond that, as the total number of unique queries I have is less than 70. Performance was roughly similar. The issue is with the hard negatives: I have a huge number of hard negatives in the dataset, which the model needs to learn not to use as context for the given unique queries. That is probably the reason for the stagnant accuracy metrics. Structure: {queries: { }, corpus: { }, relevant_docs: { }}
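For reference, that structure maps onto the InformationRetrievalEvaluator roughly like this (toy ids and texts, not my actual data):

from sentence_transformers import SentenceTransformer
from sentence_transformers.evaluation import InformationRetrievalEvaluator

# Toy version of the {queries, corpus, relevant_docs} structure described above.
queries = {"q1": "example query"}                                    # query_id -> query text
corpus = {"d1": "relevant passage", "d2": "hard negative passage"}   # doc_id -> document text
relevant_docs = {"q1": {"d1"}}                                       # query_id -> set of relevant doc_ids

ir_evaluator = InformationRetrievalEvaluator(
    queries=queries,
    corpus=corpus,
    relevant_docs=relevant_docs,
    name="dev",
)

model = SentenceTransformer("all-MiniLM-L6-v2")  # example model
results = ir_evaluator(model)  # returns metrics such as cosine accuracy@k, MRR@10, NDCG@10, ...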

> What base model are you using?

Not a fixed model; I am trying different embedding models. The BGE embedding models are quite good and could give me the best results.

> Oh, I noticed you are using a multi-dataset approach; I missed that. While using that, I also experienced the "No log" for the validation loss, but I believe it is related to the fact that you get the evaluation loss for each eval dataset you provide.

I still don't get why we don't get a loss value for the validation/eval dataset provided.

> Other question: are you using the same loss for all the datasets? Is there a specific reason you chose round robin instead of proportional?

I only have one dataset, and I am using multiple loss functions. For now, CachedGISTEmbedLoss has performed slightly better than the others; I will be looking at MultipleNegativesSymmetricRankingLoss next.
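Swapping losses is essentially a one-line change; a sketch with an example model:

from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesSymmetricRankingLoss

model = SentenceTransformer("BAAI/bge-base-en-v1.5")  # example model

# Like MNRL, but additionally adds the reverse direction (positive -> anchor)
# to the loss, using the other anchors in the batch as in-batch negatives.
loss = MultipleNegativesSymmetricRankingLoss(model)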

bobox2997 commented 1 month ago

> I only have one dataset, and I am using multiple loss functions.

So why is there MultipleDatasetsBatchSamplers.ROUND_ROBIN in your code?

> For now, CachedGISTEmbedLoss has performed slightly better than the others; I will be looking at MultipleNegativesSymmetricRankingLoss next.

I don't think the issue is in the loss function.

Sc-arsgan commented 1 month ago

> So why is there MultipleDatasetsBatchSamplers.ROUND_ROBIN in your code?

It is just a default argument inside fit_mixin.py; it is not being used during training.

> I don't think the issue is in the loss function.

Is it the low count of unique queries?

tomaarsen commented 1 month ago

Hello!

Apologies for the delay, I've been recovering from a surgery this last month.

For some background: Since the v3.0 update for Sentence Transformers, you can train these models in 2 ways:

  1. Via a SentenceTransformerTrainer introduced in v3.0
  2. Via model.fit: This is the <v3.0 approach that still works, but was updated to use SentenceTransformerTrainer behind the scenes.

This SentenceTransformerTrainer accepts 2 types of optional evaluation, which will both be executed (if they are provided) every eval_steps steps:

  • evaluator: One of the evaluators from this documentation, which outputs values such as cosine accuracy@1, cosine precision@5, etc.
  • eval_dataset: A Dataset with the same format as the training data. This is fed through the loss function, producing a validation/evaluation loss.

What you're seeing is that you have an evaluator, but not an eval_dataset: https://github.com/UKPLab/sentence-transformers/blob/c0fc0e8238f7f48a1e92dc90f6f96c86f69f1e02/sentence_transformers/fit_mixin.py#L356-L365

The reason is that this is the backwards-compatible model.fit method, and this method never accepted an evaluation dataset (you couldn't get a validation loss before the v3.0 update). In other words: if you want to get a validation loss, you have to use the new SentenceTransformerTrainer training approach. There is documentation on this training approach here: https://sbert.net/docs/sentence_transformer/training_overview.html
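For illustration, a minimal sketch of that newer approach with an eval_dataset (the model and data below are placeholders, in the same anchor/positive format as the training data):

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder model

# Tiny placeholder datasets; in practice these come from your own data.
train_dataset = Dataset.from_dict({"anchor": ["q1", "q2"], "positive": ["doc1", "doc2"]})
eval_dataset = Dataset.from_dict({"anchor": ["q3"], "positive": ["doc3"]})

loss = MultipleNegativesRankingLoss(model)

args = SentenceTransformerTrainingArguments(
    output_dir="models/finetuned",  # placeholder path
    eval_strategy="steps",          # run evaluation every `eval_steps`
    eval_steps=2,
    logging_steps=1,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,  # providing this is what produces a validation loss
    loss=loss,
)
trainer.train()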

Otherwise, you can rely on the results from the evaluator (e.g. the cosine accuracy@1) to help you figure out when you're e.g. overfitting. Hope this clarifies things a bit.

Sc-arsgan commented 1 month ago

> [...] The reason is that this is the backwards-compatible model.fit method, and this method never accepted an evaluation dataset (you couldn't get a validation loss before the v3.0 update). In other words: if you want to get a validation loss, you have to use the new SentenceTransformerTrainer training approach. [...]

Thanks @tomaarsen, this clarifies my doubt. There is one more thing I wanted to clarify: is there a method that incorporates multiple positive pairs in the MNR loss function (randomly picking a single positive pair for every evaluation step)? I tried to create a dataset from a dict with a list of positive pairs, but it throws a "list index out of range" error.

tomaarsen commented 1 month ago

> Is there a method that incorporates multiple positive pairs in the MNR loss function (randomly picking a single positive pair for every evaluation step)? I tried to create a dataset from a dict with a list of positive pairs, but it throws a "list index out of range" error.

There is not. For context, we don't support lists of positive pairs because lists are variably sized, which is incompatible with proper batching. The solution is to instead convert the 1 anchor with n positives into n (anchor, positive) pairs and to use batch_sampler=BatchSamplers.NO_DUPLICATES in the SentenceTransformerTrainingArguments. This batch sampler ensures that for each sample there is no overlap with any other sample in the batch. See this example:

from datasets import Dataset
from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
)
from sentence_transformers.losses import MultipleNegativesRankingLoss
from sentence_transformers.training_args import BatchSamplers

# 1. Load a model to finetune with 2. (Optional) model card data
model = SentenceTransformer("all-MiniLM-L6-v2")

# 3. Load a dataset to finetune on
raw_data = {
    "a": ["a_pos_1", "a_pos_2", "a_pos_3"],
    "b": ["b_pos_1", "b_pos_2", "b_pos_3", "b_pos_4"],
    "c": ["c_pos_1", "c_pos_2"],
}
data_dict = {
    "anchor": [],
    "positive": [],
}
for anchor, positives in raw_data.items():
    for positive in positives:
        data_dict["anchor"].append(anchor)
        data_dict["positive"].append(positive)
train_dataset = Dataset.from_dict(data_dict)

# 4. Define a loss function
class LoggingMultipleNegativesRankingLoss(MultipleNegativesRankingLoss):
    def forward(self, sentence_features, labels):
        print("Anchors:", self.model.tokenizer.batch_decode(sentence_features[0]["input_ids"], skip_special_tokens=True))
        print("Positives:", self.model.tokenizer.batch_decode(sentence_features[1]["input_ids"], skip_special_tokens=True))
        return super().forward(sentence_features, labels)

loss = LoggingMultipleNegativesRankingLoss(model)

# 5. (Optional) Specify training arguments
args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="models/mpnet-base-all-nli-triplet",
    # Optional training parameters:
    num_train_epochs=1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    learning_rate=2e-5,
    warmup_ratio=0.1,
    fp16=True,  # Set to False if you get an error that your GPU can't run on FP16
    bf16=False,  # Set to True if you have a GPU that supports BF16
    batch_sampler=BatchSamplers.NO_DUPLICATES,  # MultipleNegativesRankingLoss benefits from no duplicate samples in a batch
)

# 7. Create a trainer & train
trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()

This training script:

  1. Uses some anchors ("a", "b", "c") each with a list of positives (note: different lengths are possible).
  2. Converts these anchors into a training dataset of length 9:
    Dataset({
       features: ['anchor', 'positive'],
       num_rows: 9
    })
  3. Uses a custom loss function which is exactly MNRL, except that it also prints which samples are used in each batch.

The output is:

Anchors: ['a', 'b']
Positives: ['a _ pos _ 1', 'b _ pos _ 1']

Anchors: ['a', 'b']
Positives: ['a _ pos _ 2', 'b _ pos _ 2']

Anchors: ['a', 'b']
Positives: ['a _ pos _ 3', 'b _ pos _ 3']

Anchors: ['b', 'c']
Positives: ['b _ pos _ 4', 'c _ pos _ 1']

Anchors: ['c']
Positives: ['c _ pos _ 2']

So, as you can see, we don't have a batch with overlapping values. This means that we don't have e.g. a batch with anchors "a" & "a" and positives "a_pos_1" & "a_pos_2", because then for the first sample the MNRL would use "a_pos_2" as an in-batch negative and the second sample would use "a_pos_1" as an in-batch negative (even though they're both positives!).

For reference, if you don't use this batch sampler, you might get something like this:

Anchors: ['b', 'b']
Positives: ['b _ pos _ 4', 'b _ pos _ 2']

Anchors: ['a', 'c']
Positives: ['a _ pos _ 1', 'c _ pos _ 1']

Anchors: ['b', 'c']
Positives: ['b _ pos _ 3', 'c _ pos _ 2']

Anchors: ['b', 'a']
Positives: ['b _ pos _ 1', 'a _ pos _ 3']

Anchors: ['a']
Positives: ['a _ pos _ 2']

Here, the first batch will be useless/counterproductive.

As a sidenote: using BatchSamplers.NO_DUPLICATES can mean that you lose a small number of samples. For example, if I set per_device_train_batch_size=3 and BatchSamplers.NO_DUPLICATES, then I get:

Anchors: ['a', 'b', 'c']
Positives: ['a _ pos _ 1', 'b _ pos _ 1', 'c _ pos _ 1']

Anchors: ['a', 'b', 'c']
Positives: ['a _ pos _ 2', 'b _ pos _ 2', 'c _ pos _ 2']

Anchors: ['a', 'b']
Positives: ['a _ pos _ 3', 'b _ pos _ 3']

As you can see, this is only 8 samples in total: the ("b", "b_pos_4") pair was discarded because it didn't fit in any of the batches, as the anchor "b" already occurred in each. In practice, you'll have more data and this situation is much less likely: you might discard e.g. 1 sample out of 1000.

I hope this helps!

Sc-arsgan commented 4 weeks ago

> There is not. [...] The solution is to instead convert the 1 anchor with n positives into n (anchor, positive) pairs and to use batch_sampler=BatchSamplers.NO_DUPLICATES in the SentenceTransformerTrainingArguments. [...] I hope this helps!

I understand your logic and reasoning, but for my training dataset I have a list of positives and a list of negatives for each anchor that I want the model to understand. I tried to create 'n' rows of {anchor, positive, negative} with the no-duplicates batch sampler, but for some reason it threw an error when I started training. I am using load_best_model and the IR evaluator for multi-dataset training, which might be the issue (still trying to troubleshoot that).
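For context, the conversion I am describing looks roughly like this (field names and contents are made up):

from datasets import Dataset

# One anchor with curated lists of positives and negatives (made-up data).
raw = {
    "anchor_1": {
        "positives": ["pos_a", "pos_b"],
        "negatives": ["neg_a", "neg_b", "neg_c"],
    },
}

# Flatten into fixed-shape (anchor, positive, negative) rows.
rows = {"anchor": [], "positive": [], "negative": []}
for anchor, pn in raw.items():
    for pos in pn["positives"]:
        for neg in pn["negatives"]:
            rows["anchor"].append(anchor)
            rows["positive"].append(pos)
            rows["negative"].append(neg)

train_dataset = Dataset.from_dict(rows)  # columns: anchor, positive, negative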

My question to you would be: in the sentence-transformers documentation it is mentioned that MNRL is best to use if you only have positive pairs, and that it uses in-batch negatives to make the model learn. Is MNRL still the best loss function if I want to specifically make the model learn about the negatives too? [I know the documentation mentions using a hard negative for each pair, but does it matter that I have multiple hard negatives, given how I am creating those rows?]

Also, is there a way to turn off in-batch negative sampling in MNRL? I would be better off without it, because of the way I created my training data by carefully curating positive and negative lists for each anchor.