UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Can't start training with a SentenceLabelDataset due to error `'SentenceLabelDataset' object has no attribute 'column_names'` #3082

Open HenningDinero opened 1 day ago

HenningDinero commented 1 day ago

Say I have 4 targets/clusters, e.g. ["car", "airplane", "boat", "train"], with 1000 sentences for each class, and I want to fine-tune a model to create similar embeddings within each class.

As far as I can understand, that is where the SentenceLabelDataset could be used, or, looking at https://github.com/UKPLab/sentence-transformers/issues/2920, the GroupByLabelBatchSampler, or maybe just the "usual way" of using MNRL and creating anchor/positive pairs within each class (although that would create negatives from the same class as well, which is why I'll try the other approaches). Currently I'm trying the SentenceLabelDataset, but I'm struggling to start the training. Please find below some (pseudo) code:

# For creating the data
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sentence_transformers import InputExample
from sentence_transformers.datasets import SentenceLabelDataset


def _create_input_example(df: pd.DataFrame) -> InputExample:
    # df is one group from the groupby below; df.name is the group's label
    label = df.name
    return InputExample(guid=label, texts=df["Documents"].tolist(), label=label)


def get_main_data() -> tuple[SentenceLabelDataset, SentenceLabelDataset]:
    data = get_data()  # my own loader, returns a DataFrame
    le = LabelEncoder()
    le.fit(data["TransportType"])
    data["label"] = le.transform(data["TransportType"])

    training_data = data.query("TrainTest=='TRAIN'")
    val_data = data.query("TrainTest=='VALIDATE'")

    # One InputExample per label, holding all documents for that label
    training_examples = training_data.groupby("label")[["Documents"]].apply(_create_input_example)
    val_examples = val_data.groupby("label")[["Documents"]].apply(_create_input_example)

    train_dataset = SentenceLabelDataset(training_examples, samples_per_label=32, with_replacement=True)
    val_dataset = SentenceLabelDataset(val_examples, samples_per_label=32, with_replacement=True)

    # train_dataloader = NoDuplicatesDataLoader(train_dataset, batch_size=32)
    # val_dataloader = NoDuplicatesDataLoader(val_dataset, batch_size=32)

    return train_dataset, val_dataset

and the training:

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
    util,
)

steps = 10
train_data, val_data = get_main_data()
model = SentenceTransformer("intfloat/multilingual-e5-small")
loss = losses.MultipleNegativesRankingLoss(model, scale=20.0, similarity_fct=util.cos_sim)
training_args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="./sbert_fitted/",
    # Optional training parameters:
    num_train_epochs=1,
    eval_steps=steps,
    eval_strategy="steps",
    save_strategy="steps",
    save_steps=steps,
    logging_steps=steps,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=training_args,
    train_dataset=train_data,
    eval_dataset=val_data,
    loss=loss,
)
trainer.train()

This one throws the error `AttributeError: 'SentenceLabelDataset' object has no attribute 'column_names'`. I have also tried using the NoDuplicatesDataLoader, but that gives the error `AttributeError: 'NoDuplicatesDataLoader' object has no attribute 'column_names'`.

So, two questions:

1) Is the creation of the labeled dataset correct, i.e. simply creating one InputExample for each target where the texts are all the documents for the given target?
2) Can you see where I'm going wrong with the errors?

tomaarsen commented 4 hours ago

Hello!

Apologies for the confusion here! Sentence Transformers v3 refactored the training approach, and the old approach still exists (for now) so people can still use that if they prefer. What's happening here is that you're using components of the new training approach (SentenceTransformerTrainer, SentenceTransformerTrainingArguments) together with components of the old approach (SentenceLabelDataset, InputExample).

Instead, my recommendation is to move fully to the new approach. Let's start with a loss function. If we have sentences with class labels, then we can use one of these loss functions: [image: table of losses that accept single sentences with class labels] (see the Loss Overview docs).

For example, the BatchAllTripletLoss takes single sentences and class labels as inputs. The SentenceTransformerTrainer then expects the training/evaluation dataset to be a Dataset from the datasets package with 2 columns. As explained in the Dataset Format docs, the class labels must be in a column called label or score, while the texts can be in a column with any name.

So, we'll get something like:

# E.g. 0: sports, 1: economy, 2: politics
train_dataset = Dataset.from_dict({
    "sentence": [
        "He played a great game.",
        "The stock is up 20%",
        "They won 2-1.",
        "The last goal was amazing.",
        "They all voted against the bill.",
    ],
    "label": [0, 1, 0, 0, 2],
})

Then, we can follow the BatchAllTripletLoss recommendation of using batch_sampler=BatchSamplers.GROUP_BY_LABEL. This ensures that each batch contains at least 2 examples per class, which is what makes the loss most useful.

A minimal script should become something like:

from sentence_transformers import (
    SentenceTransformer,
    SentenceTransformerTrainer,
    SentenceTransformerTrainingArguments,
    losses,
)
from sentence_transformers.training_args import BatchSamplers
from datasets import Dataset

steps = 10
model = SentenceTransformer("microsoft/mpnet-base")
# E.g. 0: sports, 1: economy, 2: politics
train_dataset = Dataset.from_dict({
    "sentence": [
        "He played a great game.",
        "The stock is up 20%",
        "They won 2-1.",
        "The last goal was amazing.",
        "They all voted against the bill.",
    ],
    "label": [0, 1, 0, 0, 2],
})
loss = losses.BatchAllTripletLoss(model)

args = SentenceTransformerTrainingArguments(
    # Required parameter:
    output_dir="./sbert_fitted/",
    # Optional training parameters:
    num_train_epochs=1,
    batch_sampler=BatchSamplers.GROUP_BY_LABEL,
    # eval_steps/eval_strategy are left out here because no eval_dataset is passed
    save_strategy="steps",
    save_steps=steps,
    logging_steps=steps,
)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    loss=loss,
)
trainer.train()
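To map this onto the original DataFrame pipeline, here is a minimal sketch of building such a Dataset directly from pandas. It assumes, as in the issue's pseudocode, a get_data() helper returning a DataFrame with Documents, TransportType, and TrainTest columns; those names are taken from the issue, not from the library:

# Sketch only: get_data() and the column names come from the issue's
# pseudocode; they are assumptions, not part of the library API.
from datasets import Dataset
from sklearn.preprocessing import LabelEncoder

data = get_data()
le = LabelEncoder()
data["label"] = le.fit_transform(data["TransportType"])

# Keep only the two columns the trainer needs: one text column (any name is
# fine) and the integer class labels in a column called "label".
train_df = data.query("TrainTest=='TRAIN'").rename(columns={"Documents": "sentence"})
train_dataset = Dataset.from_pandas(train_df[["sentence", "label"]], preserve_index=False)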

Afterwards, you can also experiment with the common MultipleNegativesRankingLoss, but as shown in the Loss Overview, you'll need for example (anchor, positive) pairs or (anchor, positive, negative) triplets. You can create this kind of data from yours by going e.g.:

for each class:
    for each sentence in class:
        anchor = sentence
        positive = random sentence from the same class
        negative = random sentence from any other class

and then you'll have a bunch of triplets; see the runnable sketch below. Then you can use BatchSamplers.NO_DUPLICATES, because it can be bad if a batch contains the same text multiple times (a duplicate can act as a false negative within the batch). There's a decent chance that this form of training results in better performance, but I can't say for sure.
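As a runnable version of that sketch, something like the following could work, assuming the flat sentence/label lists from the train_dataset above and at least 2 classes (build_triplets is a hypothetical helper written for this issue, not a library function):

import random
from collections import defaultdict

from datasets import Dataset


def build_triplets(sentences: list[str], labels: list[int], seed: int = 42) -> Dataset:
    """Hypothetical helper: turn (sentence, label) pairs into (anchor, positive, negative) triplets."""
    rng = random.Random(seed)
    by_label: dict[int, list[str]] = defaultdict(list)
    for sentence, label in zip(sentences, labels):
        by_label[label].append(sentence)

    anchors, positives, negatives = [], [], []
    for label, group in by_label.items():
        # All sentences belonging to any other class are negative candidates
        other = [s for lbl, grp in by_label.items() if lbl != label for s in grp]
        for anchor in group:
            # Positive: another sentence from the same class; fall back to the
            # anchor itself if the class only has one sentence
            same = [s for s in group if s != anchor] or [anchor]
            anchors.append(anchor)
            positives.append(rng.choice(same))
            negatives.append(rng.choice(other))
    return Dataset.from_dict({"anchor": anchors, "positive": positives, "negative": negatives})


triplet_dataset = build_triplets(train_dataset["sentence"], train_dataset["label"])

The resulting triplet_dataset can then be passed as train_dataset to the SentenceTransformerTrainer together with MultipleNegativesRankingLoss.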

HenningDinero commented 2 hours ago

Thank you very much! Yes, I might've mixed v2 and v3 (I had some v2 training scripts that I tried to adapt to v3, and I might've forgotten something here and there).

I'll give both of your suggestions a go and see how it goes :)