UKPLab / sentence-transformers

Multilingual Sentence & Image Embeddings with BERT
https://www.SBERT.net
Apache License 2.0

object has no attribute 'cache_files' in trainer sentence_transformers 3.0 #2706

Open claracaste opened 1 month ago

claracaste commented 1 month ago

Hi, I am using sentence_transformers version 3.0.0. I created a dataset from a pandas DataFrame:


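# Imports are not shown in the original post; torch's Dataset is assumed here.
from torch.utils.data import Dataset
from sentence_transformers import SentenceTransformerTrainer

class PairsDataset(Dataset):
    def __init__(self, dataframe):
        self.dataframe = dataframe

    def __getitem__(self, index):
        # Return one row as a dict, exposing 'true_label' under the key 'score'
        return {
            'sentence1': self.dataframe.loc[index, 'sentence1'],
            'sentence2': self.dataframe.loc[index, 'sentence2'],
            'score': self.dataframe.loc[index, 'true_label'],
        }

    def __len__(self):
        return len(self.dataframe)

ds_train = PairsDataset(df_train)

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=ds_train,
    eval_dataset=ds_val,
    loss=train_loss,
    evaluator=dev_evaluator,
)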

When I run trainer.train(), I get the error AttributeError: 'PairsDataset' object has no attribute 'cache_files'. I see in the source code that it is trying to read some metadata from my dataset, which it doesn't find. How can I overcome this problem?

tomaarsen commented 1 month ago

Hello!

I would recommend converting the pandas DataFrame into a datasets.Dataset (see the datasets docs). You can do this with Dataset.from_pandas:

from datasets import Dataset

ds_train = Dataset.from_pandas(df_train)
ds_train = ds_train.rename_columns({"true_label": "score"})

trainer = SentenceTransformerTrainer(
    model=model,
    args=args,
    train_dataset=ds_train,
    eval_dataset=ds_val,
    loss=train_loss,
    evaluator=dev_evaluator,
)

Hope this helps!

claracaste commented 1 month ago

Thanks @tomaarsen