UKPLab / sentence-transformers

State-of-the-Art Text Embeddings
https://www.sbert.net
Apache License 2.0

Training cross-encoder with online pair-sampling? #1057

Open Zacchaeus00 opened 3 years ago

Zacchaeus00 commented 3 years ago

Hi, awesome work. I read the training tutorial for the cross-encoder (https://www.sbert.net/examples/training/cross-encoder/README.html). It creates the dataset from a static file in the format (text1, text2, label):

import csv
import gzip

from sentence_transformers import InputExample

# nli_dataset_path is defined earlier in the tutorial script
label2int = {"contradiction": 0, "entailment": 1, "neutral": 2}
train_samples = []
dev_samples = []
with gzip.open(nli_dataset_path, 'rt', encoding='utf8') as fIn:
    reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_NONE)
    for row in reader:
        label_id = label2int[row['label']]
        # Each row is a sentence pair, routed to train or dev by its 'split' column
        if row['split'] == 'train':
            train_samples.append(InputExample(texts=[row['sentence1'], row['sentence2']], label=label_id))
        else:
            dev_samples.append(InputExample(texts=[row['sentence1'], row['sentence2']], label=label_id))

However, I have a dataset in the format (text, label). I want to sample two rows from it, (text1, label1) and (text2, label2), and generate a training sample like (text1, text2, 1(label1 < label2)), where the last element is 1 if label1 < label2 and 0 otherwise. I want to do this online during training. Is there any way to make this work with the sentence-transformers cross-encoder? Thanks

nreimers commented 3 years ago

Yes. You pass a DataLoader to the fit method, so you can easily create your own Dataset and/or DataLoader.

See here for how data loading is handled in PyTorch: https://pytorch.org/docs/stable/data.html
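As a rough illustration of that suggestion, here is a minimal sketch of a map-style dataset that draws a fresh random pair on every access and labels it 1 if label1 < label2, else 0. The class name `OnlinePairDataset`, the `pairs_per_epoch` and `seed` parameters, and the toy `rows` are hypothetical and not part of sentence-transformers. The fallback `InputExample` is only a stand-in so the sketch runs without the library installed.

```python
import random

try:
    from sentence_transformers import InputExample
except ImportError:
    # Stand-in (assumption) so the sketch runs without the library installed.
    class InputExample:
        def __init__(self, texts, label):
            self.texts = texts
            self.label = label


class OnlinePairDataset:
    """Hypothetical map-style dataset: every access samples a new random pair."""

    def __init__(self, rows, pairs_per_epoch, seed=None):
        self.rows = list(rows)                  # (text, label) tuples
        self.pairs_per_epoch = pairs_per_epoch  # nominal epoch length
        self.rng = random.Random(seed)

    def __len__(self):
        return self.pairs_per_epoch

    def __getitem__(self, index):
        # The index is ignored on purpose: two distinct rows are drawn at random.
        (text1, label1), (text2, label2) = self.rng.sample(self.rows, 2)
        return InputExample(texts=[text1, text2], label=int(label1 < label2))


# Toy data, for illustration only.
rows = [("short answer", 1), ("decent answer", 2), ("great answer", 3)]
dataset = OnlinePairDataset(rows, pairs_per_epoch=4, seed=0)
for i in range(len(dataset)):
    example = dataset[i]
    print(example.texts, example.label)

# Possible usage with the fit API discussed in this thread
# (requires torch and sentence-transformers; the model name is an example):
# from torch.utils.data import DataLoader
# from sentence_transformers.cross_encoder import CrossEncoder
# model = CrossEncoder("distilroberta-base", num_labels=1)
# loader = DataLoader(dataset, shuffle=True, batch_size=16)
# model.fit(train_dataloader=loader, epochs=1)
```

Because `__getitem__` ignores the index, each epoch sees different pairs. The sketch assumes the DataLoader's default `num_workers=0`; with several workers, each worker would get its own copy of the random generator and might produce the same pairs unless it is reseeded per worker.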