4AI / BeLLM

Code for BeLLM: Backward Dependency Enhanced Large Language Model for Sentence Embeddings (NAACL2024)
https://arxiv.org/abs/2311.05296
MIT License

Inquiry About the Data Source for the Model Training Set #2

Open AnonymXXXXX opened 4 months ago

AnonymXXXXX commented 4 months ago

Hello,

I have been exploring your code and datasets as described in the README and downloaded the dataset used for training BeLLM from https://huggingface.co/datasets/SeanLee97/all_nli_angle_format_b/tree/main, named SeanLee97/all_nli_angle_format_b.

After analyzing the data, I noticed that this dataset contains 480,862 rows and 3 columns. In comparison, previous works like PromptEOL utilized the simcse_nli dataset (comprising SNLI and MNLI), which totals 275,602 rows x 3 columns.

I am curious about the composition of the all_nli_angle_format_b dataset and am wondering why there is a significant increase in the data amount. Could you please share some insights on how this dataset was compiled and what makes up the additional data?

Additionally, have you or your team tested the performance of PromptEOL with a larger dataset size of 480,862 rows x 3 columns? I am interested in understanding how the increase in dataset size might influence the model's performance.

Thank you for your time and assistance. I look forward to your response!

SeanLee97 commented 4 months ago

Hi @AnonymXXXXX, we did not use simcse_nli. We directly transformed AllNLI (including MultiNLI and SNLI) into triples. The AllNLI dataset is provided by sentence-transformers.

Here is the process script:

import os
import csv
import gzip
import json
import random

from tqdm import tqdm
from sentence_transformers import util
from datasets import load_dataset

save_path = 'all_nli.B.jsonl'
nli_dataset_path = "data/AllNLI.tsv.gz"
if not os.path.exists(nli_dataset_path):
    util.http_get("https://sbert.net/datasets/AllNLI.tsv.gz", nli_dataset_path)

def add_to_samples(sent1, sent2, label):
    """Group each hypothesis under its premise, keyed by NLI label."""
    if sent1 not in train_data:
        train_data[sent1] = {"contradiction": set(), "entailment": set(), "neutral": set()}
    train_data[sent1][label].add(sent2)

# Map: premise -> {label -> set of hypotheses}, built from the train split
train_data = {}
with gzip.open(nli_dataset_path, "rt", encoding="utf8") as fIn:
    reader = csv.DictReader(fIn, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["split"] == "train":
            sent1 = row["sentence1"].strip()
            sent2 = row["sentence2"].strip()

            add_to_samples(sent1, sent2, row["label"])
            # add_to_samples(sent2, sent1, row["label"])  # Also add the opposite

# Expand each premise into (text, positive, negative) triples:
# every entailment is paired with every contradiction
data = []
for sent1, others in tqdm(train_data.items()):
    negs = list(others['contradiction'])
    if not negs:
        continue
    poss = list(others['entailment'])
    if not poss:
        continue
    for pos in poss:
        for neg in negs:
            data.append({'text': sent1, 'positive': pos, 'negative': neg})

print('size:', len(data))
random.shuffle(data)

with open(save_path, 'w') as writer:
    for obj in data:
        writer.write(json.dumps(obj, ensure_ascii=False) + '\n')

I know SimCSE also uses MultiNLI + SNLI, but I am not sure how simcse_nli was collected. We'd like to test it, but it might take some time, as our computing resources are limited right now; it requires an A100 (80GB) to support large batch sizes.

SeanLee97 commented 4 months ago

Sorry for that; we forgot to make the additional data public. Will do it later.

We collected the additional data as follows:

1) collect all texts from MultiNLI + SNLI
2) index all texts into Elasticsearch (ES)
3) retrieve the top 30 similar texts for each text using ES
4) take the retrieved texts as hard negatives
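The four steps above can be sketched without a live ES cluster. The snippet below is a minimal stand-in, not the authors' actual pipeline: a tiny pure-Python TF-IDF cosine retriever takes the place of Elasticsearch's lexical scoring, and the corpus sentences are hypothetical.

```python
import math
from collections import Counter

# Step 1 stand-in: a hypothetical corpus of NLI texts
corpus = [
    "a man is playing a guitar",
    "a man plays an acoustic guitar on stage",
    "a woman is cooking dinner",
    "the man is asleep on the couch",
]

def tf_idf_vectors(texts):
    """Step 2 stand-in: 'index' the texts as sparse TF-IDF vectors."""
    tokenized = [t.split() for t in texts]
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    n = len(texts)
    return [
        {w: tf[w] * math.log((1 + n) / (1 + df[w])) for w in tf}
        for tf in (Counter(toks) for toks in tokenized)
    ]

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = tf_idf_vectors(corpus)

def hard_negatives(query_idx, top_k=30):
    """Steps 3-4: take the top-k most similar texts (excluding the query) as hard negatives."""
    scores = [(cosine(vecs[query_idx], v), i) for i, v in enumerate(vecs) if i != query_idx]
    scores.sort(reverse=True)
    return [corpus[i] for _, i in scores[:top_k]]

print(hard_negatives(0, top_k=2))
```

In the real pipeline, ES handles indexing and retrieval at corpus scale; the point of the sketch is only that lexically similar but non-entailed sentences make good hard negatives.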

AnonymXXXXX commented 4 months ago

Thanks for the details. In addition, should the version number of angle-emb in requirements.txt be 0.3.10 instead of 3.1.0?