About the NLP part of the experiments

GCYZSL / MoLA

About the NLP part of the experiments #8

Closed login256 closed 3 months ago

login256 commented 3 months ago

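The MRPC dataset used in the experiments is a sequence classification task. Reading the code, I see no modifications to perf and nothing changed for sequence classification. How was the dataset converted for this part?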

GCYZSL commented 3 months ago

You can refer to Data Preparation Scripts, which describes the required format in detail. I processed the data as follows:

# data_sample is one MRPC record with "sentence1", "sentence2", and "label" (0 or 1).
hypothesis = data_sample["sentence1"]
premise = data_sample["sentence2"]
# Map the integer label to text: 0 -> "not equivalent", 1 -> "equivalent".
answer = ["not equivalent", "equivalent"][data_sample["label"]]
print(data_sample["label"], answer)
# Rebuild the sample in the instruction-style format.
data_sample = {}
data_sample['input'] = ""
data_sample['instruction'] = f"Tell me if the statements equivalent, not equivalent.\nSentence 1: {hypothesis}\nSentence 2: {premise}\n"
data_sample['output'] = f"Answer: {answer}."
data_sample['answer'] = answer

Note that the evaluation code also needs to process the ground truth (gt) accordingly. Thanks!

GCYZSL commented 3 months ago

You can process the MRPC samples by following the instructions in the README. We processed the data as follows:

# data_sample is one MRPC record with "sentence1", "sentence2", and "label" (0 or 1).
hypothesis = data_sample["sentence1"]
premise = data_sample["sentence2"]
# Map the integer label to text: 0 -> "not equivalent", 1 -> "equivalent".
answer = ["not equivalent", "equivalent"][data_sample["label"]]
print(data_sample["label"], answer)
# Rebuild the sample in the instruction-style format.
data_sample = {}
data_sample['input'] = ""
data_sample['instruction'] = f"Tell me if the statements equivalent, not equivalent.\nSentence 1: {hypothesis}\nSentence 2: {premise}\n"
data_sample['output'] = f"Answer: {answer}."
data_sample['answer'] = answer

Please note that the corresponding evaluation script should be modified as well.
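For readers wondering what the evaluation-side change might look like, here is a minimal sketch. It assumes the model's generations follow the `Answer: {answer}.` format shown above. The names `LABELS`, `parse_answer`, and `accuracy` are hypothetical and are not taken from the MoLA repository.

```python
import re

# Index matches the original MRPC integer label: 0 -> "not equivalent", 1 -> "equivalent".
LABELS = ["not equivalent", "equivalent"]


def parse_answer(generated):
    """Return "equivalent", "not equivalent", or None if no label is found."""
    # Look only at the text after the last "answer:" marker, if there is one.
    text = generated.lower().rpartition("answer:")[2]
    # Try the longer label first, because "not equivalent" contains "equivalent".
    match = re.search(r"not equivalent|equivalent", text)
    return match.group(0) if match else None


def accuracy(generations, samples):
    """Compare parsed predictions with the ground truth of each sample."""
    if not samples:
        return 0.0
    correct = 0
    for generated, sample in zip(generations, samples):
        gt = sample["answer"]
        # Accept a raw integer label as well and convert it to its text form.
        if isinstance(gt, int):
            gt = LABELS[gt]
        correct += parse_answer(generated) == gt
    return correct / len(samples)
```

Matching `not equivalent` before `equivalent` matters here: a plain substring check for `equivalent` would count every negative prediction as positive.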

GCYZSL commented 3 months ago

The link for downloading processed data is: https://drive.google.com/file/d/1-AHDmTKnds9JTJPFr1CFCqneGjj1HB0M/view?usp=sharing