bazingagin / npc_gzip

Code for Paper: “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors
MIT License

Train - test leakage #13

Open YannDubs opened 1 year ago

YannDubs commented 1 year ago

Hi, I looked into the OOD results, and many examples in the test sets appear to be in the train sets. E.g., DengueFilipino's train and test sets are identical, and KirundiNews has 90% overlap...

To reproduce:

from data import load_filipino, load_kirnews, load_kinnews, load_swahili

# each loader returns (train, test) lists of hashable example tuples
dataloaders = dict(DengueFilipino=load_filipino,
                   KirundiNews=load_kirnews,
                   KinyarwandaNews=load_kinnews,
                   SwahiliNews=load_swahili)

for data_name, loader in dataloaders.items():
    train, test = loader()
    overlap = 1 - len(set(test) - set(train)) / len(set(test))
    print(data_name, f"train<->test overlap: {overlap * 100:.1f}%")
DengueFilipino train<->test overlap: 100.0%
KirundiNews train<->test overlap: 90.4%
KinyarwandaNews train<->test overlap: 23.8%
SwahiliNews train<->test overlap: 0.5%
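As a minimal follow-up sketch (not from the original issue): to evaluate without leakage, one could drop every test example that also appears verbatim in the train split, assuming the loaders return lists of hashable tuples as above.

from data import load_kirnews

train, test = load_kirnews()
train_set = set(train)

# keep only test examples that never appear verbatim in the train split
clean_test = [example for example in test if example not in train_set]
print(f"kept {len(clean_test)} of {len(test)} test examples")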
kts commented 1 year ago

Wow. It looks like the issue is in the original huggingface datasets. Lots of duplicates too.

Here are the stats using only huggingface datasets.load_dataset():

from datasets import load_dataset
from tabulate import tabulate

# dataset info from data.py:
# [name, args to load_dataset(), keys used on each item]
ds_info = [

    ("filipino",
     ['dengue_filipino'],
     ['text', 'absent', 'dengue', 'health', 'mosquito', 'sick']),

    ("kirnews",
     ["kinnews_kirnews","kirnews_cleaned"],
     ['label','title','content']),

    ("kinnews",
     ["kinnews_kirnews", "kinnews_cleaned"],
     ['label','title','content']),

    ("swahili",
     ['swahili_news'],
     ['label','text']),

]

lines = []
for name, args, keys in ds_info:

    ds = load_dataset(*args)

    # convert each split to a list of hashable tuples:
    train = [tuple(item[key] for key in keys) for item in ds['train']]
    test  = [tuple(item[key] for key in keys) for item in ds['test']]

    lines.append(name)

    n_overlap = len(set(train).intersection(test))
    lines.append(tabulate([
        ("train:",        len(train)),
        ("train unique:", len(set(train))),
        ("test:",         len(test)),
        ("test unique:",  len(set(test))),

        ("train/test overlap:", n_overlap,
         "%.1f%%" % (100.0 * n_overlap / len(set(test)))),
    ]))
    lines.append("\n")

print("\n".join(lines))
filipino
-------------------  ----  ------
train:               4015
train unique:        3947
test:                4015
test unique:         3947
train/test overlap:  3947  100.0%
-------------------  ----  ------

kirnews
-------------------  ----  -----
train:               3689
train unique:        1791
test:                 923
test unique:          698
train/test overlap:   631  90.4%
-------------------  ----  -----

kinnews
-------------------  -----  -----
train:               17014
train unique:         9199
test:                 4254
test unique:          2702
train/test overlap:    643  23.8%
-------------------  -----  -----

swahili
-------------------  -----  ----
train:               22207
train unique:        22207
test:                 7338
test unique:          7338
train/test overlap:     34  0.5%
-------------------  -----  ----
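One caveat: the counts above rely on exact tuple equality, so they are a lower bound; pairs that differ only in casing or whitespace are not counted. A minimal sketch (not from the original thread) of a normalized comparison, reusing the train/test lists built in the loop above and a hypothetical normalize helper:

import re

def normalize(example):
    # lowercase and collapse whitespace in every string field of a tuple
    return tuple(
        re.sub(r"\s+", " ", field).strip().lower() if isinstance(field, str) else field
        for field in example
    )

train_norm = {normalize(example) for example in train}
test_norm = {normalize(example) for example in test}
n_overlap = len(train_norm & test_norm)
print("normalized train/test overlap: %.1f%%" % (100.0 * n_overlap / len(test_norm)))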
ljvmiranda921 commented 1 year ago

I suggest the authors download the DengueFilipino dataset from the original link instead of Hugging Face. I'm also working on some Tagalog pipelines, and I noticed the same upload issue (the train and test splits are a 1:1 match).

I wrote a parser and some personal notes (in the file docstring) here. The parser uses some spaCy primitives, but feel free to use it as you see fit: https://github.com/ljvmiranda921/calamanCy/blob/master/reports/emnlp2023/benchmark/scripts/process_dengue.py

bazingagin commented 1 year ago

Hi @YannDubs, wow, thanks for pointing this out!!! I was only aware of the dataset issue with DengueFilipino. Thanks @kts for verifying the huggingface dataset issue. People should be aware of that and use the original links for those datasets. I will redo the experiment on Filipino using the original link @ljvmiranda921 provided, and I will also check whether the KirundiNews overlap exists in the original dataset. Thanks again!

bazingagin commented 1 year ago
[Screenshot: results using the original DengueFilipino dataset]

Here are the results using the original DengueFilipino dataset. I also checked the original Kirundi dataset; it still has the data contamination issue.
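For reference, a hedged sketch of what such an overlap check could look like on raw CSV exports of the original data; the file names and column list below are hypothetical placeholders, not the dataset's actual layout:

import pandas as pd

# hypothetical file names and columns; adjust to the original dataset's layout
train_df = pd.read_csv("kirnews_train.csv")
test_df = pd.read_csv("kirnews_test.csv")

cols = ["label", "title", "content"]
train_rows = set(train_df[cols].itertuples(index=False, name=None))
test_rows = set(test_df[cols].itertuples(index=False, name=None))

n_overlap = len(train_rows & test_rows)
print(f"train/test overlap: {100.0 * n_overlap / len(test_rows):.1f}%")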

maoxuxu commented 4 months ago

> I suggest the authors download the DengueFilipino dataset from the original link instead of Hugging Face. I'm also working on some Tagalog pipelines, and I noticed the same upload issue (the train and test splits are a 1:1 match).
>
> I wrote a parser and some personal notes (in the file docstring) here. The parser uses some spaCy primitives, but feel free to use it as you see fit: https://github.com/ljvmiranda921/calamanCy/blob/master/reports/emnlp2023/benchmark/scripts/process_dengue.py

Hello, could you provide the Filipino dataset? I cannot download it from the original link. Thank you very much.