OpenThaiGPT / openthaigpt-pretraining

Apache License 2.0
21 stars 10 forks source link

[Super AI] Create Script to search for testset leakage on Trainset #181

Open ArthurMinovsky opened 1 year ago

ArthurMinovsky commented 1 year ago

Description: We want to make sure our selected testset is not overlapped with trainset. We can search for testset leakage on OSCAR by using retrieval model (ex. https://arxiv.org/abs/2005.14165)

Instructions:

Deliverables:

new5558 commented 1 year ago

Update

Checklist:

Final workflow:

  1. Load each evaluation dataset in #182 in memory
  2. Search for contamination from the huggingface dataset (we can use mC4 for now and switch to our PantipOSCAR dataset when #174 is finished)
  3. Remove contaminated dataset and save to huggingface dataset in the same format