How to filter the instruction tuning data?

X-PLUG / mPLUG-Owl

mPLUG-Owl: The Powerful Multi-modal Large Language Model Family

https://www.modelscope.cn/studios/damo/mPLUG-Owl

MIT License

How to filter the instruction tuning data? #39

Closed lovecambi closed 1 year ago

lovecambi commented 1 year ago

According to the comment in the config file, the size of each dataset (# [50997(alpaca), 155562(llava), 53456(quora), 101466(sharegpt)] 361481 ) differs from the size of the original dataset.

Is there any code or script used to filter the data?

MAGAer13 commented 1 year ago

Hi, we did not filter the dataset. We held out some data for validation (~1k from each dataset), so each dataset is slightly smaller than the original one.
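
For readers who want to reproduce a similar split, here is a minimal sketch of holding out a fixed number of examples per dataset. It is not the script used for mPLUG-Owl: the function name `hold_out`, the `n_val=1000` default, the random seed, and the use of a random (rather than, say, tail) split are all assumptions made for illustration. The only figures taken from this thread are the training-set sizes in the config comment.

```python
import random


def hold_out(examples, n_val=1000, seed=0):
    """Split a list into (train, val), holding out n_val items for validation.

    Hypothetical helper: the thread only says ~1k examples per dataset
    were held out, not how they were chosen.
    """
    if n_val > len(examples):
        raise ValueError("n_val exceeds the number of examples")
    rng = random.Random(seed)
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    val_idx = set(indices[:n_val])
    # Preserve the original ordering inside each split.
    train = [ex for i, ex in enumerate(examples) if i not in val_idx]
    val = [ex for i, ex in enumerate(examples) if i in val_idx]
    return train, val


# Training-set sizes quoted in the config comment from this thread.
train_sizes = {
    "alpaca": 50997,
    "llava": 155562,
    "quora": 53456,
    "sharegpt": 101466,
}
# The per-dataset sizes add up to the total given in the same comment.
total = sum(train_sizes.values())
```

Applying `hold_out` to each raw dataset in turn gives a training split that is smaller than the original by exactly `n_val` examples, which is the kind of size gap asked about above.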