yaolu-zjut opened this issue 1 week ago
You can take a look at the Bagel dataset, which incorporates the train splits of plenty of tasks: https://huggingface.co/datasets/jondurbin/bagel-v0.5
Just keep in mind that test and validation splits are not to be used for training; that's contamination.
Hi fblgit, thank you for your answer. That's very kind of you, but Bagel cannot meet our needs. We want to train on each task separately rather than mixing multiple datasets, in order to improve model performance. So we're hoping for a general dataset-processing framework that can do this.
A. Use `datasets.map` to split it out. --> fastest way imho.
B. You can go through the citations of Bagel, where each dataset is cited and available on the HF Hub, so you can access each one separately.
C. You can check lm-eval/tasks/$folder
each task has a YAML with the dataset/split used.
Hi, I appreciate your solid work. Since there are many tasks, I wonder if you could provide code to train on these datasets.