yaolu-zjut opened this issue 1 week ago
You can take a look at the Bagel dataset, which incorporates the train splits of plenty of tasks: https://huggingface.co/datasets/jondurbin/bagel-v0.5
Just keep in mind that test and validation splits are not to be used for training; that's contamination.
Hi fblgit, thank you for your answer. That's very kind of you, but Bagel cannot meet our needs. We want to train on each task separately rather than mixing multiple datasets, in order to improve model performance. So we're hoping for a general dataset-processing framework that can do this.
A. Use `datasets.map` to split it out. --> fastest way imho.
B. You can go through the citations of Bagel, where each dataset is cited and available on the HF Hub, so you can access each one separately.
C. You can check lm-eval/tasks/$folder
each task has a YAML with the dataset/split used.
Hi, I appreciate your solid work. Since there are many tasks, I wonder if you could provide code to train on these datasets.