Proposed list of experiments to run:
https://docs.google.com/spreadsheets/d/1xOIYoExQP_haA80ArY09fAk49fNVTO8eJrYysw71FQs/edit?usp=sharing
This list was created using this notebook: https://github.com/bigcode-project/Megatron-LM/blob/raymond-notebooks/notebooks/transformer_parameter_count.ipynb (a parameter-count sketch follows the list below).
Still open questions:
[ ] Which languages to train on? We could afford to run each experiment on both a single-language and a multi-language dataset, doubling the compute.
[ ] Which evaluations? HumanEval, MBPP, repo-level eval? Some downstream tasks with finetuning? https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/finetuning/ (see the pass@k sketch further down).
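For context on how the experiment sizes were derived, here is a minimal sketch of the standard decoder-only parameter count; the function name, the default sequence length, and the example configuration are illustrative assumptions, not taken from the notebook above.

```python
def transformer_param_count(n_layers: int, d_model: int, vocab_size: int,
                            seq_len: int = 2048) -> int:
    """Approximate parameter count of a decoder-only transformer.

    Standard estimate: token + position embeddings, plus per-layer
    attention (4 * d_model^2) and MLP (8 * d_model^2, assuming a 4x
    hidden expansion). Biases and layer norms are omitted as negligible.
    """
    embeddings = vocab_size * d_model + seq_len * d_model
    per_layer = 12 * d_model ** 2  # 4*d^2 attention + 8*d^2 MLP
    return embeddings + n_layers * per_layer


# Hypothetical example configuration: ~1.3B parameters
print(transformer_param_count(n_layers=24, d_model=2048, vocab_size=49152))
```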
Some facts to decide which dataset to train on:
After additional near-dedup, we have:
Some of the evaluations that may be relevant:
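HumanEval and MBPP scores are typically reported as pass@k. For reference, a self-contained sketch of the unbiased pass@k estimator from the Codex paper (Chen et al., 2021); the sample counts in the usage example are made up for illustration.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples generated per problem
    c: number of samples that pass the unit tests
    k: sampling budget being evaluated
    """
    if n - c < k:
        return 1.0  # a correct sample is guaranteed in any k-subset
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))


# Illustrative numbers: 200 samples per problem, 12 of them correct
print(pass_at_k(200, 12, 1), pass_at_k(200, 12, 10))
```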