Our model won 🏆 first prize 🏆 in the RTX 4090 track of the NeurIPS Large Language Model Efficiency Challenge (1 LLM + 1 GPU + 1 Day). We used Mistral-7B as the base model and fine-tuned it with QLoRA for 24 hours on a single RTX 4090 GPU.
Model Name | Checkpoint | Dataset | License |
---|---|---|---|
Birbal-7B-V1 | 🤗 Birbal-7B-V1 | upaya07/NeurIPS-LLM-data | Apache License 2.0 |
Task | Score |
---|---|
MMLU - EM | 0.629 |
MMLU - EM (Robustness) | 0.591 |
MMLU - EM (Fairness) | 0.596 |
MMLU Mean Win Rate | 0.417 |
TruthfulQA - EM | 0.59 |
TruthfulQA - EM (Robustness) | 0.541 |
TruthfulQA - EM (Fairness) | 0.492 |
TruthfulQA Mean Win Rate | 0.75 |
BIG-bench - EM | 0.330 |
BIG-bench Mean Win Rate | 0.75 |
GSM8K - EM | 0.443 |
GSM8K Mean Win Rate | 0.625 |
BBQ - EM | 0.738 |
BBQ Mean Win Rate | 0.25 |
sam_sum - ROUGE-2 | 0.127 |
sam_sum - Stereotypes (race) | 0.667 |
sam_sum - Stereotypes (gender) | 0.447 |
sam_sum - Representation (race) | 0.458 |
sam_sum - Representation (gender) | 0.013 |
sam_sum Mean Win Rate | 0.383 |
corr2cause - EM | 0.615 |
corr2cause Mean Win Rate | 0.875 |
MATH - Equivalent (chain-of-thought) | 0.121 |
MATH Mean Win Rate | 0.75 |
ethics_justice - EM | 0.68 |
ethics_justice - EM (Robustness) | 0.645 |
ethics_justice - EM (Fairness) | 0.62 |
ethics_commonsense - EM | 0.41 |
ethics_commonsense - EM (Robustness) | 0.33 |
ethics_commonsense - EM (Fairness) | 0.345 |
ethics_virtue - EM | 0.895 |
ethics_virtue - EM (Robustness) | 0.865 |
ethics_virtue - EM (Fairness) | 0.86 |
ethics_deontology - EM | 0.63 |
ethics_deontology - EM (Robustness) | 0.585 |
ethics_deontology - EM (Fairness) | 0.595 |
ethics_utilitarianism - EM | 0.72 |
ethics_utilitarianism - EM (Robustness) | 0.6 |
ethics_utilitarianism - EM (Fairness) | 0.645 |
ethics Mean Win Rate | 0.55 |
🔥 Score_full | 0.579 |
🔥 Score_open | 0.516 |
🔥 Score_hidden | 0.61 |
Position | Score |
---|---|
5th rank | 0.362 |
4th rank | 0.371 |
3rd rank | 0.381 |
2nd rank | 0.424 |
🔥 Ours (1st) | 0.579 |
Refer to the 4090_full_ranks.json file for the scores of the top teams in the final stage of the competition.
Model | Checkpoint | Dataset | License |
---|---|---|---|
Birbal-200k | 🤗 Birbal-200k | 200k | Apache License 2.0 |
Birbal-400k | 🤗 Birbal-400k | 400k | Apache License 2.0 |
Birbal-700k | 🤗 Birbal-700k | 700k | Apache License 2.0 |
The Natural Instructions dataset is a community effort to create a large collection of tasks with natural-language definitions/instructions. As shown in the diagram above, we sample from the Natural Instructions dataset. Here is the 4-step process:
A record from a task file in the Natural Instructions data is converted into the format below. The `orig_input` field is the actual input without few-shot examples. The `few_shot_prompt` field contains a few-shot prompt and is passed to the Mistral-7B model for prediction. `answer` is the ground truth, and `prediction` is the output generated by the Mistral-7B base model.
```json
{
  "orig_input": "Context: I sold my $90,000.00 Mercedes G500 and bought 3 Prius's, because I got tired of being pulled over by Police. #Adapt @chrisrock\u2014 Isaiah Washington (@IWashington) April 1, 2015 Question: how many prius's did they buy? Answer: three",
  "few_shot_prompt": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\nIn this task, you are given a context tweet, a question and corresponding answer of given question. Your task is to classify this question-answer pair into two categories: (1) \"yes\" if the given answer is right for question, and (2) \"no\" if the given answer is wrong for question.\n\n### Input:\nContext: Our prayers are with the students, educators & families at Independence High School & all the first responders on the scene. #PatriotPride\u2014 Doug Ducey (@dougducey) February 12, 2016 Question: at which school were first responders on the scene for? Answer: arizona high school\n\n### Response:\nno\n\n### Input:\nContext: @williebosshog huge love to you/your family huge respect for your business prosperities and the family values you still all behold. big fan\u2014 Liam Payne (@Real_Liam_Payne) January 18, 2014 Question: what was liam showing towards willy? Answer: huge respect\n\n### Response:\nyes\n\n### Input:\nContext: @williebosshog huge love to you/your family huge respect for your business prosperities and the family values you still all behold. big fan\u2014 Liam Payne (@Real_Liam_Payne) January 18, 2014 Question: what was liam showing towards willy? Answer: jealousy\n\n### Response:\nno\n\n### Input:\nContext: Our prayers are with the students, educators & families at Independence High School & all the first responders on the scene. #PatriotPride\u2014 Doug Ducey (@dougducey) February 12, 2016 Question: at which school were first responders on the scene for? Answer: independence high school\n\n### Response:\nyes\n\n### Input:\nContext: I sold my $90,000.00 Mercedes G500 and bought 3 Prius's, because I got tired of being pulled over by Police. #Adapt @chrisrock\u2014 Isaiah Washington (@IWashington) April 1, 2015 Question: how many prius's did they buy? Answer: three\n\n### Response:\n",
  "answer": [
    "yes"
  ],
  "prediction": "yes\n\n### Input:\nContext: I sold my $90,000.00 Mercedes G500 and bought 3 Pri"
}
```
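An Alpaca-style few-shot prompt like the one above can be assembled programmatically from a task's instruction, a handful of solved examples, and the query instance. The sketch below is illustrative only; the helper name and exact template wording are assumptions, not the competition code:

```python
# Illustrative sketch of few-shot prompt assembly for a Natural
# Instructions record. Function name and template details are
# assumptions, not the actual competition code.

HEADER = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
)

def build_few_shot_prompt(definition, few_shot_examples, orig_input):
    """definition: the task's natural-language instruction;
    few_shot_examples: (input, output) pairs used as demonstrations;
    orig_input: the query instance, whose response is left blank."""
    prompt = HEADER + f"### Instruction:\n{definition}\n"
    for ex_input, ex_output in few_shot_examples:
        prompt += f"\n### Input:\n{ex_input}\n\n### Response:\n{ex_output}\n"
    # Final instance: empty response section for the model to complete.
    prompt += f"\n### Input:\n{orig_input}\n\n### Response:\n"
    return prompt
```

The base model's completion of the trailing `### Response:` section is then compared against `answer` to produce the `prediction` field.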
Final model training data: https://huggingface.co/datasets/upaya07/NeurIPS-LLM-data
```shell
# clone repository
git clone git@github.com:Upaya07/NeurIPS-llm-efficiency-challenge.git
cd NeurIPS-llm-efficiency-challenge/training/axolotl

# installation
pip install packaging
pip install -e '.[flash-attn]'
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cu118
pip install -U git+https://github.com/huggingface/peft.git

# Downloads required data and launches model fine-tuning. Runs 3 epochs on the data.
# The script keeps track of the best checkpoint based on eval_loss.
# nips_02.yml contains all hyperparameters.
accelerate launch -m axolotl.cli.train examples/mistral/nips/nips_02.yml
```
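For orientation, an axolotl QLoRA config for this kind of setup typically looks like the fragment below. These are illustrative placeholder values, not the hyperparameters we used; refer to `examples/mistral/nips/nips_02.yml` in the repository for the actual configuration.

```yaml
# Illustrative axolotl QLoRA fragment -- placeholder values only,
# NOT the contents of nips_02.yml.
base_model: mistralai/Mistral-7B-v0.1
load_in_4bit: true            # QLoRA: 4-bit quantized base weights
adapter: qlora
lora_r: 16                    # placeholder adapter rank
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
sequence_len: 4096
num_epochs: 3                 # matches the 3-epoch run above
micro_batch_size: 2
gradient_accumulation_steps: 4
learning_rate: 0.0002
val_set_size: 0.05            # eval split used to track best eval_loss
```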