artidoro / qlora

QLoRA: Efficient Finetuning of Quantized LLMs
https://arxiv.org/abs/2305.14314
MIT License

[Not an issue] - finetune falcon-40b with Qlora #138

Open canamika27 opened 1 year ago

canamika27 commented 1 year ago

Hi, I tried finetuning falcon-40b with QLoRA and compared its performance with llama-65b, which I also finetuned with QLoRA; both were finetuned on the same dataset (oasst1).

I then compared their MMLU eval results. Please find the results below:

llama-65b: {'mmlu_loss': 2.6868371820325327, 'mmlu_eval_accuracy_global_facts': 0.4, 'mmlu_eval_accuracy_high_school_statistics': 0.391304347826087, 'mmlu_eval_accuracy_elementary_mathematics': 0.3902439024390244, 'mmlu_eval_accuracy_moral_scenarios': 0.52, 'mmlu_eval_accuracy_professional_law': 0.47058823529411764, 'mmlu_eval_accuracy_anatomy': 0.42857142857142855, 'mmlu_eval_accuracy_conceptual_physics': 0.5, 'mmlu_eval_accuracy_human_sexuality': 0.75, 'mmlu_eval_accuracy_formal_logic': 0.35714285714285715, 'mmlu_eval_accuracy_management': 0.8181818181818182, 'mmlu_eval_accuracy_human_aging': 0.8260869565217391, 'mmlu_eval_accuracy_high_school_mathematics': 0.2413793103448276, 'mmlu_eval_accuracy_logical_fallacies': 0.7222222222222222, 'mmlu_eval_accuracy_high_school_world_history': 0.7307692307692307, 'mmlu_eval_accuracy_business_ethics': 0.5454545454545454, 'mmlu_eval_accuracy_machine_learning': 0.7272727272727273, 'mmlu_eval_accuracy_miscellaneous': 0.7093023255813954, 'mmlu_eval_accuracy_public_relations': 0.5833333333333334, 'mmlu_eval_accuracy_college_medicine': 0.5909090909090909, 'mmlu_eval_accuracy_abstract_algebra': 0.36363636363636365, 'mmlu_eval_accuracy_high_school_microeconomics': 0.6538461538461539, 'mmlu_eval_accuracy_high_school_psychology': 0.8833333333333333, 'mmlu_eval_accuracy_high_school_physics': 0.29411764705882354, 'mmlu_eval_accuracy_philosophy': 0.7647058823529411, 'mmlu_eval_accuracy_college_biology': 0.6875, 'mmlu_eval_accuracy_high_school_chemistry': 0.36363636363636365, 'mmlu_eval_accuracy_moral_disputes': 0.5526315789473685, 'mmlu_eval_accuracy_computer_security': 0.6363636363636364, 'mmlu_eval_accuracy_world_religions': 0.8947368421052632, 'mmlu_eval_accuracy_security_studies': 0.6666666666666666, 'mmlu_eval_accuracy_sociology': 0.9545454545454546, 'mmlu_eval_accuracy_nutrition': 0.6666666666666666, 'mmlu_eval_accuracy_clinical_knowledge': 0.5172413793103449, 'mmlu_eval_accuracy_professional_accounting': 0.5483870967741935, 'mmlu_eval_accuracy_medical_genetics': 0.9090909090909091, 'mmlu_eval_accuracy_professional_medicine': 0.5161290322580645, 'mmlu_eval_accuracy_jurisprudence': 0.7272727272727273, 'mmlu_eval_accuracy_high_school_government_and_politics': 0.8571428571428571, 'mmlu_eval_accuracy_us_foreign_policy': 1.0, 'mmlu_eval_accuracy_electrical_engineering': 0.5, 'mmlu_eval_accuracy_high_school_european_history': 0.8333333333333334, 'mmlu_eval_accuracy_college_computer_science': 0.5454545454545454, 'mmlu_eval_accuracy_high_school_biology': 0.71875, 'mmlu_eval_accuracy_professional_psychology': 0.5797101449275363, 'mmlu_eval_accuracy_high_school_geography': 0.8181818181818182, 'mmlu_eval_accuracy_international_law': 0.9230769230769231, 'mmlu_eval_accuracy_virology': 0.5555555555555556, 'mmlu_eval_accuracy_college_mathematics': 0.2727272727272727, 'mmlu_eval_accuracy_high_school_macroeconomics': 0.6976744186046512, 'mmlu_eval_accuracy_astronomy': 0.6875, 'mmlu_eval_accuracy_high_school_us_history': 0.7727272727272727, 'mmlu_eval_accuracy_prehistory': 0.6857142857142857, 'mmlu_eval_accuracy_high_school_computer_science': 0.5555555555555556, 'mmlu_eval_accuracy_econometrics': 0.5, 'mmlu_eval_accuracy_college_physics': 0.36363636363636365, 'mmlu_eval_accuracy_marketing': 0.88, 'mmlu_eval_accuracy_college_chemistry': 0.375, 'mmlu_eval_accuracy': 0.6214914107432928, 'epoch': 3.39}

falcon-40b: {'mmlu_loss': 6.601600746573107, 'mmlu_eval_accuracy_philosophy': 0.7058823529411765, 'mmlu_eval_accuracy_electrical_engineering': 0.375, 'mmlu_eval_accuracy_high_school_world_history': 0.5, 'mmlu_eval_accuracy_marketing': 0.8, 'mmlu_eval_accuracy_moral_disputes': 0.47368421052631576, 'mmlu_eval_accuracy_clinical_knowledge': 0.4827586206896552, 'mmlu_eval_accuracy_high_school_geography': 0.7727272727272727, 'mmlu_eval_accuracy_high_school_microeconomics': 0.4230769230769231, 'mmlu_eval_accuracy_high_school_psychology': 0.6833333333333333, 'mmlu_eval_accuracy_high_school_statistics': 0.30434782608695654, 'mmlu_eval_accuracy_human_aging': 0.7391304347826086, 'mmlu_eval_accuracy_conceptual_physics': 0.5, 'mmlu_eval_accuracy_astronomy': 0.375, 'mmlu_eval_accuracy_professional_psychology': 0.463768115942029, 'mmlu_eval_accuracy_high_school_physics': 0.17647058823529413, 'mmlu_eval_accuracy_jurisprudence': 0.36363636363636365, 'mmlu_eval_accuracy_miscellaneous': 0.6162790697674418, 'mmlu_eval_accuracy_college_medicine': 0.36363636363636365, 'mmlu_eval_accuracy_formal_logic': 0.2857142857142857, 'mmlu_eval_accuracy_moral_scenarios': 0.3, 'mmlu_eval_accuracy_anatomy': 0.5714285714285714, 'mmlu_eval_accuracy_high_school_european_history': 0.4444444444444444, 'mmlu_eval_accuracy_college_mathematics': 0.2727272727272727, 'mmlu_eval_accuracy_international_law': 0.6153846153846154, 'mmlu_eval_accuracy_management': 0.6363636363636364, 'mmlu_eval_accuracy_professional_law': 0.2529411764705882, 'mmlu_eval_accuracy_professional_medicine': 0.3870967741935484, 'mmlu_eval_accuracy_virology': 0.5555555555555556, 'mmlu_eval_accuracy_nutrition': 0.5454545454545454, 'mmlu_eval_accuracy_machine_learning': 0.2727272727272727, 'mmlu_eval_accuracy_high_school_computer_science': 0.4444444444444444, 'mmlu_eval_accuracy_high_school_us_history': 0.4090909090909091, 'mmlu_eval_accuracy_high_school_government_and_politics': 0.5714285714285714, 'mmlu_eval_accuracy_prehistory': 0.4, 'mmlu_eval_accuracy_college_computer_science': 0.18181818181818182, 'mmlu_eval_accuracy_college_physics': 0.36363636363636365, 'mmlu_eval_accuracy_business_ethics': 0.45454545454545453, 'mmlu_eval_accuracy_us_foreign_policy': 0.7272727272727273, 'mmlu_eval_accuracy_elementary_mathematics': 0.24390243902439024, 'mmlu_eval_accuracy_abstract_algebra': 0.36363636363636365, 'mmlu_eval_accuracy_sociology': 0.7727272727272727, 'mmlu_eval_accuracy_college_chemistry': 0.0, 'mmlu_eval_accuracy_world_religions': 0.7894736842105263, 'mmlu_eval_accuracy_human_sexuality': 0.25, 'mmlu_eval_accuracy_global_facts': 0.4, 'mmlu_eval_accuracy_high_school_mathematics': 0.2413793103448276, 'mmlu_eval_accuracy_public_relations': 0.75, 'mmlu_eval_accuracy_professional_accounting': 0.45161290322580644, 'mmlu_eval_accuracy_high_school_biology': 0.375, 'mmlu_eval_accuracy_computer_security': 0.18181818181818182, 'mmlu_eval_accuracy_security_studies': 0.2962962962962963, 'mmlu_eval_accuracy_college_biology': 0.4375, 'mmlu_eval_accuracy_logical_fallacies': 0.6666666666666666, 'mmlu_eval_accuracy_high_school_macroeconomics': 0.4186046511627907, 'mmlu_eval_accuracy_high_school_chemistry': 0.2727272727272727, 'mmlu_eval_accuracy_econometrics': 0.25, 'mmlu_eval_accuracy_medical_genetics': 0.9090909090909091, 'mmlu_eval_accuracy': 0.4540568812107724, 'epoch': 3.39}

It's clear that llama-65b is doing better than falcon-40b. I am not sure if I am doing anything wrong while finetuning (I am using the qlora.py provided in the repo).

Please let me know if anyone has tried this comparison.
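
For reference, my understanding of what qlora.py does under the hood for a model like falcon-40b is roughly the following (a minimal sketch with transformers/peft/bitsandbytes; the LoRA settings and the `query_key_value` target module are my guesses, not necessarily what the repo defaults to):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "tiiuae/falcon-40b"

# 4-bit NF4 quantization with double quantization, as described in the QLoRA paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",        # shard the quantized weights across available GPUs
    trust_remote_code=True,   # Falcon ships custom modeling code
)
model = prepare_model_for_kbit_training(model)

# Falcon fuses its attention projections into a single "query_key_value" module;
# the rank/alpha values below are guesses, not the settings used in this issue.
lora_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["query_key_value"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```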

tytung2020 commented 1 year ago

Mind sharing the evaluation code?
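
I assume it is qlora.py's built-in MMLU eval (enabled with something like `--do_mmlu_eval`), but I would like to confirm. A rough sketch of that kind of evaluation, comparing the logits of the four answer letters, would look like the following; this is not the repo's exact code, and the dataset/field names are taken from the `cais/mmlu` copy on the Hub:

```python
import torch
from datasets import load_dataset

# Assumes `model` and `tokenizer` are already loaded (e.g. as in the snippet above).
choices = ["A", "B", "C", "D"]
# Token ids of the single-letter options; the leading space matters for most tokenizers.
abcd_ids = [tokenizer(f" {c}", add_special_tokens=False).input_ids[-1] for c in choices]

mmlu = load_dataset("cais/mmlu", "all", split="validation")  # or "test"

correct = 0
for ex in mmlu:
    prompt = (
        ex["question"]
        + "\n"
        + "\n".join(f"{c}. {o}" for c, o in zip(choices, ex["choices"]))
        + "\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]
    # Pick the answer letter with the highest logit at the next-token position.
    pred = torch.stack([logits[i] for i in abcd_ids]).argmax().item()
    correct += int(pred == ex["answer"])

print("accuracy:", correct / len(mmlu))
```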

phalexo commented 1 year ago

I am curious what GPU setup was used for Falcon. I have tried to run fine-tuning of Llama-65B and Guanaco-65B on 4x 12.288 GiB GPUs, and I get various errors the moment training starts.

I have been able to load Falcon-40B-instruct on the same GPUs, but the UI does not seem to let me enter any questions; it just loops on its own input/output.

So, what config did you use for Falcon? What were the flags? What was the training data?

Thanks.
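
For what it's worth, one thing I am experimenting with on these cards is capping per-GPU memory so the dispatcher leaves headroom for activations. A rough sketch of the idea (the 10GiB cap and the checkpoint name are guesses, not a known-working config):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Cap what accelerate may place on each GPU so some memory is left for
# activations and optimizer state; 10GiB out of ~12GiB is a guess to tune.
max_memory = {i: "10GiB" for i in range(torch.cuda.device_count())}
max_memory["cpu"] = "64GiB"  # allow spill-over to CPU RAM if the cap is too tight

model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-65b",  # assumed checkpoint; same idea for tiiuae/falcon-40b
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
    max_memory=max_memory,
)
```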

artidoro commented 1 year ago

I also found in my experiments that Falcon 40B was not as good as LLaMA 65B on MMLU. One thing to check in the experiments above is whether they were run on the eval set or the test set. Another is that the same hyperparameters might not work equally well for LLaMA and Falcon. Also, it's probably fairer to compare LLaMA 33B and Falcon 40B.

Generally, I am a bit suspicious of the HF Open LLM Leaderboard because its LLaMA results do not match the paper's results. I think there might be some problems with how they handle tokenization there.
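
To make the split point concrete: the two MMLU splits differ a lot in size, so accuracies reported on them are not directly comparable. A quick way to check, using the `cais/mmlu` copy on the Hub (in qlora.py this is selected with the `--mmlu_split` flag, if I recall the name correctly):

```python
from datasets import load_dataset

# Print the size of each split; the validation ("eval") split is much
# smaller than the test split usually reported in papers.
for split in ("validation", "test"):
    ds = load_dataset("cais/mmlu", "all", split=split)
    print(split, len(ds))
```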

phalexo commented 1 year ago

> I also found in my experiments that Falcon 40B was not as good as LLaMA 65B on MMLU. One thing to check in the experiments above is whether they were run on the eval set or the test set. Another is that the same hyperparameters might not work equally well for LLaMA and Falcon. Also, it's probably fairer to compare LLaMA 33B and Falcon 40B.
>
> Generally, I am a bit suspicious of the HF Open LLM Leaderboard because its LLaMA results do not match the paper's results. I think there might be some problems with how they handle tokenization there.

Can you give me the script you used to launch the training session, and whatever flags you may have set inside qlora.py?

I see the model weights being loaded successfully, then some training data. Then it makes 1-2 passes across my 4 GPUs, and then:

File "/home/developer/mambaforge/envs/Guanaco/lib/python3.10/site-packages/bitsandbytes-0.39.0-py3.10.egg/bitsandbytes/nn/modules.py", line 219, in forward out = bnb.matmul_4bit(x, self.weight.t(), bias=bias, quant_state=self.weight.quant_state) File "/home/developer/mambaforge/envs/Guanaco/lib/python3.10/site-packages/bitsandbytes-0.39.0-py3.10.egg/bitsandbytes/autograd/_functions.py", line 566, in matmul_4bit return MatMul4Bit.apply(A, B, out, bias, quant_state) File "/home/developer/mambaforge/envs/Guanaco/lib/python3.10/site-packages/torch/autograd/function.py", line 506, in apply return super().apply(*args, **kwargs) # type: ignore[misc] File "/home/developer/mambaforge/envs/Guanaco/lib/python3.10/site-packages/bitsandbytes-0.39.0-py3.10.egg/bitsandbytes/autograd/_functions.py", line 514, in forward output = torch.nn.functional.linear(A, F.dequantize_fp4(B, state).to(A.dtype).t(), bias) torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1024.00 MiB (GPU 0; 11.93 GiB total capacity; 10.79 GiB already allocated; 545.88 MiB free; 10.84 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

mtisz commented 1 year ago

You guys are discussing something important here. Yes, it seems like the HF Leaderboard is not accurate. More information in the following two links:

  1. https://twitter.com/Francis_YAO_/status/1666833311279517696
  2. https://github.com/FranxYao/chain-of-thought-hub/blob/main/MMLU/readme.md

Regardless, I like Falcon because it's now under the Apache 2 license. We'll keep going at this; we still need to implement RLHF and get the entire open-source community to provide the human feedback required to fully fine-tune models the way ChatGPT, GPT-4, or Claude were.