Open lightest1 opened 4 days ago
Thank you for your question. The issue might be related to the evaluation process. Could you please share the MMLU score of the original Llama2-7B under your setup?
Thank you very much for your reply!
I deleted `peft_path` in `configs/models/hf_llama/hf_llama2_7b.py` and got a weighted_average of 38.89.
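For reference, that model config should look roughly like the stock OpenCompass entry below. This is only a sketch: the field values are assumptions about my local copy, and the commented-out `peft_path` line marks what I deleted.

```python
from opencompass.models import HuggingFaceCausalLM

# Sketch of configs/models/hf_llama/hf_llama2_7b.py after the change.
# Field values are illustrative; the removed peft_path line is the point.
models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='llama-2-7b-hf',
        path='meta-llama/Llama-2-7b-hf',
        tokenizer_path='meta-llama/Llama-2-7b-hf',
        tokenizer_kwargs=dict(padding_side='left', truncation_side='left', use_fast=False),
        # peft_path='path/to/hydralora/adapter',  # <- deleted for the baseline run
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        model_kwargs=dict(device_map='auto'),
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
```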
As a supplement, my dataset config is `mmlu_ppl_ac766d.py`, set up as follows:
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import FixKRetriever, ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccwithDetailsEvaluator
from opencompass.datasets import MMLUDataset
# None of the mmlu dataset in huggingface is correctly parsed, so we use our own dataset reader
# Please download the dataset from https://people.eecs.berkeley.edu/~hendrycks/data.tar
mmlu_reader_cfg = dict(
    input_columns=['input', 'A', 'B', 'C', 'D'],
    output_column='target',
    train_split='dev')

mmlu_all_sets = [
    'college_biology',
    'college_chemistry',
    'college_computer_science',
    'college_mathematics',
    'college_physics',
    'electrical_engineering',
    'astronomy',
    'anatomy',
    'abstract_algebra',
    'machine_learning',
    'clinical_knowledge',
    'global_facts',
    'management',
    'nutrition',
    'marketing',
    'professional_accounting',
    'high_school_geography',
    'international_law',
    'moral_scenarios',
    'computer_security',
    'high_school_microeconomics',
    'professional_law',
    'medical_genetics',
    'professional_psychology',
    'jurisprudence',
    'world_religions',
    'philosophy',
    'virology',
    'high_school_chemistry',
    'public_relations',
    'high_school_macroeconomics',
    'human_sexuality',
    'elementary_mathematics',
    'high_school_physics',
    'high_school_computer_science',
    'high_school_european_history',
    'business_ethics',
    'moral_disputes',
    'high_school_statistics',
    'miscellaneous',
    'formal_logic',
    'high_school_government_and_politics',
    'prehistory',
    'security_studies',
    'high_school_biology',
    'logical_fallacies',
    'high_school_world_history',
    'professional_medicine',
    'high_school_mathematics',
    'college_medicine',
    'high_school_us_history',
    'sociology',
    'econometrics',
    'high_school_psychology',
    'human_aging',
    'us_foreign_policy',
    'conceptual_physics',
]

mmlu_datasets = []
for _name in mmlu_all_sets:
    _hint = f'There is a single choice question about {_name.replace("_", " ")}. Answer the question by replying A, B, C or D.'
    question_overall = '{input}\nA. {A}\nB. {B}\nC. {C}\nD. {D}'
    mmlu_infer_cfg = dict(
        # ice_template=dict(
        #     type=PromptTemplate,
        #     template={opt: f'{question_overall}\nAnswer: {opt}\n' for opt in ['A', 'B', 'C', 'D']},
        # ),
        prompt_template=dict(
            type=PromptTemplate,
            template={opt: f'{_hint}{question_overall}\nAnswer: {opt}' for opt in ['A', 'B', 'C', 'D']},
            # ice_token='</E>',
        ),
        retriever=dict(type=ZeroRetriever),
        inferencer=dict(type=PPLInferencer),
    )

    mmlu_eval_cfg = dict(evaluator=dict(type=AccwithDetailsEvaluator))

    mmlu_datasets.append(
        dict(
            abbr=f'lukaemon_mmlu_{_name}',
            type=MMLUDataset,
            path='opencompass/mmlu',
            name=_name,
            reader_cfg=mmlu_reader_cfg,
            infer_cfg=mmlu_infer_cfg,
            eval_cfg=mmlu_eval_cfg,
        ))

del _name, _hint
As you instructed, I added two lines at the top of `opencompass/opencompass/models/huggingface.py`:
import sys
sys.path.insert(0, './HydraLoRA-main/HydraLoRA/')
In addition to these, for training on databricks-dolly-15k I modified the code to use `context` as the input and `response` as the output (a minimal sketch of this mapping is shown below).
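To be concrete, the mapping I mean is roughly the following. This is only a sketch: the function name and the returned `input`/`output` keys are placeholders, not the repo's actual code; only the dataset columns (`instruction`, `context`, `response`) come from databricks-dolly-15k.

```python
# Sketch of the databricks-dolly-15k field mapping described above.
def format_dolly(example):
    prompt = example['instruction']
    if example.get('context'):
        # append the context (when present) to the instruction as the model input
        prompt = prompt + '\n\n' + example['context']
    # the response column becomes the training target
    return {'input': prompt, 'output': example['response']}
```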
It seems that the original Llama2-7B metrics are normal, and I didn't make any other changes, so I'm not sure what the problem is. Is it something I loaded or set up incorrectly? Thanks again for your reply and time!
Thank you for your question. Your configuration looks reasonable. Since there have been significant updates to OpenCompass, I will debug your configuration again over the weekend.
Thank you very much for your patient reply! I look forward to your suggestions!
Hello! Thank you for your work! I ran a round of fine-tuning on databricks-dolly-15k following your script settings. When I evaluated MMLU with OpenCompass, the weighted_average result was 38.38, which is even worse than the original Llama2-7B. I don't know where I went wrong; could you help me confirm? My fine-tuning.sh:
I chose `opencompass/configs/eval_hf_llama2.py` as the evaluation script and `opencompass.configs.datasets.mmlu.mmlu_ppl_ac766d.py` as the dataset script. My model script looks like:
The only difference from what you mentioned is that I added a line:
self.base_model = self.base_model.to(dtype=torch.bfloat16).to(self.device)
at the end of `def __init__` in `class PeftModel` during evaluation, because not doing so would result in device and dtype errors (see the sketch at the end of this comment). I also deleted `task_types` in `HydraLoRA-main/HydraLoRA/build_dataset.py`. Should I keep it? The final result is quite different from the one in the paper, and even worse than before fine-tuning, so I would like to know where I set something up or understood something incorrectly. Thank you very much for your help and time!
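For clarity, a minimal sketch of where I placed the cast. Only the final line is the actual change; the rest of the class body is simplified and assumed for illustration (the real HydraLoRA `PeftModel` takes more arguments).

```python
import torch

# Simplified sketch: everything except the final cast line is an assumption.
class PeftModel:
    def __init__(self, base_model, device='cuda'):
        self.device = device
        self.base_model = base_model
        # Added at the end of __init__: without this cast I hit
        # device/dtype mismatch errors during evaluation.
        self.base_model = self.base_model.to(dtype=torch.bfloat16).to(self.device)
```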