Open lightest1 opened 4 days ago
Thank you for your question. The issue might be related to the evaluation process. Could you please share the MMLU score of the original Llama2-7B under your setup?
Thank you very much for your reply!
I deleted `peft_path` in `configs/models/hf_llama/hf_llama2_7b.py` and got a weighted_average of 38.89.
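For reference, that model config should look roughly like the stock OpenCompass entry below. This is only a sketch: the field values are assumptions about my local copy, and the commented-out `peft_path` line marks what I deleted.

```python
from opencompass.models import HuggingFaceCausalLM

# Sketch of configs/models/hf_llama/hf_llama2_7b.py after the change.
# Field values are illustrative; the removed peft_path line is the point.
models = [
    dict(
        type=HuggingFaceCausalLM,
        abbr='llama-2-7b-hf',
        path='meta-llama/Llama-2-7b-hf',
        tokenizer_path='meta-llama/Llama-2-7b-hf',
        tokenizer_kwargs=dict(padding_side='left', truncation_side='left', use_fast=False),
        # peft_path='path/to/hydralora/adapter',  # <- deleted for the baseline run
        max_out_len=100,
        max_seq_len=2048,
        batch_size=8,
        model_kwargs=dict(device_map='auto'),
        run_cfg=dict(num_gpus=1, num_procs=1),
    )
]
```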
As a supplement, my dataset config is `mmlu_ppl_ac766d.py`, set up as follows:
from opencompass.openicl.icl_prompt_template import PromptTemplate
from opencompass.openicl.icl_retriever import FixKRetriever, ZeroRetriever
from opencompass.openicl.icl_inferencer import PPLInferencer
from opencompass.openicl.icl_evaluator import AccwithDetailsEvaluator
from opencompass.datasets import MMLUDataset
# None of the mmlu dataset in huggingface is correctly parsed, so we use our own dataset reader
# Please download the dataset from https://people.eecs.berkeley.edu/~hendrycks/data.tar
mmlu_reader_cfg = dict(
    input_columns=['input', 'A', 'B', 'C', 'D'],
    output_column='target',
    train_split='dev')

mmlu_all_sets = [
    'college_biology',
    'college_chemistry',
    'college_computer_science',
    'college_mathematics',
    'college_physics',
    'electrical_engineering',
    'astronomy',
    'anatomy',
    'abstract_algebra',
    'machine_learning',
    'clinical_knowledge',
    'global_facts',
    'management',
    'nutrition',
    'marketing',
    'professional_accounting',
    'high_school_geography',
    'international_law',
    'moral_scenarios',
    'computer_security',
    'high_school_microeconomics',
    'professional_law',
    'medical_genetics',
    'professional_psychology',
    'jurisprudence',
    'world_religions',
    'philosophy',
    'virology',
    'high_school_chemistry',
    'public_relations',
    'high_school_macroeconomics',
    'human_sexuality',
    'elementary_mathematics',
    'high_school_physics',
    'high_school_computer_science',
    'high_school_european_history',
    'business_ethics',
    'moral_disputes',
    'high_school_statistics',
    'miscellaneous',
    'formal_logic',
    'high_school_government_and_politics',
    'prehistory',
    'security_studies',
    'high_school_biology',
    'logical_fallacies',
    'high_school_world_history',
    'professional_medicine',
    'high_school_mathematics',
    'college_medicine',
    'high_school_us_history',
    'sociology',
    'econometrics',
    'high_school_psychology',
    'human_aging',
    'us_foreign_policy',
    'conceptual_physics',
]

mmlu_datasets = []
for _name in mmlu_all_sets:
    _hint = f'There is a single choice question about {_name.replace("_", " ")}. Answer the question by replying A, B, C or D.'
    question_overall = '{input}\nA. {A}\nB. {B}\nC. {C}\nD. {D}'
    mmlu_infer_cfg = dict(
        # ice_template=dict(
        #     type=PromptTemplate,
        #     template={opt: f'{question_overall}\nAnswer: {opt}\n' for opt in ['A', 'B', 'C', 'D']},
        # ),
        prompt_template=dict(
            type=PromptTemplate,
            template={opt: f'{_hint}{question_overall}\nAnswer: {opt}' for opt in ['A', 'B', 'C', 'D']},
            # ice_token='</E>',
        ),
        retriever=dict(type=ZeroRetriever),
        inferencer=dict(type=PPLInferencer),
    )

    mmlu_eval_cfg = dict(evaluator=dict(type=AccwithDetailsEvaluator))

    mmlu_datasets.append(
        dict(
            abbr=f'lukaemon_mmlu_{_name}',
            type=MMLUDataset,
            path='opencompass/mmlu',
            name=_name,
            reader_cfg=mmlu_reader_cfg,
            infer_cfg=mmlu_infer_cfg,
            eval_cfg=mmlu_eval_cfg,
        ))

del _name, _hint
As you instructed, I added two lines at the top of `opencompass/opencompass/models/huggingface.py`:
import sys
sys.path.insert(0, './HydraLoRA-main/HydraLoRA/')
In addition to these, for training on databricks-dolly-15k I modified the code to use `context` as the input and `response` as the output (a minimal sketch of this mapping is shown below).
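To be concrete, the mapping I mean is roughly the following. This is only a sketch: the function name and the returned `input`/`output` keys are placeholders, not the repo's actual code; only the dataset columns (`instruction`, `context`, `response`) come from databricks-dolly-15k.

```python
# Sketch of the databricks-dolly-15k field mapping described above.
def format_dolly(example):
    prompt = example['instruction']
    if example.get('context'):
        # append the context (when present) to the instruction as the model input
        prompt = prompt + '\n\n' + example['context']
    # the response column becomes the training target
    return {'input': prompt, 'output': example['response']}
```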
It seems that the original Llama2-7B metrics are normal, and I didn't make any other changes, so I'm not sure what the problem is. Is it something I loaded or set up incorrectly? Thanks again for your reply and time!
Thank you for your question. Your configuration looks reasonable. Since there have been significant updates to OpenCompass, I will debug your configuration again over the weekend.
Thank you very much for your patient reply! I look forward to your suggestions!
Hello! Thank you for your work! I ran a round of fine-tuning on databricks-dolly-15k following your script settings. When I evaluated MMLU with OpenCompass, the weighted_average result was 38.38, which is even worse than the original Llama2-7B. I don't know where I went wrong; could you help me confirm? My fine-tuning.sh:
I chose `opencompass/configs/eval_hf_llama2.py` as the evaluation script and `opencompass.configs.datasets.mmlu.mmlu_ppl_ac766d.py` as the dataset script. My model script looks like:
The only difference from what you mentioned is that I added a line:
self.base_model = self.base_model.to(dtype=torch.bfloat16).to(self.device)
at the end of `def __init__` in `class PeftModel` during evaluation, because not doing so would result in device and dtype errors (see the sketch at the end of this comment). I also deleted `task_types` in `HydraLoRA-main/HydraLoRA/build_dataset.py`. Should I keep it? The final result is quite different from the one in the paper, and even worse than before fine-tuning, so I would like to know where I set something up or understood something incorrectly. Thank you very much for your help and time!
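For clarity, a minimal sketch of where I placed the cast. Only the final line is the actual change; the rest of the class body is simplified and assumed for illustration (the real HydraLoRA `PeftModel` takes more arguments).

```python
import torch

# Simplified sketch: everything except the final cast line is an assumption.
class PeftModel:
    def __init__(self, base_model, device='cuda'):
        self.device = device
        self.base_model = base_model
        # Added at the end of __init__: without this cast I hit
        # device/dtype mismatch errors during evaluation.
        self.base_model = self.base_model.to(dtype=torch.bfloat16).to(self.device)
```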