Closed: xh-yuan closed this issue 1 month ago
It is somewhat hard for me to tell what the problem is, as evaluation is sensitive to a variety of factors (e.g., the vLLM version, the CUDA version, and the generation config). I have attached the UltraEval version I used (ultraeval-07f99f7e.zip). The evaluation command is:
pip install .; python data_process.py; bash scripts/run_paper.sh --model_size 7b --port
I also notice that max_new_tokens is set to 10 in your generation config, which might cause problems.
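As a hedged illustration only (the actual config is in the attached zip and is not shown here; the field names below assume a HuggingFace-style generation config, which may differ from yours), a very small max_new_tokens can truncate the model's answer on MMLU, while a larger budget avoids that:

```yaml
# Hypothetical generation config sketch; field names follow a
# HuggingFace-style generation_config.json and are assumptions,
# not the attached file's contents.
do_sample: false        # greedy decoding for deterministic evaluation
temperature: 0.0
max_new_tokens: 256     # 10 can cut the answer off before the choice letter;
                        # leave enough room for the full response
```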
Thanks for the code! The result has been reproduced.
I followed the description for prosparse-7B and tested accuracy on MMLU with UltraEval. The MMLU average accuracy I get is 41.69, but the paper reports 45.21.
Here is one sample eval configuration:
generation_config:
prosparse-7B configuration: