OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License

Is evaluation on MMLU dataset supported? #30

Closed. brisker closed this issue 8 months ago

brisker commented 10 months ago

Is evaluation on the MMLU dataset supported? I can find the corresponding code here: https://github.com/OpenGVLab/OmniQuant/blob/main/categories.py, but I cannot find any API that can be called.

ChenMnZ commented 10 months ago

If you want to get results on MMLU, add --tasks hendrycksTest* to your command.
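(For context: in the 0.3.x-era lm-evaluation-harness used here, the MMLU subtasks are registered as hendrycksTest-<subject>, e.g. hendrycksTest-abstract_algebra, hendrycksTest-anatomy, and so on; the trailing * is a glob-style pattern that the task parser expands to all of those subtasks. Exact task names can differ in other harness versions.)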

brisker commented 10 months ago

@ChenMnZ Another question: why is the Llama-1 fp16 accuracy in your paper different from the results reported in the Llama-2 paper?

Results in your paper: [image]

Results in the Llama-2 paper: [image]

ChenMnZ commented 9 months ago

@brisker Different evaluation pipelines lead to different results; you can find more details in the Llama-2 paper or repo.

In my paper, I evaluated all zero-shot tasks with https://github.com/EleutherAI/lm-evaluation-harness, which is widely used in the community.

You can also easily reproduce the results reported in my paper with lm-evaluation-harness.
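For reference, a minimal sketch of what such an fp16 zero-shot run looks like with the harness's Python API (assuming a 0.3.x-era lm-evaluation-harness; the model string, model_args keys, and task names below are illustrative and can differ between harness versions):

```python
from lm_eval import evaluator

# Illustrative sketch, not the exact evaluation script used for the paper:
# evaluate an fp16 HF checkpoint on a few zero-shot tasks with the harness.
results = evaluator.simple_evaluate(
    model="hf-causal",                        # HF causal-LM backend in 0.3.x-era releases
    model_args="pretrained=/PATH/TO/Llama-2-7b",
    tasks=["winogrande", "piqa", "arc_easy", "arc_challenge", "hellaswag"],
    num_fewshot=0,
    batch_size=1,
)
print(results["results"])
```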

Forival commented 9 months ago

How can I load the quantized model for evaluation? I want to test other aspects besides zero-shot accuracy, such as memory usage and token-processing efficiency.

tro0o commented 9 months ago

I evaluated five zero-shot tasks through lm-evaluation-harness, but the results I obtained differ significantly from the metrics reported in your paper: [image]

ChenMnZ commented 9 months ago

@Forival If you want to reproduce the zero-shot accuracy or perplexity of quantized models, see the "reproduce the evaluation results of our paper" part of https://github.com/OpenGVLab/OmniQuant#usage for more details. If you want to obtain practical memory reduction and speedup, you should leverage mlc-llm; see https://github.com/OpenGVLab/OmniQuant/blob/main/runing_quantized_models_with_mlc_llm.ipynb for more details.

ChenMnZ commented 9 months ago

@tro0o There are two reasons.

brisker commented 9 months ago

@ChenMnZ Adding --tasks hendrycksTest to the command leads to NaN results on the MMLU dataset:

fp16 test: python main.py --model LLM/Llama-2-7b --epochs 0 --output_dir ./log/debug --tasks hendrycksTest --wbits 16 --abits 16

[2023-11-17 14:24:34 root](main.py 157): INFO {'results': {}, 'versions': {}, 'config': {'model': <models.LMClass.LMClass object at 0x7fa11e7be190>, 'model_args': None, 'num_fewshot': 0, 'limit': None, 'bootstrap_iters': 100000, 'description_dict': None}}
{'config': {'bootstrap_iters': 100000,
            'description_dict': None,
            'limit': None,
            'model': <models.LMClass.LMClass object at 0x7fa11e7be190>,
            'model_args': None,
            'num_fewshot': 0},
 'results': {},
 'versions': {}}
/usr/local/miniconda3/lib/python3.8/site-packages/numpy/core/fromnumeric.py:3419: RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/usr/local/miniconda3/lib/python3.8/site-packages/numpy/core/_methods.py:188: RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
[2023-11-17 14:24:34 root](main.py 184): INFO Average accuracy nan - STEM
[2023-11-17 14:24:34 root](main.py 184): INFO Average accuracy nan - humanities
[2023-11-17 14:24:34 root](main.py 184): INFO Average accuracy nan - social sciences
[2023-11-17 14:24:34 root](main.py 184): INFO Average accuracy nan - other (business, health, misc.)
[2023-11-17 14:24:34 root](main.py 186): INFO Average accuracy: nan

ChenMnZ commented 9 months ago

@brisker It should be --tasks hendrycksTest*, not --tasks hendrycksTest.
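For example, the fp16 test above becomes: python main.py --model LLM/Llama-2-7b --epochs 0 --output_dir ./log/debug --tasks 'hendrycksTest*' --wbits 16 --abits 16 (the quotes keep the shell from expanding * as a file glob).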

brisker commented 9 months ago

@ChenMnZ I have already replaced all the files in the lm_eval folder except for evaluator.py, but I still get different fp16 results on the winogrande dataset: llama2-7b-omniquant-fp16: 67.24; llama2-7b-lm-eval-original-code-fp16 (https://github.com/EleutherAI/lm-evaluation-harness): 69.22. In my understanding, the difference in evaluator.py cannot influence the accuracy. Is there any other reason for the accuracy difference?

ChenMnZ commented 9 months ago

@brisker This repo loads the model in float16. https://github.com/OpenGVLab/OmniQuant/blob/834847adcee9575b89cd14ed2a3623c770743b4a/models/LMClass.py#L26

You can try replacing float16 with bfloat16.
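A minimal sketch of the kind of change meant, assuming the model is loaded via transformers' AutoModelForCausalLM with an explicit torch_dtype (the actual line in LMClass.py may look slightly different):

```python
import torch
from transformers import AutoModelForCausalLM

# Illustrative sketch, not the exact LMClass.py code: load the checkpoint in
# bfloat16 instead of float16 to check whether the dtype explains the small
# zero-shot accuracy gap discussed above.
model = AutoModelForCausalLM.from_pretrained(
    "LLM/Llama-2-7b",            # path taken from the command earlier in this thread
    torch_dtype=torch.bfloat16,  # was torch.float16 in the repo's loader
)
```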

brisker commented 9 months ago

@ChenMnZ But in the llama2-7b-lm-eval-original-code-fp16 run (https://github.com/EleutherAI/lm-evaluation-harness), which gives 69.22, I also use float16.

ChenMnZ commented 9 months ago

@brisker You can try replacing evaluator.py and modifying the simple_evaluate function to fit the original OmniQuant code.

I think such an accuracy difference is caused by some update in evaluator.py.