OpenGVLab / OmniQuant

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.
MIT License
663 stars, 50 forks

Is evaluation on MMLU dataset supported? #30

Closed brisker closed 8 months ago

brisker commented 10 months ago

Is evaluation on the MMLU dataset supported? I can find the corresponding code here: but I cannot find any API that can be called.

ChenMnZ commented 10 months ago

If you want to get the results on MMLU, you should add --tasks hendrycksTest* in your command.

brisker commented 10 months ago

@ChenMnZ Another puzzle: why is the Llama-1 fp16 accuracy in your paper different from the results reported in the Llama-2 paper?

Results in your paper: image

Results in Llama-2 paper: image

ChenMnZ commented 9 months ago

@brisker Different processing methods lead to different results; you can find more details in the Llama-2 paper or repo.

In my paper, I evaluate all zero-shot tasks through lm-evaluation-harness, which is popular in the community.

You can also easily reproduce the results reported in my paper with lm-evaluation-harness.

Forival commented 9 months ago

How do I load the quantized model for evaluation? I want to measure aspects other than zero-shot accuracy, including memory usage and token processing efficiency.

tro0o commented 9 months ago

I evaluated five zero-shot tasks through lm-evaluation-harness, but the results I obtained show significant differences from the metrics presented in your paper. image

ChenMnZ commented 9 months ago

@Forival If you want to reproduce the zero-shot accuracy or perplexity of quantized models, refer to "reproduce evaluation results of our paper" for more details. If you want to obtain practical memory reduction and speedup, you should leverage mlc-llm.

ChenMnZ commented 9 months ago

@tro0o There are two reasons.

brisker commented 9 months ago

@ChenMnZ Adding --tasks hendrycksTest to the command leads to NaN results on the MMLU dataset:

fp16 test: python --model LLM/Llama-2-7b --epochs 0 --output_dir ./log/debug --tasks hendrycksTest --wbits 16 --abits 16

[2023-11-17 14:24:34 root]( 157): INFO {'results': {}, 'versions': {}, 'config': {'model': <models.LMClass.LMClass object at 0x7fa11e7be190>, 'model_args': None, 'num_fewshot': 0, 'limit': None, 'bootstrap_iters': 100000, 'description_dict': None}}
{'config': {'bootstrap_iters': 100000,
            'description_dict': None,
            'limit': None,
            'model': <models.LMClass.LMClass object at 0x7fa11e7be190>,
            'model_args': None,
            'num_fewshot': 0},
 'results': {},
 'versions': {}}
/usr/local/miniconda3/lib/python3.8/site-packages/numpy/core/ RuntimeWarning: Mean of empty slice.
  return _methods._mean(a, axis=axis, dtype=dtype,
/usr/local/miniconda3/lib/python3.8/site-packages/numpy/core/ RuntimeWarning: invalid value encountered in double_scalars
  ret = ret.dtype.type(ret / rcount)
[2023-11-17 14:24:34 root]( 184): INFO Average accuracy nan - STEM
[2023-11-17 14:24:34 root]( 184): INFO Average accuracy nan - humanities
[2023-11-17 14:24:34 root]( 184): INFO Average accuracy nan - social sciences
[2023-11-17 14:24:34 root]( 184): INFO Average accuracy nan - other (business, health, misc.)
[2023-11-17 14:24:34 root]( 186): INFO Average accuracy: nan

ChenMnZ commented 9 months ago

@brisker The option is --tasks hendrycksTest* (with the trailing asterisk), not --tasks hendrycksTest.
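The difference between the two spellings can be sketched with Python's fnmatch module, assuming the harness expands each --tasks entry as a shell-style wildcard against its registered task names. The helper expand_tasks and the short task list below are hypothetical stand-ins for illustration, not this repo's actual code, and the per-subject MMLU tasks are assumed to be named hendrycksTest-<subject>.

```python
import fnmatch

# Illustrative subset of task names (assumption): the real registry is much
# larger, with one "hendrycksTest-<subject>" entry per MMLU subject.
REGISTERED_TASKS = [
    "arc_easy",
    "hendrycksTest-abstract_algebra",
    "hendrycksTest-anatomy",
    "hendrycksTest-astronomy",
    "piqa",
    "winogrande",
]


def expand_tasks(patterns, registered=REGISTERED_TASKS):
    """Return the sorted registered task names matching any wildcard pattern.

    Hypothetical helper: it mimics shell-style matching with fnmatch.filter.
    """
    matched = set()
    for pattern in patterns:
        matched.update(fnmatch.filter(registered, pattern))
    return sorted(matched)


# With the asterisk, every per-subject MMLU task in the list is selected.
print(expand_tasks(["hendrycksTest*"]))

# Without it, the pattern must equal a task name exactly, so nothing matches.
print(expand_tasks(["hendrycksTest"]))
```

When no task matches, nothing is evaluated, which is consistent with the empty 'results': {} and the NaN averages in the log above. In a shell, quoting the pattern ('hendrycksTest*') keeps the shell from expanding the asterisk against local file names.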

brisker commented 9 months ago

@ChenMnZ I have already replaced all the files in the lm_eval folder, except for, but I still get different fp16 results on the winogrande dataset:

llama2-7b-omniquant-fp16: 67.24
llama2-7b-lm-eval-original-code-fp16( 69.22

In my understanding, the difference in cannot influence the accuracy. Are there any other reasons for the accuracy difference?

ChenMnZ commented 9 months ago

@brisker This repo loads the model with the float16 type.

You can try to replace float16 with bfloat16.
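As a rough illustration of why the load dtype can matter at all, the sketch below rounds the same value to float16 and to bfloat16 using only the standard library. The helpers to_float16 and to_bfloat16 are hypothetical names introduced here, and the bfloat16 rounding (round-to-nearest-even on the upper 16 bits of a float32) is an assumption about how a framework would convert; this is not the repo's loading code and does not by itself explain the winogrande gap reported above.

```python
import struct


def to_float16(x: float) -> float:
    """Round x to IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x: float) -> float:
    """Round x to bfloat16 (8 exponent bits, 7 mantissa bits).

    Sketch for finite inputs only: keep the top 16 bits of the float32
    encoding, rounding to nearest even on the discarded lower 16 bits.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding_bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


value = 0.1
print("float16 :", to_float16(value))
print("bfloat16:", to_bfloat16(value))
print("differ  :", to_float16(value) != to_bfloat16(value))
```

float16 keeps more mantissa bits, while bfloat16 keeps the float32 exponent range, so the two formats round weights and activations differently; small differences of this kind can shift borderline predictions in either direction.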

brisker commented 9 months ago

@ChenMnZ But in the run below I also use float16:

llama2-7b-lm-eval-original-code-fp16( 69.22

ChenMnZ commented 9 months ago

@brisker You can try to replace the and modify the simple_evaluate function to fit the original OmniQuant code.

I think such an accuracy difference is caused by some update in