SefaZeng closed this issue 4 months ago
Hi,
It appears that you were using the base model on a question-answer pair. Could you please use natural text instead?
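For illustration, a minimal sketch of computing base-model loss on plain natural text with `transformers` (the sample passage and settings here are assumptions for the example, not from this thread):

```python
# A minimal sketch of measuring base-model loss on plain natural text.
# The sample passage below is an arbitrary example, not from this thread.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

text = (
    "The Amazon rainforest spans nine countries and hosts an enormous "
    "diversity of plant and animal species."
)
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    # Passing labels makes the model return the mean cross-entropy loss.
    outputs = model(**inputs, labels=inputs["input_ids"])

print(f"loss: {outputs.loss.item():.4f}")
```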
> Hi,
> It appears that you were using the base model on a question-answer pair. Could you please use natural text instead?
@jklj077 I changed the text to:
Darryl learns that freezing temperatures may help cause weathering. Which statement explains how freezing temperatures most likely cause weathering?
The loss value is 4.11, which is still very high for a language model.
> Hi,
> It appears that you were using the base model on a question-answer pair. Could you please use natural text instead?
I am just wondering how to reproduce Qwen2's results with lm_eval.
This is the MMLU score from lm_eval for Qwen2-1.5B:

| Groups | Version | Filter | n-shot | Metric | Value |   | Stderr |
|---|---|---|---|---|---|---|---|
| mmlu | N/A | none | 0 | acc | 0.2295 | ± | 0.0035 |
| - humanities | N/A | none | 5 | acc | 0.2421 | ± | 0.0062 |
| - other | N/A | none | 5 | acc | 0.2398 | ± | 0.0076 |
| - social_sciences | N/A | none | 5 | acc | 0.2171 | ± | 0.0074 |
| - stem | N/A | none | 5 | acc | 0.2125 | ± | 0.0073 |
Hi,
For MMLU, we follow the practice of the Open LLM Leaderboard (which makes use of lm_evaluation_harness). The automatic evaluation results for Qwen2-1.5B, run by the Hugging Face team, can be found at https://huggingface.co/datasets/open-llm-leaderboard/details_Qwen__Qwen2-1.5B.
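For reference, a hedged sketch of a leaderboard-style 5-shot MMLU run via the harness's Python API; `simple_evaluate` and its arguments are as in lm-evaluation-harness v0.4.x, and names or result keys may differ across versions:

```python
# A sketch of a 5-shot MMLU run via the lm-evaluation-harness Python API.
# Assumes v0.4.x; argument names and result keys may differ across versions.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2-1.5B,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"].get("mmlu"))
```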
I got the same result as @SefaZeng when I load the model with torch.float16, but it turns normal with float32. Why? I am using hendrycks's code.
Hi @davendw49, can you try running in bf16? The model is trained in bf16 and the test is also done in bf16.
If you must use fp16, you may need to try a different attention implementation; otherwise, inf or nan values may be encountered.
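For example, a minimal sketch of loading the model in bf16 with `transformers`; the `attn_implementation` value shown is just one of the options the library supports, not a confirmed fix:

```python
# A minimal sketch of loading the model in bf16, matching the training precision.
# The attn_implementation value is one option transformers supports,
# not a confirmed fix for the fp16 issue.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B",
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",  # alternatives: "eager", "flash_attention_2"
)
```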
A simple script test:
avg_loss comes out to 4.48. Isn't that value too high?