QwenLM / Qwen2.5

Qwen2.5 is the large language model series developed by Qwen team, Alibaba Cloud.

Qwen2-1.5B loss is very high #524

Closed SefaZeng closed 4 months ago

SefaZeng commented 4 months ago

A simple test script:

#coding:utf-8
from transformers import AutoModelForCausalLM, AutoTokenizer
from torch.nn import CrossEntropyLoss
import torch

model_path = "."

model = AutoModelForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)

text = "Question: Darryl learns that freezing temperatures may help cause weathering. Which statement explains how freezing temperatures most likely cause weathering?\nAnswer: by freezing the leaves on"
loss_func = CrossEntropyLoss(reduction="none")

input_ids = tokenizer(text, return_tensors='pt')
# Targets are the input ids shifted left by one position.
labels = input_ids['input_ids'][:, 1:]

output = model(**input_ids)

# Drop the logits of the last position, which has no target.
logits = output.logits[:, :-1]
loss = loss_func(logits.transpose(1, 2), labels)
# Note: this divides by the total number of input tokens, although only len-1 of them are predicted.
num_tokens = input_ids['input_ids'].size(1)
avg_loss = torch.sum(loss).item() / num_tokens

print(avg_loss)

avg_loss comes out to 4.48. Isn't this value too high?
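For reference, the same number can be cross-checked by letting transformers compute the shifted cross-entropy itself. The snippet below is a minimal sketch reusing the variables from the script above; passing `labels` to an `AutoModelForCausalLM` forward call makes it shift the targets internally and return the mean loss over the predicted tokens.

```python
# Minimal sketch: cross-check against the loss computed by transformers itself.
outputs = model(**input_ids, labels=input_ids['input_ids'])
print(outputs.loss.item())  # mean cross-entropy over predicted tokens
```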

jklj077 commented 4 months ago

Hi,

It appears that you were using the base model on a Question-Answer pair. Could you please use natural texts?

SefaZeng commented 4 months ago

> Hi,
>
> It appears that you were using the base model on a Question-Answer pair. Could you please use natural texts?

@jklj077 I changed the text to

> Darryl learns that freezing temperatures may help cause weathering. Which statement explains how freezing temperatures most likely cause weathering?

The loss value is 4.11, which is still very high for a language model.

SefaZeng commented 4 months ago

> Hi,
>
> It appears that you were using the base model on a Question-Answer pair. Could you please use natural texts?

I am just wondering how to reproduce Qwen2's reported results with lm_eval?

SefaZeng commented 4 months ago
This is the MMLU score from lm_eval for Qwen2-1.5B:

| Groups            | Version | Filter | n-shot | Metric | Value  | Stderr   |
|-------------------|---------|--------|--------|--------|--------|----------|
| mmlu              | N/A     | none   | 0      | acc    | 0.2295 | ± 0.0035 |
| - humanities      | N/A     | none   | 5      | acc    | 0.2421 | ± 0.0062 |
| - other           | N/A     | none   | 5      | acc    | 0.2398 | ± 0.0076 |
| - social_sciences | N/A     | none   | 5      | acc    | 0.2171 | ± 0.0074 |
| - stem            | N/A     | none   | 5      | acc    | 0.2125 | ± 0.0073 |

jklj077 commented 4 months ago

Hi,

For MMLU, we follow the practice of the Open LLM Leaderboard (which uses lm_evaluation_harness). The automatic evaluation results run by the HuggingFace team for Qwen2-1.5B can be found at https://huggingface.co/datasets/open-llm-leaderboard/details_Qwen__Qwen2-1.5B.
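If you want to run a comparable setting locally, the sketch below uses the Python API of lm-evaluation-harness (v0.4+). The model ID, dtype, and batch size are illustrative assumptions, and the leaderboard's exact prompt configuration may differ, so treat this as a starting point rather than the leaderboard's exact pipeline.

```python
# Sketch: 5-shot MMLU with lm-evaluation-harness (v0.4+ Python API).
# Model name, dtype, and batch size are assumptions for illustration.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen2-1.5B,dtype=bfloat16",
    tasks=["mmlu"],
    num_fewshot=5,
    batch_size=8,
)
print(results["results"]["mmlu"])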

davendw49 commented 4 months ago

> This is the MMLU score from lm_eval for Qwen2-1.5B:
>
> | Groups            | Version | Filter | n-shot | Metric | Value  | Stderr   |
> |-------------------|---------|--------|--------|--------|--------|----------|
> | mmlu              | N/A     | none   | 0      | acc    | 0.2295 | ± 0.0035 |
> | - humanities      | N/A     | none   | 5      | acc    | 0.2421 | ± 0.0062 |
> | - other           | N/A     | none   | 5      | acc    | 0.2398 | ± 0.0076 |
> | - social_sciences | N/A     | none   | 5      | acc    | 0.2171 | ± 0.0074 |
> | - stem            | N/A     | none   | 5      | acc    | 0.2125 | ± 0.0073 |

I got the same result as @SefaZeng when I load the model with torch.float16, and it becomes normal with float32. Why? I am using hendrycks's code.

jklj077 commented 4 months ago

Hi @davendw49, can you try running in bf16? The model is trained in bf16 and the evaluation is also done in bf16.
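In case it helps, here is a minimal sketch of loading the checkpoint in bf16 using standard transformers arguments (the Hub model ID below is an assumption):

```python
import torch
from transformers import AutoModelForCausalLM

# Load in bfloat16 to match the precision used for training and evaluation.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B",
    torch_dtype=torch.bfloat16,
)
```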

jklj077 commented 4 months ago

If you must use fp16, you may need to try a different attention implementation; otherwise, inf or nan values may be encountered.
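For example, something along these lines (a sketch; `attn_implementation` is the standard `from_pretrained` argument in recent transformers versions, and the backend chosen here is only illustrative):

```python
import torch
from transformers import AutoModelForCausalLM

# If fp16 is required, try a different attention backend
# (e.g. "eager", "sdpa", or "flash_attention_2") to avoid inf/nan.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2-1.5B",
    torch_dtype=torch.float16,
    attn_implementation="eager",
)
```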