OFA-Sys / gsm8k-ScRel

Codes and Data for Scaling Relationship on Learning Mathematical Reasoning with Large Language Models
https://arxiv.org/abs/2308.01825

Error when loading the author's open-sourced OFA-Sys/gsm8k-rft-llama7b-u13b #8

Closed Haskely closed 1 year ago

Haskely commented 1 year ago

Summary: the 49+ score is reproducible. Two things to watch out for: 1. use LlamaTokenizer; 2. the pad_token behaves incorrectly, so its interference has to be ruled out.

https://github.com/Haskely/gsm8k-rft-llama7b-u13b_evaluation/tree/main
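
In short, a minimal sketch of the setup that ends up working (see the comments below for how each point was established; the full evaluation script is in the repository linked above):

from transformers import LlamaForCausalLM, LlamaTokenizer

model_path = "OFA-Sys/gsm8k-rft-llama7b-u13b"
# 1. Load the slow LlamaTokenizer; AutoTokenizer / LlamaTokenizerFast hit a RecursionError.
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path).cuda()
# 2. Generate one question at a time (batch_size=1) so the problematic pad_token
#    never appears in the inputs (details in the comments below).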

Using:


from transformers import AutoTokenizer

model_path = "OFA-Sys/gsm8k-rft-llama7b-u13b"
tokenizer = AutoTokenizer.from_pretrained(model_path)


raises the following error:

File "/data/public/zhangzixin/conda_envs/nova/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 250, in convert_tokens_to_ids return self._convert_token_to_id_with_added_voc(tokens) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/public/zhangzixin/conda_envs/nova/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 257, in _convert_token_to_id_with_added_voc return self.unk_token_id ^^^^^^^^^^^^^^^^^ File "/data/public/zhangzixin/conda_envs/nova/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1155, in unk_token_id return self.convert_tokens_to_ids(self.unk_token) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/public/zhangzixin/conda_envs/nova/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 250, in convert_tokens_to_ids return self._convert_token_to_id_with_added_voc(tokens) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/public/zhangzixin/conda_envs/nova/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 257, in _convert_token_to_id_with_added_voc return self.unk_token_id ^^^^^^^^^^^^^^^^^ File "/data/public/zhangzixin/conda_envs/nova/lib/python3.11/site-packages/transformers/tokenization_utils_base.py", line 1155, in unk_token_id return self.convert_tokens_to_ids(self.unk_token) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/public/zhangzixin/conda_envs/nova/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 250, in convert_tokens_to_ids return self._convert_token_to_id_with_added_voc(tokens) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/data/public/zhangzixin/conda_envs/nova/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 257, in _convert_token_to_id_with_added_voc return self.unk_token_id ^^^^^^^^^^^^^^^^^ RecursionError: maximum recursion depth exceeded



However, when I check this repository's source code, the loading method is the same: https://github.com/OFA-Sys/gsm8k-ScRel/blob/f4d01761ec03d88a39486399c4617d29ee1dca7f/test.py#L185

My transformers version: `transformers 4.31.0`

PS: manually loading with `LlamaTokenizer.from_pretrained(model_path)` does not raise an error, so I am using that approach to run the evaluation for now.

GanjinZero commented 1 year ago

Could you try it this way and check whether the result matches what is reported in my paper? Due to time constraints, I have not yet had a chance to test the model uploaded to HF.

Haskely commented 1 year ago

Follow-up: the minimal test pipeline I am currently using is as follows:

from transformers import (
    AutoModel,
    AutoTokenizer,
    LlamaTokenizer,
    LlamaTokenizerFast,
    LlamaForCausalLM,
    GenerationConfig,
)

model_path = "OFA-Sys/gsm8k-rft-llama7b-u13b"
tokenizer = LlamaTokenizer.from_pretrained(model_path, padding_side="left")
# tokenizer = LlamaTokenizerFast.from_pretrained(model_path)  # raises an error!
# tokenizer = AutoTokenizer.from_pretrained(model_path)  # raises an error!
print(tokenizer.pad_token)
print(tokenizer.bos_token)
print(tokenizer.unk_token)
print(tokenizer.eos_token)
print(tokenizer.truncation_side)
print(tokenizer.padding_side)
model = LlamaForCausalLM.from_pretrained(model_path).cuda()
questions = [
    "Janet’s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?",
    "A robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?",
]
prompt_no_input = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{query}\n\n### Response:"
)
input_strs = [prompt_no_input.format(query=q) for q in questions]
input_ids = tokenizer(input_strs, padding=True, return_tensors="pt").input_ids.to(
    model.device
)
output_ids = model.generate(
    input_ids, generation_config=GenerationConfig(do_sample=False, max_length=512)
).tolist()
output_strs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)
print(output_strs)

The output of the code above:

[
    "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nJanet\u2019s ducks lay 16 eggs per day. She eats three for breakfast every morning and bakes muffins for her friends every day with four. She sells the remainder at the farmers' market daily for $2 per fresh duck egg. How much in dollars does she make every day at the farmers' market?\n\n### Response:Janet\u2019s ducks lay 16 eggs per day and she eats 3 for breakfast so she has 16-3 = <<16-3=13>>13 eggs left\nJanet bakes muffins for her friends every day with 4 and she has 13 eggs left so she makes 13/4 = 3 duck muffins\nShe sells the remainder at the farmers' market daily for $2 per fresh duck egg and she makes 3 duck muffins a day so she makes 2*3 = $<<2*3=6>>6 per day\n#### 6",
    "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nA robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?\n\n### Response:It takes 2*0.5=<<2*0.5=1>>1 bolt of white\nSo it takes 2+1=<<2+1=3>>3 bolts\n#### 3"
]

This looks normal. The full evaluation is in progress...

Update:

Scoring is done by extracting the last number from the text and comparing its value against the gold answer:

import re


def extract_last_num(text: str) -> float | None:
    res = re.findall(r"(\d+(\.\d+)?)", text)  # matches numbers like 123456.789
    if len(res) > 0:
        num_str = res[-1][0]
        return float(num_str)
    else:
        return None
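
For completeness, a tiny worked example of the comparison (the gold/pred strings are quoted from this thread; the `pass_at_1` helper is only my own illustration, not the repository's code):

# One gold answer and one model output quoted above; both end in "#### 3".
gold = "It takes 2/2=<<2/2=1>>1 bolt of white fiber\nSo the total amount of fabric is 2+1=<<2+1=3>>3 bolts of fabric\n#### 3"
pred = "It takes 2*0.5=<<2*0.5=1>>1 bolt of white\nSo it takes 2+1=<<2+1=3>>3 bolts\n#### 3"
assert extract_last_num(gold) == extract_last_num(pred) == 3.0


def pass_at_1(preds: list[str], golds: list[str]) -> float:
    # Fraction of questions where the extracted final numbers match.
    hits = sum(
        extract_last_num(p) is not None and extract_last_num(p) == extract_last_num(g)
        for p, g in zip(preds, golds)
    )
    return hits / len(golds)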

The measured Pass@1 accuracy is only 444/1319 = 33.66%, much lower than the 49.3% reported in the paper.

Haskely commented 1 year ago

Reproduction code: https://github.com/Haskely/gsm8k-rft-llama7b-u13b_evaluation/blob/main/llama_gen_and_eval.py

Wrong-answer examples: https://github.com/Haskely/gsm8k-rft-llama7b-u13b_evaluation/blob/0864fb080f567abd3586c42665522c641b8f0d91/output_fp32_bs32/wrong.json#L1-L37

[
    {
        "index": 1,
        "gsm8k_data": {
            "question": "A robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?",
            "answer": "It takes 2/2=<<2/2=1>>1 bolt of white fiber\nSo the total amount of fabric is 2+1=<<2+1=3>>3 bolts of fabric\n#### 3"
        },
        "input_str": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nA robe takes 2 bolts of blue fiber and half that much white fiber.  How many bolts in total does it take?\n\n### Response:",
        "output_str": "The robe takes 2*2=<<2*2=4>>4 bolts of blue and white.\nIt takes 4/2=<<4/2=2>>2 bolts of each color.\n#### 2",
        "extract_true_num": 3.0,
        "extract_pred_num": 2.0,
        "is_correct": false
    },
    {
        "index": 2,
        "gsm8k_data": {
            "question": "Josh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make?",
            "answer": "The cost of the house and repairs came out to 80,000+50,000=$<<80000+50000=130000>>130,000\nHe increased the value of the house by 80,000*1.5=<<80000*1.5=120000>>120,000\nSo the new value of the house is 120,000+80,000=$<<120000+80000=200000>>200,000\nSo he made a profit of 200,000-130,000=$<<200000-130000=70000>>70,000\n#### 70000"
        },
        "input_str": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nJosh decides to try flipping a house.  He buys a house for $80,000 and then puts in $50,000 in repairs.  This increased the value of the house by 150%.  How much profit did he make?\n\n### Response:",
        "output_str": "The value of the house increased by 800000*.15=$<<80000*.15=12000>>12,000\nSo he gained 12000-5000=$<<12000-5000=7000>>7000\nSo his profit was 7000-8000=$<<7000-8000=1000>>1000\n#### 1000",
        "extract_true_num": 70000.0,
        "extract_pred_num": 1000.0,
        "is_correct": false
    },
    {
        "index": 4,
        "gsm8k_data": {
            "question": "Every day, Wendi feeds each of her chickens three cups of mixed chicken feed, containing seeds, mealworms and vegetables to help keep them healthy.  She gives the chickens their feed in three separate meals. In the morning, she gives her flock of chickens 15 cups of feed.  In the afternoon, she gives her chickens another 25 cups of feed.  How many cups of feed does she need to give her chickens in the final meal of the day if the size of Wendi's flock is 20 chickens?",
            "answer": "If each chicken eats 3 cups of feed per day, then for 20 chickens they would need 3*20=<<3*20=60>>60 cups of feed per day.\nIf she feeds the flock 15 cups of feed in the morning, and 25 cups in the afternoon, then the final meal would require 60-15-25=<<60-15-25=20>>20 cups of chicken feed.\n#### 20"
        },
        "input_str": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nEvery day, Wendi feeds each of her chickens three cups of mixed chicken feed, containing seeds, mealworms and vegetables to help keep them healthy.  She gives the chickens their feed in three separate meals. In the morning, she gives her flock of chickens 15 cups of feed.  In the afternoon, she gives her chickens another 25 cups of feed.  How many cups of feed does she need to give her chickens in the final meal of the day if the size of Wendi's flock is 20 chickens?\n\n### Response:",
        "output_str": "If Wendi feeds each of her chickens three cups of mixed chicken feed, containing seeds, mealworms and vegetables to help keep them healthy, and she gives the chickens their feed in three separate meals, then in total, she feeds each chicken 3*3=<<3*3=9>>9 cups of feed per day.\nIf the size of Wendi's flock of chickens is 20, and she feeds each chicken 9 cups of feed per day, then she needs 20*9=<<20*9=180>>180 cups of feed per day for her entire flock.\nIf she has already given her chickens 15 cups of feed in the morning, and another 25 cups in the afternoon, then she needs to give them another 180-15-25=<<180-15-25=140>>140 cups of feed in the final meal of the day.\n#### 140",
        "extract_true_num": 20.0,
        "extract_pred_num": 140.0,
        "is_correct": false
    },
    ...
]

The outputs look normal in form, but the score keeps coming out around 30, and I cannot figure out why the paper's result will not reproduce. Could the author please share evaluation code that reproduces the 49 score for the "OFA-Sys/gsm8k-rft-llama7b-u13b" model?

GanjinZero commented 1 year ago

My eval result (gsm8k_sft_llama7b_eq7b.7b2.13b.13b2/raw_generation_greedy.json): 650 correct out of 1319 = 49.27975739196361%

GanjinZero commented 1 year ago

[["Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nA robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?\n\n### Response:", "It takes 20.5=<<20.5=1>>1 bolt of white fiber\nSo it takes 2+1=<<2+1=3>>3 bolts in total\n#### 3"]]

GanjinZero commented 1 year ago

generation_config = GenerationConfig(
    temperature=tempera,
    do_sample=args.do_sample,
    num_beams=return_seq_num,
    max_new_tokens=256,
    num_return_sequences=return_seq_num,
)

GanjinZero commented 1 year ago

From my generation: [ [ "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nMark's car breaks down and he needs to get a new radiator. The cost for a new radiator is $400 but he goes to get it at a junk shop and gets it for 80% off. He then hires a mechanic to install it and it takes 3 hours at $50 an hour. How much did he pay?\n\n### Response:", "The radiator cost $400 but he gets 80% off so that's 400*.80 = $<<400*.80=320>>320 off\nThe radiator cost $400 but he only pays $320 so that's 400-320 = $<<400-320=80>>80\nThe mechanic works for 3 hours at $50 an hour so that's 3*50 = $<<3*50=150>>150\nThe radiator cost $80 and the mechanic charged $150 so that's 80+150 = $<<80+150=230>>230\n#### 230" ] ]

From your code: { "index": 1316, "gsm8k_data": { "question": "Mark's car breaks down and he needs to get a new radiator. The cost for a new radiator is $400 but he goes to get it at a junk shop and gets it for 80% off. He then hires a mechanic to install it and it takes 3 hours at $50 an hour. How much did he pay?", "answer": "The discount on the radiator was 400*.8=$<<400*.8=320>>320\nSo he paid 400-320=$<<400-320=80>>80\nThe mechanic charges 3*50=$<<3*50=150>>150\nSo in total he paid 80+150=$<<80+150=230>>230\n#### 230" }, "input_str": "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n\n### Instruction:\nMark's car breaks down and he needs to get a new radiator. The cost for a new radiator is $400 but he goes to get it at a junk shop and gets it for 80% off. He then hires a mechanic to install it and it takes 3 hours at $50 an hour. How much did he pay?\n\n### Response:", "output_str": "The radiator cost $400 but he gets 80% off so that's 400*.80 = $<<400*.80=320>>320 discount\nThe radiator cost $320 and he had to pay $50 an hour for 3 hours of labor so that's 320+50*3 = $<<320+50*3=470>>470\n#### 470", "extract_true_num": 230.0, "extract_pred_num": 470.0, "is_correct": false },

Haskely commented 1 year ago

It might be an attention-mask alignment issue or something similar; let me revise and try again.

Haskely commented 1 year ago

Update:

I now measure a score of 49.12%. The only change was setting batch_size=1, which rules out any interference from special tokens. It seems the earlier score drop was caused by the pad_token = "[PAD]" not being truly ignored by the model at inference time.

https://github.com/Haskely/gsm8k-rft-llama7b-u13b_evaluation/tree/main
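
For context, a minimal sketch of what the batch_size=1 change amounts to (it reuses `tokenizer`, `model`, `questions`, and `prompt_no_input` from my earlier snippet; the exact loop is in the linked script):

from transformers import GenerationConfig

# One question at a time: no padding, so the broken [PAD] token never
# appears in input_ids and cannot influence generation.
for question in questions:
    input_ids = tokenizer(
        prompt_no_input.format(query=question), return_tensors="pt"
    ).input_ids.to(model.device)
    output_ids = model.generate(
        input_ids, generation_config=GenerationConfig(do_sample=False, max_length=512)
    )
    print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])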

However, looking at:

from transformers import LlamaTokenizer

model_path = "OFA-Sys/gsm8k-rft-llama7b-u13b"
tokenizer = LlamaTokenizer.from_pretrained(model_path, padding_side="left")
print(tokenizer.pad_token)
print(tokenizer.pad_token_id)
tokenizer(["hello", "hello world, are you ok?"], padding=True)

the output is:

You are using the legacy behaviour of the <class 'transformers.models.llama.tokenization_llama.LlamaTokenizer'>. This means that tokens that come after special tokens will not be properly handled. We recommend you to read the related pull request available at https://github.com/huggingface/transformers/pull/24565
[PAD]
0
{'input_ids': [[0, 0, 0, 0, 0, 0, 2, 22172], [2, 22172, 3186, 29892, 526, 366, 3431, 29973]], 'attention_mask': [[0, 0, 0, 0, 0, 0, 1, 1], [1, 1, 1, 1, 1, 1, 1, 1]]}

This looks perfectly normal, and even when the attention mask is not passed in explicitly, the generate function handles it automatically, so I still have not worked out where exactly things go wrong.

GanjinZero commented 1 year ago

It is very likely a tokenizer problem: the pad token is being passed incorrectly.

AegeanYan commented 1 year ago

[PAD] seems to have enlarged the vocabulary by one, from 32000 to 32001; I am not sure why the authors did it this way.
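
A quick way to inspect the mismatch (these checks are only an illustration, not code from the repository):

from transformers import LlamaForCausalLM, LlamaTokenizer

model_path = "OFA-Sys/gsm8k-rft-llama7b-u13b"
tokenizer = LlamaTokenizer.from_pretrained(model_path)
model = LlamaForCausalLM.from_pretrained(model_path)

print(len(tokenizer))                                # tokenizer vocabulary size
print(model.get_input_embeddings().weight.shape[0])  # embedding rows in the model
print(tokenizer.pad_token, tokenizer.pad_token_id)   # pad token string and the id actually used (0 above)
print(tokenizer.convert_tokens_to_ids("[PAD]"))      # where "[PAD]" really maps
print(tokenizer.convert_ids_to_tokens(0))            # what id 0, used for padding above, decodes to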

GanjinZero commented 1 year ago

We used a very early version of the Stanford Alpaca code, hence this legacy quirk.