Msagent fine-tune datasets formatting

InternLM / xtuner

An efficient, flexible and full-featured toolkit for fine-tuning LLM (InternLM2, Llama3, Phi3, Qwen, Mistral, ...)

https://xtuner.readthedocs.io/zh-cn/latest/

Apache License 2.0


Msagent fine-tune datasets formatting #210

Closed xiaohangguo closed 4 months ago

xiaohangguo commented 10 months ago

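I have three questions. 1. xtuner currently supports LoRA training on MSAgent-style datasets, and I have already reproduced that. Is full-parameter fine-tuning possible as well? 2. How do I build the training dataset with xtuner? Is it enough to add <|System|>, <|User|>, and <|Bot|> at the corresponding positions below and then mock up the input and output? 3. Does the example below count as one three-turn conversation in an MSAgent-like training corpus?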

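<s> :You are an assistant that can call external tools. The available tools are:
{'GoogleSearch': 'An API that returns results from Google Search. Use it when you need a short, clear answer to a specific question. The input should be a search query.','PythonInterpreter': "Executes Python code. The code must be a function, the function must be named 'solution', and the code should reflect your reasoning."}
If you use a tool, reply in the following format:
Thought:think about what the current step needs to solve and whether a tool is needed
Action:the tool name, which must be chosen from ['GoogleSearch', 'PythonInterpreter']
Action Input:the tool's input arguments
Tool results are returned in the following format:
Response:the result of calling the tool
If you already know the answer, or you do not need a tool, reply in the following format:
Thought:the reasoning that leads to the final answer
Final Answer:the final answer
Begin!

User:What will the weather be like in Shanghai tomorrow?
Assistant:Thought:To answer this question, I need to look up the latest weather forecast.
     Action:GoogleSearch
     Action Input:Shanghai weather forecast for tomorrow
     Response:According to the latest forecast, Shanghai will be sunny turning cloudy tomorrow, with temperatures between 20 and 28 degrees.

User:What is 3 to the power of 7?
Assistant:Thought:This is a simple arithmetic question, so I can answer it directly.
     Final Answer:3 to the power of 7 is 2187.

User:I need this paragraph translated into Chinese.
Assistant:Thought:This requires a translation tool or service.
     Action:PythonInterpreter
     Action Input:```python
     def solution(text):
         # This is only an example; real code should call a translation API or library
         translated_text = translate_to_chinese(text)
         return translated_text
     ```
 Response:Here is the paragraph translated into Chinese.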

xiaohangguo commented 10 months ago

I wrote a sample: a single conversation entry containing three turns. Is it correct?

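{
"conversation":[
{"system" :"You are an assistant that can call external tools. The available tools are:
{'GoogleSearch': 'An API that returns results from Google Search. Use it when you need a short, clear answer to a specific question. The input should be a search query.','PythonInterpreter': "Executes Python code. The code must be a function, the function must be named 'solution', and the code should reflect your reasoning."}
If you use a tool, reply in the following format:
Thought:think about what the current step needs to solve and whether a tool is needed
Action:the tool name, which must be chosen from ['GoogleSearch', 'PythonInterpreter']
Action Input:the tool's input arguments
Tool results are returned in the following format:
Response:the result of calling the tool
If you already know the answer, or you do not need a tool, reply in the following format:
Thought:the reasoning that leads to the final answer
Final Answer:the final answer
Begin!",

"input":"What will the weather be like in Shanghai tomorrow?",
"output":"Thought:To answer this question, I need to look up the latest weather forecast.
     Action:GoogleSearch
     Action Input:Shanghai weather forecast for tomorrow
     Response:According to the latest forecast, Shanghai will be sunny turning cloudy tomorrow, with temperatures between 20 and 28 degrees.
Final Answer:Shanghai will be sunny turning cloudy tomorrow, with temperatures between 20 and 28 degrees."
},
{
"input":"What is 3 to the power of 7?"
"output":"Thought:This is a simple arithmetic question, so I can answer it directly.
     Final Answer:3 to the power of 7 is 2187."},

{"input":"I need this paragraph translated into Chinese."
"output":"Thought:This requires a translation tool or service.
     Action:PythonInterpreter
     Action Input:```python
     def solution(text):
         # This is only an example; real code should call a translation API or library
         translated_text = translate_to_chinese(text)
         return translated_text
     ```
 Response:Here is the paragraph translated into Chinese. xxxxxxx

Final Answer: The translated result is xxxxxxx."}

] }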

LZHgrla commented 10 months ago

@xiaohangguo

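Yes, full-parameter fine-tuning is possible. See https://github.com/InternLM/xtuner/blob/main/xtuner/configs/internlm/internlm_7b/internlm_7b_full_alpaca_e3.py

Process the dataset into the format below (remember to fill in [Results]). The general rule is: put the result returned by the API in "system", the user's input in "input", and the model's generated text in "output". Also set dataset_map_fn to None and leave template_map_fn unchanged, so that fields such as <|System|> and <|User|> are inserted.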

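[{
"conversation":[
    {
        "system": "You are an assistant that can call external tools. The available tools are:\n{'GoogleSearch': 'An API that returns results from Google Search. Use it when you need a short, clear answer to a specific question. The input should be a search query.','PythonInterpreter': \"Executes Python code. The code must be a function, the function must be named 'solution', and the code should reflect your reasoning.\"}\nIf you use a tool, reply in the following format:\nThought:think about what the current step needs to solve and whether a tool is needed\nAction:the tool name, which must be chosen from ['GoogleSearch', 'PythonInterpreter']\nAction Input:the tool's input arguments\nTool results are returned in the following format:\nResponse:the result of calling the tool\nIf you already know the answer, or you do not need a tool, reply in the following format:\nThought:the reasoning that leads to the final answer\nFinal Answer:the final answer\nBegin!",
        "input": "What will the weather be like in Shanghai tomorrow?",
        "output": "Thought:To answer this question, I need to look up the latest weather forecast.\nAction:GoogleSearch\nAction Input:Shanghai weather forecast for tomorrow"
    },
    {
        "system": "Response:According to the latest forecast, Shanghai will be sunny turning cloudy tomorrow, with temperatures between 20 and 28 degrees.",
        "input": "",
        "output": "Final Answer:[Results]"
    },
    {
        "system": "",
        "input": "What is 3 to the power of 7?",
        "output": "Thought:This is a simple arithmetic question, so I can answer it directly.\nFinal Answer:3 to the power of 7 is 2187."
    },
    ...
]
}]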

For the third question, see above.
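The mapping rule described above (tool result in "system", user text in "input", model text in "output") can be sketched as a small builder that assembles one sample and serializes it. This is only a minimal sketch: `make_turn` and `build_sample` are hypothetical helper names that are not part of xtuner, and the texts are shortened placeholders.

```python
import json


def make_turn(output, user_input="", system=""):
    """Build one turn in the {"system", "input", "output"} layout."""
    if not output.strip():
        raise ValueError("every turn needs a non-empty model output")
    return {"system": system, "input": user_input, "output": output}


def build_sample(turns):
    """Wrap a list of turns into one multi-turn training sample."""
    return {"conversation": list(turns)}


sample = build_sample([
    # Turn 1: the user asks, and the model decides to call a tool.
    make_turn(
        system="You are an assistant that can call external tools ...",
        user_input="What will the weather be like in Shanghai tomorrow?",
        output="Thought:I need the latest forecast.\n"
               "Action:GoogleSearch\n"
               "Action Input:Shanghai weather forecast for tomorrow",
    ),
    # Turn 2: the tool result arrives via "system"; the model summarizes it.
    make_turn(
        system="Response:Sunny turning cloudy, 20 to 28 degrees.",
        output="Final Answer:Shanghai will be sunny turning cloudy tomorrow, "
               "between 20 and 28 degrees.",
    ),
])

# json.dumps escapes embedded newlines as \n, so the result stays valid JSON.
dataset_json = json.dumps([sample], ensure_ascii=False, indent=2)
```

Serializing through `json.dumps` rather than writing the file by hand avoids the raw line breaks and missing commas that make a hand-written sample unparseable.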

xiaohangguo commented 10 months ago


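Thank you so much!! About this part of the second turn:

            "system": "Response:According to the latest forecast, Shanghai will be sunny turning cloudy tomorrow, with temperatures between 20 and 28 degrees.",
            "input": "",
            "output": "Final Answer:[Results]"

Is [Results] a placeholder that lagent fills in based on the response in system? Should I literally write "[Results]", or should I also write a mock answer here, for example: "output":"Final Answer:[The forecast shows that Shanghai will be sunny turning cloudy tomorrow, with temperatures between 20 and 28 degrees.]"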

LZHgrla commented 10 months ago

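@xiaohangguo The "[Results]" part also needs to be written out as a mock answer. The main purpose of this turn is to teach the LLM to organize and summarize an answer from the result returned by the API.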

xiaohangguo commented 10 months ago

Understood, that matches what I thought. Thank you so much! You are my hero, bro @LZHgrla

xiaohangguo commented 10 months ago

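A new question: when the result comes from the Python interpreter, what structure and content should be returned in "system"? I'm not sure how to write this part. Is the following correct?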

{
    "conversation": [
        {
            "system": "You are an assistant that can call external tools...",
            "input": "Design a single-transistor NMOSFET amplifier that meets specific requirements...",
            "output": "Thought:First, initialize the basic design parameters, such as the supply voltage and load capacitance. These parameters will be used in the later calculations and simulation.\nAction:PythonInterpreter\nAction Input:\n```python\nfrom lut.lookuptable.MOSinfo import LUT_DB\n\ndef solution():\n    vdd = 1.2\n    vcm = vdd / 2\n    CL = 5e-12\n    RL = 2e3\n    GBW = 100e6\n    Av = 4\n    return vdd, vcm, CL, RL, GBW, Av\n```"
        },
        {
            "system": "PythonInterpreter execution result: result:{'vdd': 1.2, 'vcm': 0.6, 'CL': 5e-12, 'RL': 2000, 'GBW': 100000000.0, 'Av': 4}",
            "input": "",
            "output": "Thought:Next, compute the transconductance gm, which is determined by the gain-bandwidth product GBW and the load capacitance CL.\nAction:PythonInterpreter\nAction Input:\n```python\nimport numpy as np\n\ndef solution(CL, GBW):\n    gm = 2 * np.pi * GBW * CL\n    return gm\n```"
        },
        {
            "system": "PythonInterpreter execution result: result:{'gm': 0.00314}",
            "input": "",
            "output": "Thought:"
        },
      ...

    ]
}

LZHgrla commented 10 months ago

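I don't think this matters much, as long as the format is the same in training and inference. So you can construct the training data according to the format that the lagent framework's code interpreter returns.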
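One way to keep the two formats aligned is to route every tool result through a single formatting function and reuse that function whenever training turns are built. The following is a minimal sketch, assuming the `Response:` prefix from the prompt template earlier in the thread; `format_tool_result` and `tool_turn` are hypothetical helpers rather than lagent or xtuner APIs, and the exact string lagent produces should be checked against its real output.

```python
RESPONSE_PREFIX = "Response:"  # assumed prefix, taken from the prompt template


def format_tool_result(result):
    """Render a tool's return value as the text placed in the "system" field."""
    return f"{RESPONSE_PREFIX}{result}"


def tool_turn(result, model_output):
    """Build the training turn that follows a tool call."""
    return {
        "system": format_tool_result(result),
        "input": "",  # no new user message in a tool-result turn
        "output": model_output,
    }


# Hypothetical tool result and model continuation, for illustration only.
turn = tool_turn({"vdd": 1.2, "vcm": 0.6}, "Thought:Next, compute gm.")
```

If the inference stack later turns out to format results differently, only `format_tool_result` needs to change, and the dataset can be regenerated from it.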


xiaohangguo commented 10 months ago


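Yes, I read the lagent source and confirmed the format of the code passed to the Python interpreter, but I still haven't figured out what form the interpreter's return value takes.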

LZHgrla commented 10 months ago


Follow https://github.com/InternLM/lagent/tree/main#run-a-react-agent-with-internlm, ask a question that involves the code interpreter, and then print response.inner_steps directly to see the format.
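Once `response.inner_steps` has been printed, its role-tagged messages can be regrouped into the `conversation` layout used for training. The following is a minimal sketch, assuming each step is a dict with `role` and `content` keys; `steps_to_conversation` is a hypothetical helper that is not part of lagent or xtuner, and the sample steps are made up for illustration.

```python
def steps_to_conversation(steps, system_prompt=""):
    """Group role-tagged messages into {"system", "input", "output"} turns.

    'user' content becomes "input", 'system' content (tool responses) becomes
    "system", and each 'assistant' message closes the current turn as "output".
    """
    turns = []
    pending = {"system": system_prompt, "input": ""}
    for step in steps:
        role, content = step["role"], step["content"]
        if role == "user":
            pending["input"] = content
        elif role == "system":
            pending["system"] = content
        elif role == "assistant":
            turns.append({**pending, "output": content})
            pending = {"system": "", "input": ""}
        else:
            raise ValueError(f"unexpected role: {role!r}")
    return {"conversation": turns}


# Illustrative steps only; real ones come from response.inner_steps.
steps = [
    {"role": "user", "content": "Sum the odd numbers below 10"},
    {"role": "assistant",
     "content": "Thought:...\nAction:PythonInterpreter\nAction Input:..."},
    {"role": "system", "content": "Response:25"},
    {"role": "assistant", "content": "Final Answer:25"},
]
sample = steps_to_conversation(steps, system_prompt="<tool prompt>")
```

Steps recorded from a real run are worth reviewing by hand before training on them, since a faulty tool response or a wrong final answer would be learned as-is.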

xiaohangguo commented 10 months ago

I tried it, and it outputs something like this:

# Query: "Help me sum the odd numbers below 100"
response = chatbot.chat('帮我实现100以内的奇数求和')
print(response.response)
print(response.inner_steps)

[
{'role': 'user', 'content': '帮我实现100以内的奇数求和'}, 
{'role': 'assistant', 'content': 'Thought: 这是一道计算题，需要用计算器Calculator计算一下100以内的奇数求和\nAction: PythonExecutor\nAction Input: def solution():\n    answer = 1+3+5+7+9+11+13+15+17+19+21+23+25+27+29+31+33+35+37+39+41+43+45+47+49\n    return answer'}, 
{'role': 'system', 'content': 'Response:def solution():\n    answer = 1+3+5+7+9+11+13+15+17+19+21+23+25+27+29+31+33+35+37+39+41+43+45+47+49\n    return answer\n'}, 
{'role': 'assistant', 'content': 'Thought: Base on the result of the code, the answer is:\nFinalAnswer: 100以内的奇数求和为1+3+5+7+9+11+13+15+17+19+21+23+25+27+29+31+33+35+37+39+41+43+45+47+49=1683'}, 
{'role': 'system', 'content': 'Response:Please follow the format\n'}, 
{'role': 'assistant', 'content': 'Thought: Base on the result of the code, the answer is:\nFinalAnswer: 100以内的奇数求和为1+3+5+7+9+11+13+15+17+19+21+23+25+27+29+31+33+35+37+39+41+43+45+47+49=1683'}, 
{'role': 'system', 'content': 'Response:Please follow the format\n'}, 
{'role': 'assistant', 'content': 'Thought: Base on the result of the code, the answer is:\nFinalAnswer: 100以内的奇数求和为1+3+5+7+9+11+13+15+17+19+21+23+25+27+29+31+33+35+37+39+41+43+45+47+49=1683'}, 
{'role': 'system', 'content': 'Response:Please follow the format\n'}]

Why does the second message here, the system one, not contain a computed result and instead repeat the Action Input that the assistant output just above?
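For comparison, the value that a computed response would carry can be checked by executing the generated code directly. The following is a minimal sketch using plain `exec`, which is only an illustration and not how lagent's interpreter is implemented; the code string is copied from the Action Input in the log above. Model-generated code should only be executed in a sandbox.

```python
# Code string copied from the assistant's Action Input in the log.
action_input = (
    "def solution():\n"
    "    answer = 1+3+5+7+9+11+13+15+17+19+21+23+25+27+29+31+33+35+37+39+41+43+45+47+49\n"
    "    return answer"
)

namespace = {}
exec(action_input, namespace)       # defines solution() inside `namespace`
executed = namespace["solution"]()  # the value an actual execution would yield
expected = sum(range(1, 100, 2))    # sum of all odd numbers below 100

# A computed tool response would carry the value, not the source code.
system_turn = f"Response:{executed}"
```

Here `executed` is 625, because the generated expression stops at 49, while the sum of all odd numbers below 100 is 2500. Neither matches the 1683 stated in the model's final answer, so this log would need correcting before being reused as training data.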

xiaohangguo commented 10 months ago

from lagent.agents import ReAct
from lagent.actions import ActionExecutor, GoogleSearch, PythonInterpreter
from lagent.llms import HFTransformer

# Initialize the HFTransformer-based Language Model (llm) and provide the model name.
llm = HFTransformer('/public/home/lvshuhang/model_space/workspace/internlm_internlm-chat-7b')

# Initialize the Google Search tool and provide your API key.
# search_tool = GoogleSearch(api_key='Your SERPER_API_KEY')

# Initialize the Python Interpreter tool.
python_interpreter = PythonInterpreter()

# Create a chatbot by configuring the ReAct agent.
chatbot = ReAct(
    llm=llm,  # Provide the Language Model instance.
    action_executor=ActionExecutor(
        actions=[python_interpreter]  # Specify the actions the chatbot can perform.
    ),
)
# Ask the chatbot to compute a sum with Python.
# Query: "Use Python to compute the sum of the odd numbers below 10"
response = chatbot.chat('用python帮我计算实现10以内的奇数求和')
print(response.inner_steps)
print(response.response)

[{'role': 'user', 'content': '用python帮我计算实现10以内的奇数求和'}, 
 {'role': 'assistant', 'content': 'Thought:这是一道计算题，需要用计算器Calculator计算一下10的奇数之和\nAction: PythonExecutor\nAction Input: def solution():\n    answer = 1+3+5+7+9\n    return answer'}, 
 {'role': 'system', 'content': 'Response:def solution():\n    answer = 1+3+5+7+9\n    return answer\n'}, 
 {'role': 'assistant', 'content': 'Thought: Base on the result of the code, the answer is:\nFinal Answer:10以内的奇数之和为25。'}]

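Printing inner_steps first and then response seems to avoid the repetition. However, I still don't understand why the system message in the second turn repeats the assistant's output from the first turn. In the end, the "system": "xx" part is probably "system": "Response:xxx", but I'm still not sure how this response should be written.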