THUDM / GLM-4

GLM-4 series: Open Multilingual Multimodal Chat LMs | 开源多语言多模态对话模型
Apache License 2.0
4.88k stars 403 forks source link

用给的示示例数据tools的数据微调,后面自动多了一个Tools:None,数据处理报异常 #576

Open mudeguo opened 1 week ago

mudeguo commented 1 week ago

System Info / 系統信息

python3.12,transformer 43;gpu 2080Ti22g*2

Who can help? / 谁可以帮助到您?

No response

Information / 问题信息

Reproduction / 复现过程

batchjed_conv: 50 conv: [{'role': 'system', 'content': '', 'tools': [{'type': 'function', 'function': {'name': 'get_recommended_books', 'description': "Get recommended books based on user's interests1", 'parameters': {'type': 'object', 'properties': {'interests': {'type': 'array', 'items': {'type': 'string'}, 'description': 'The interests to recommend books for'}}, 'required': ['interests']}}}]}, {'role': 'user', 'content': 'Hi, I am looking for some book recommendations. I am interested in history and science fiction.', 'tools': None}, {'role': 'assistant', 'content': '{"name": "get_recommended_books", "arguments": {"interests": ["history", "science fiction"]}}', 'tools': None}, {'role': 'observation', 'content': '{"books": ["Sapiens: A Brief History of Humankind by Yuval Noah Harari", "A Brief History of Time by Stephen Hawking", "Dune by Frank Herbert", "The Martian by Andy Weir"]}', 'tools': None}, {'role': 'assistant', 'content': 'Based on your interests in history and science fiction, I would recommend the following books: "Sapiens: A Brief History of Humankind" by Yuval Noah Harari, "A Brief History of Time" by Stephen Hawking, "Dune" by Frank Herbert, and "The Martian" by Andy Weir.', 'tools': None}] Map: 0%| | 0/50 00:00<?, ? examples/s: ╭────────────────────────────────────────────────────────────────────────────────────────────────── Traceback (most recent call last) ──────────────────────────────────────────────────────────────────────────────────────────────────╮ rank1: │ /data/projects/GLM-4/finetune_demo/finetune.py:419 in main │ rank1: │ │ rank1: │ 416 │ tokenizer, model = load_tokenizer_and_model(model_dir, peft_config=ft_config.peft_co │ rank1: │ 417 │ data_manager = DataManager(data_dir, ft_config.data_config) │ rank1: │ 418 │ │ rank1: │ ❱ 419 │ train_dataset = data_manager.get_dataset( │ rank1: │ 420 │ │ Split.TRAIN, │ rank1: │ 421 │ │ functools.partial( │ rank1: │ 422 │ │ │ process_batch, │

Expected behavior / 期待表现

希望能提供tools微调时的jsonl文件,能够跑通不报错。 trouble

sixsixcoder commented 6 days ago

这里有微调模板,https://zhipu-ai.feishu.cn/wiki/L1jpwBEqCiHocmkT3VzcQv5Znrg

mudeguo commented 2 days ago

这里有微调模板,https://zhipu-ai.feishu.cn/wiki/L1jpwBEqCiHocmkT3VzcQv5Znrg

就是用这个模版的数据做成train.jsonl, 一条复制了几百条,不知是否是因为数据一样的问题。如果官方给个jsonl文件就完美了。