OpenMOSS / CoLLiE

Collaborative Training of Large Language Models in an Efficient Way
https://openlmlab-collie.readthedocs.io
Apache License 2.0
405 stars, 58 forks

feat(dataset): add multi-turn dataset with template #162

Closed. KaiLv69 closed this 5 months ago

KaiLv69 commented 5 months ago

As the title says, this adds a multi-turn dataset with a chat template for training.

WillQvQ commented 5 months ago

The prepare_chatml_messages part has a few bugs. Here is the portion I modified, for reference:

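prepared_messages = []
# The BOS token is context only, so it is excluded from the loss.
prepared_messages.append({"content": special_tokens_map['bos_token'], "require_loss": False})
for message in messages['history']:
    if message['role'] == "assistant":
        # Role header: not supervised.
        prepared_messages.append({"content": '<|im_start|>' + message['role'] + '\n', "require_loss": False})
        # Reply text plus the end-of-turn marker: the only supervised segment.
        prepared_messages.append({"content": message['content'] + '<|im_end|>', "require_loss": True})
        # Newline separating this turn from the next: not supervised.
        prepared_messages.append({"content": '\n', "require_loss": False})
    else:
        # Non-assistant turns (e.g. user) are context only.
        prepared_messages.append(
            {"content": f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n", "require_loss": False})
if add_generation_prompt:
    # Open an assistant turn so the model continues from this point.
    prepared_messages.append({"content": '<|im_start|>assistant\n', "require_loss": False})
return prepared_messages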
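
To make the effect of the require_loss flags concrete, here is a minimal, self-contained sketch of how the segments produced by the snippet above could be turned into input ids and a label mask. Everything not in the snippet is an assumption for illustration: the function signature, the toy_tokenize stand-in (one id per character, not a real tokenizer), the build_inputs_and_labels helper, and the use of -100 as the ignore label. It is not CoLLiE's actual dataset code.

```python
from typing import Dict, List, Tuple

IGNORE_INDEX = -100  # assumed "ignore" label; a common convention, not taken from the PR


def prepare_chatml_messages(messages: Dict, special_tokens_map: Dict[str, str],
                            add_generation_prompt: bool = False) -> List[Dict]:
    """Hypothetical wrapper around the reviewer's snippet; the signature is assumed."""
    prepared = [{"content": special_tokens_map["bos_token"], "require_loss": False}]
    for message in messages["history"]:
        if message["role"] == "assistant":
            prepared.append({"content": "<|im_start|>" + message["role"] + "\n", "require_loss": False})
            prepared.append({"content": message["content"] + "<|im_end|>", "require_loss": True})
            prepared.append({"content": "\n", "require_loss": False})
        else:
            prepared.append({
                "content": f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n",
                "require_loss": False,
            })
    if add_generation_prompt:
        prepared.append({"content": "<|im_start|>assistant\n", "require_loss": False})
    return prepared


def toy_tokenize(text: str) -> List[int]:
    """Stand-in tokenizer: one id per character. A real tokenizer would go here."""
    return [ord(ch) for ch in text]


def build_inputs_and_labels(segments: List[Dict]) -> Tuple[List[int], List[int]]:
    """Tokenize each segment separately and mask out segments that need no loss."""
    input_ids: List[int] = []
    labels: List[int] = []
    for segment in segments:
        ids = toy_tokenize(segment["content"])
        input_ids.extend(ids)
        labels.extend(ids if segment["require_loss"] else [IGNORE_INDEX] * len(ids))
    return input_ids, labels


example = {"history": [
    {"role": "user", "content": "Hi"},
    {"role": "assistant", "content": "Hello!"},
]}
segments = prepare_chatml_messages(example, {"bos_token": "<s>"})
input_ids, labels = build_inputs_and_labels(segments)

# Recover the text that actually contributes to the loss.
supervised_text = "".join(chr(t) for t in labels if t != IGNORE_INDEX)
full_text = "".join(seg["content"] for seg in segments)
print(repr(supervised_text))
```

With this toy setup, only the assistant reply and its <|im_end|> marker are supervised; the BOS token, the user turn, the role header, and the separating newline are all masked out.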