QwenLM / Qwen2

Qwen2 is the large language model series developed by Qwen team, Alibaba Cloud.

💡 [REQUEST] - Finetuning data format question #424

Closed lzl-mt closed 2 months ago

lzl-mt commented 3 months ago

Start Date

No response

Implementation PR

No response

Reference Issues

No response

Summary

Hello, I would like to confirm the expected format of the finetuning data. In https://qwen.readthedocs.io/en/latest/training/SFT/example.html, the data format looks like this: [image]

However, in this example, https://github.com/alibaba/Pai-Megatron-Patch/blob/main/examples/qwen1_5/README.md#Megatron-LM-Dense%E6%A8%A1%E5%9E%8B%E8%AE%AD%E7%BB%83%E6%B5%81%E7%A8%8B, the dataset is downloaded and extracted like this: [image]

Thanks!

Basic Example

No

Drawbacks

No

Unresolved Questions

No response

jklj077 commented 3 months ago

Hi!

You are right, the two formats are indeed different!

The example at https://qwen.readthedocs.io/en/latest/training/SFT/example.html is maintained by us, and the finetune.py script is also in this repo. Since source data can come in many different shapes, we require it to be organized in a format similar to the OpenAI API, which is versatile and widely used in the community.
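As a rough illustration of what an OpenAI-style record looks like, here is a minimal Python sketch that builds one training example and serializes it as a JSON line. The field names (`messages`, `role`, `content`) follow the OpenAI chat convention; the exact schema expected by finetune.py (e.g. any extra wrapper fields) should be checked against the SFT example page linked above, and the sample conversation is invented for illustration.

```python
import json

# A single training example in an OpenAI-style "messages" format.
# The conversation content here is a made-up placeholder.
record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]
}

# Finetuning datasets are commonly stored as JSON Lines:
# one such object per line in the data file.
line = json.dumps(record, ensure_ascii=False)
print(line)
```

Keeping each example as a self-contained message list makes it easy to convert data from other sources: whatever the original format, it only needs to be mapped onto role/content pairs.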

The other example you linked is maintained by PAI (a different team at Alibaba Cloud) and does not apply to our codebase.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as inactive due to lack of recent activity. Should you believe it remains unresolved and warrants attention, kindly leave a comment on this thread.