QwenLM / Qwen2.5-Coder

Qwen2.5-Coder is the code version of Qwen2.5, the large language model series developed by the Qwen team, Alibaba Cloud.

Question about how FIM data is split for pretraining #122

Closed boshi950912 closed 4 weeks ago

boshi950912 commented 1 month ago

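Hello, for the FIM task data, is the code split at random positions, or is a rule applied when splitting, for example cutting at special tokens such as '\n' or ':'?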

cyente commented 1 month ago

You can refer to our technical report.

https://arxiv.org/abs/2409.12186

shibo950912 commented 1 month ago

> You can refer to our technical report.
>
> https://arxiv.org/abs/2409.12186

Hello, what I want to ask is how the start and end positions of the fim_middle segment are determined. Are they chosen at random?

cyente commented 1 month ago

The middle part refers to the blank section of the code scripts that requires completion.

shibo950912 commented 1 month ago

> The middle part refers to the blank section of the code scripts that requires completion.

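When constructing FIM data for pretraining, how are the start and end positions of the fim_middle segment determined? Is the code cut at random character positions?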

cyente commented 1 month ago

How to segment code for FIM training is a topic well worth discussing. I'm not sure which way of splitting gives better results; it's worth exploring.
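
For readers following this thread, the two options raised in the question (cutting at random character positions versus cutting only at a delimiter such as '\n') can be sketched as below. This is a minimal illustration, not the Qwen team's confirmed method: the thread leaves the actual splitting rule unstated. The function names `split_random`, `split_on_newlines`, and `to_fim_sample` are hypothetical, and the prefix/suffix/middle token order is assumed from the FIM format described in the technical report linked above.

```python
import random

# FIM special tokens; the prefix -> suffix -> middle layout is an assumption
# taken from the technical report, not something confirmed in this thread.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def split_random(code, rng):
    """Option 1: pick two cut points uniformly over character offsets."""
    lo, hi = sorted(rng.randint(0, len(code)) for _ in range(2))
    return code[:lo], code[lo:hi], code[hi:]


def split_on_newlines(code, rng):
    """Option 2: pick two cut points, but only at line boundaries."""
    # Candidate offsets: start of text, just after each newline, end of text.
    boundaries = [0] + [i + 1 for i, ch in enumerate(code) if ch == "\n"]
    if boundaries[-1] != len(code):
        boundaries.append(len(code))
    lo, hi = sorted(rng.choice(boundaries) for _ in range(2))
    return code[:lo], code[lo:hi], code[hi:]


def to_fim_sample(prefix, middle, suffix):
    """Arrange the three pieces so the middle is predicted last."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


# Hypothetical usage on a tiny snippet.
rng = random.Random(0)
code = "def add(a, b):\n    return a + b\n"
prefix, middle, suffix = split_on_newlines(code, rng)
print(to_fim_sample(prefix, middle, suffix))
```

Both functions always return pieces that concatenate back to the original text, so either one can feed `to_fim_sample`. Which strategy trains a better model is exactly the open question in the last reply.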