deepseek-ai / DeepSeek-Coder

DeepSeek Coder: Let the Code Write Itself
https://coder.deepseek.com/
MIT License

Construction of the FIM training data #107

Open shatealaboxiaowang opened 5 months ago

shatealaboxiaowang commented 5 months ago

Hi,

Thank you very much for open-sourcing this work. Will the code for constructing the FIM training data be made public, in particular how the length (lines or characters) of the prefix, suffix, and middle segments is chosen? We would like to build on your model and fine-tune it on our own code repositories, especially to improve FIM performance on our internal code.

Thanks.

pkuzqh commented 5 months ago

See https://github.com/EleutherAI/gpt-neox/blob/FIM-clean/megatron/data/gpt2_dataset.py#L339. We split documents at the character level.
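
For reference, a minimal character-level FIM transform in the PSM (prefix-suffix-middle) layout might look like the sketch below. The sentinel strings follow the code-insertion example in the DeepSeek-Coder README; the split logic is a simplified approximation of the gpt-neox code linked above, and the 0.5 FIM rate is a common default rather than a confirmed training setting.

```python
import random

# Sentinel strings as shown in the DeepSeek-Coder README's code-insertion
# example; verify them against the tokenizer of the checkpoint you use.
FIM_BEGIN = "<｜fim▁begin｜>"
FIM_HOLE = "<｜fim▁hole｜>"
FIM_END = "<｜fim▁end｜>"

def apply_fim(code: str, fim_rate: float = 0.5) -> str:
    """Rewrite a training sample into PSM (prefix-suffix-middle) order
    with probability `fim_rate`; otherwise return it unchanged."""
    if len(code) < 2 or random.random() > fim_rate:
        return code
    # Choose two distinct random character offsets and split there.
    # Splitting on characters (not tokens) lets the middle start or end
    # anywhere, matching arbitrary cursor positions at inference time.
    lo, hi = sorted(random.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:lo], code[lo:hi], code[hi:]
    # The model conditions on prefix + suffix and learns to emit the middle.
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}{middle}"
```

In practice an EOS token is appended after the middle so the model learns when to stop infilling; whether the loss is computed on the middle only or on the whole sequence is a choice of your training framework.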

shatealaboxiaowang commented 5 months ago

> See https://github.com/EleutherAI/gpt-neox/blob/FIM-clean/megatron/data/gpt2_dataset.py#L339. We split documents at the character level.

Thanks, I will look at it.

mikelpzm commented 4 months ago

@shatealaboxiaowang were you able to construct a proper FIM dataset?

allenliu88 commented 1 month ago

> > See https://github.com/EleutherAI/gpt-neox/blob/FIM-clean/megatron/data/gpt2_dataset.py#L339. We split documents at the character level.
>
> Thanks, I will look at it.

Did you manage to fine-tune the model for FIM successfully? What does your FIM dataset look like? Can you share your solution? Thanks very much.