bigcode-project / octopack

🐙 OctoPack: Instruction Tuning Code Large Language Models
https://arxiv.org/abs/2308.07124
MIT License

What is the training data format of commitpack-ft and oasst when fine-tuning codegeex2? #9

Open sxthunder opened 1 year ago

sxthunder commented 1 year ago

In your paper, CommitPack is trained with the following format: Question: xxx Answer: xxx

but CodeGeeX2's vocabulary has no such special token added. I downloaded the OctoGeeX checkpoint and ran prediction with this format, but the answers are wrong.

Can you explain more specifically how you convert commitpack_ft and oasst into the fine-tuning data format? (What is the input and what is the output?)

Thanks

Muennighoff commented 1 year ago

We format samples into a simple Q & A format for OctoCoder & OctoGeeX:

For CommitPackFT:

Question: {subject}
{old_contents}

Answer: 
{new_contents}

For OASST:

Question: {input}

Answer: 
{output}

So we do not rely on any special tokens. We only use those special tokens for pretraining / fine-tuning on StarCoder & SantaCoder in the appendix. Let me know if something is unclear!
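
For concreteness, here is a minimal sketch of how the two templates above could be assembled in Python. The field names mirror the placeholders in the templates; the helper function names are hypothetical and this is not code taken from the repository.

```python
# Hypothetical helpers that assemble the Q & A templates shown above.
# Field names mirror the placeholders; they may not match the raw dataset columns exactly.

def format_commitpackft(sample: dict) -> str:
    """CommitPackFT: the commit subject plus the old file contents form the question."""
    prompt = f"Question: {sample['subject']}\n{sample['old_contents']}\n\nAnswer:\n"
    return prompt + sample["new_contents"]


def format_oasst(sample: dict) -> str:
    """OASST: the user turn is the question, the assistant turn is the answer."""
    prompt = f"Question: {sample['input']}\n\nAnswer:\n"
    return prompt + sample["output"]
```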

sxthunder commented 1 year ago


Thank you!

sxthunder commented 1 year ago

I have two other questions:

  1. In your script ./finetuning/starcoder/finetune.py, I noticed that training samples are concatenated directly without padding, as in the pretraining stage. This is different from many fine-tuning scripts. Is this just to speed up training?
  2. Instruction tuning a pretrained code model enables it to understand human instructions and improves its scores on many benchmarks. But for code completion in an IDE environment (like Copilot or CodeGeeX), which kind of model is more suitable: a pretrained or an instruction-tuned one?

Muennighoff commented 1 year ago

  1. Yes, this is called packing. It makes training more efficient; a minimal sketch of the idea is shown after this list.
  2. For code completion in your IDE, where you just want suggestions to directly continue your code, a pretrained model is likely more suitable, i.e. I would recommend StarCoder, not OctoCoder, in that case. However, if you want a model to do something specific for you, such as "Write a function to do bubble sort", I'd recommend OctoCoder. You might be able to get StarCoder to do it via comments, but then it might just end up writing # pass in the code or fail in other ways. Further, if you want to edit code or explain code, I'd also recommend OctoCoder.
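
To make the packing idea concrete, here is a hypothetical sketch of concatenating tokenized samples into fixed-length blocks with no padding. This is not the exact logic of finetuning/starcoder/finetune.py; the function name, the sequence length, and the use of a Hugging Face tokenizer are assumptions for illustration.

```python
# Hypothetical illustration of "packing": tokenize the formatted samples,
# concatenate them into one token stream, and cut that stream into
# fixed-length blocks so no padding tokens are needed.

def pack_examples(texts, tokenizer, seq_length=2048):
    """Yield lists of token ids, each exactly seq_length long."""
    buffer = []
    for text in texts:
        # An EOS token marks the boundary between examples.
        buffer.extend(tokenizer(text)["input_ids"] + [tokenizer.eos_token_id])
        while len(buffer) >= seq_length:
            yield buffer[:seq_length]
            buffer = buffer[seq_length:]
    # Any leftover tokens shorter than seq_length are dropped in this sketch.
```
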
sxthunder commented 1 year ago

Sorry to bother you again:

  1. In the README.md, it says "OctoGeeX is finetuned based on [CodeGeeX2-6B](https://huggingface.co/THUDM/codegeex2-6b) using an internal training framework." Is there any plan to open-source this part? Can finetuning/starcoder/finetune.py train the same model?
  2. In OctoGeeX's training hyperparameters, it shows that OctoGeeX only trains for 50 steps, but commitpack_ft has nearly 0.7M samples. Is this a mistake?

Muennighoff commented 1 year ago


Any questions are very welcome!

  1. Unfortunately, we cannot open-source that framework; however, finetuning/starcoder/finetune.py should be able to train the same model.
  2. Yes, we found that performance plateaus after a few steps; we thus only use a subset of CommitPackFT. For OctoGeeX, the exact dataset used for fine-tuning is uploaded here: https://huggingface.co/datasets/bigcode/co-manual (a short loading sketch is shown below).
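
A minimal sketch for pulling that dataset with the Hugging Face datasets library; the split name "train" is an assumption, so check the dataset card if it differs.

```python
from datasets import load_dataset

# Load the exact OctoGeeX fine-tuning data from the Hugging Face Hub.
# The split name "train" is an assumption; inspect the dataset card to confirm.
ds = load_dataset("bigcode/co-manual", split="train")
print(len(ds), ds[0])
```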