sxthunder opened this issue 1 year ago
We format samples into a simple Q&A format for OctoCoder & OctoGeeX.

For CommitPackFT:

```
Question: {subject}
{old_contents}
Answer:
{new_contents}
```

For OASST:

```
Question: {input}
Answer:
{output}
```

So we do not rely on any special tokens. We only use those special tokens for pretraining / fine-tuning on StarCoder & SantaCoder in the appendix. Let me know if something is unclear!
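For concreteness, here is a minimal sketch of that formatting in Python. The field names (`subject`, `old_contents`, `new_contents`, `input`, `output`) come from the templates above; the exact whitespace and preprocessing in the actual training code may differ:

```python
def format_commitpackft(sample: dict) -> str:
    """Render a CommitPackFT sample in the Q&A format above."""
    return (
        f"Question: {sample['subject']}\n"
        f"{sample['old_contents']}\n"
        "Answer:\n"
        f"{sample['new_contents']}"
    )


def format_oasst(sample: dict) -> str:
    """Render an OASST sample in the Q&A format above."""
    return f"Question: {sample['input']}\nAnswer:\n{sample['output']}"
```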
Thank you!
I have two other questions.
Further, if you want to edit code or explain code, I'd also recommend OctoCoder.
Sorry to bother you again:
- In the README.md, it says "OctoGeeX is finetuned based on [CodeGeeX2-6B](https://huggingface.co/THUDM/codegeex2-6b) using an internal training framework." Is there any plan to open-source this part? Can finetuning/starcoder/finetune.py train the same model?
- In OctoGeeX's training hyperparameters, it shows that OctoGeeX trains for only 50 steps, but CommitPackFT has nearly 0.7M samples. Is this a mistake?
Any questions are very welcome!
`finetuning/starcoder/finetune.py` should be able to train the same model.
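For anyone trying this end to end, here is a minimal inference sketch using the Q&A format above. It is an assumption on my part, not an official snippet from the repo: it presumes the `bigcode/octocoder` checkpoint on the Hugging Face Hub and the standard `transformers` API:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint name on the Hugging Face Hub.
checkpoint = "bigcode/octocoder"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# The prompt ends with "Answer:" so the model completes the answer,
# mirroring the fine-tuning format described above.
prompt = "Question: Write a Python function that reverses a string.\nAnswer:\n"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```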
In your paper, CommitPack is trained with the following format:

```
Question: xxx
Answer: xxx
```

but no such special token is added to CodeGeeX2's vocabulary. I downloaded the OctoGeeX checkpoint and ran prediction with this format, but the answer is wrong.
Can you explain more specifically how you convert commitpack_ft and oasst into the fine-tuning data format (what is the input and what is the output)?
Thanks