bigcode-project / octopack

🐙 OctoPack: Instruction Tuning Code Large Language Models
https://arxiv.org/abs/2308.07124
MIT License

Line-diff script mentioned in appendix H #25

Closed · JiyangZhang closed this issue 8 months ago

JiyangZhang commented 8 months ago

Hi,

Thanks for the great work!

I am wondering if you could provide or point me to the script you used to produce the line-level diff from the buggy and fixed code for the code repair task. That would be of great help to me!

Thanks

JiyangZhang commented 8 months ago

I am wondering if this is the function used to produce the line diff format: https://github.com/bigcode-project/octopack/blob/e885822b910099a96fc1611f51e6479e8ae81578/finetuning/santacoder/finetune.py#L168

SivilTaram commented 8 months ago

@JiyangZhang Thanks for your interest in our work! Yes, that's the function that generates the line diff format. Feel free to ask follow-up questions!
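
In case it helps, the core idea is to keep only the changed lines together with their line numbers. Here is a minimal sketch using Python's difflib; the exact markers and formatting in finetune.py may differ:

```python
import difflib

def line_diff(code_before: str, code_after: str) -> str:
    """Produce a compact line-level diff between two code snippets.

    Illustrative sketch only; the actual formatting in
    finetuning/santacoder/finetune.py may use different markers.
    """
    before_lines = code_before.splitlines()
    after_lines = code_after.splitlines()
    matcher = difflib.SequenceMatcher(a=before_lines, b=after_lines)
    diff = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged lines are dropped to keep the target short
        for i in range(i1, i2):
            diff.append(f"- {i + 1} {before_lines[i]}")
        for j in range(j1, j2):
            diff.append(f"+ {j + 1} {after_lines[j]}")
    return "\n".join(diff)

print(line_diff("x = 1\ny = 2\n", "x = 1\ny = 3\n"))
# - 2 y = 2
# + 2 y = 3
```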

JiyangZhang commented 8 months ago

Thanks for your quick response! I have another question about Table 10 in Appendix G. What is the SantaCoder (131/236B tokens) Instruct Format? Is it the model that you pretrained on COMMITPACK with the input format 'Question: Answer:'? In that case, it should have seen Go, C++, and Rust, right?

Best

huybery commented 8 months ago

@JiyangZhang Thanks for your attention.

  1. SantaCoderPack is a model with the same architecture as SantaCoder, pre-trained on CommitPack using this format: <commit_before>code_before<commit_msg>message<commit_after>code_after (a small serialization sketch is shown after this list).
  2. It saw Python, JavaScript, Java, C++, Go, and Rust during pretraining.
  3. Btw, here is the model card for SantaCoderPack: https://huggingface.co/bigcode/santacoderpack
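
For concreteness, here is a rough sketch of how a single commit could be serialized into that pre-training format (illustrative only; the real preprocessing may add truncation and extra special-token handling):

```python
def to_commitpack_sample(code_before: str, message: str, code_after: str) -> str:
    """Serialize one commit into the <commit_before>/<commit_msg>/<commit_after> format.

    Illustrative sketch; the actual CommitPack preprocessing may differ in details.
    """
    return (
        f"<commit_before>{code_before}"
        f"<commit_msg>{message}"
        f"<commit_after>{code_after}"
    )

sample = to_commitpack_sample(
    code_before="def add(a, b):\n    return a - b\n",
    message="Fix addition operator",
    code_after="def add(a, b):\n    return a + b\n",
)
print(sample)
```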

JiyangZhang commented 8 months ago

Hi @huybery, thanks for the reply. I was actually asking about the first two rows of Table 10, shown in the screenshot below. What were the training data and the format? (screenshot of Table 10 attached)

huybery commented 8 months ago

@JiyangZhang Oh, sorry for misunderstanding your question. The SantaCoder results in the table come from directly evaluating the model without any additional training. It saw only three languages (Python, JavaScript, Java) in pre-training (https://arxiv.org/abs/2301.03988). The instruct format can be found here: https://github.com/bigcode-project/bigcode-evaluation-harness/blob/main/bigcode_eval/tasks/humanevalpack.py#L210
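
For reference, the Question/Answer instruct prompt looks roughly like the following (simplified; see the linked humanevalpack.py line for the exact template and model-specific variants used in the harness):

```python
def build_instruct_prompt(instruction: str, context: str = "") -> str:
    """Build a simple Question/Answer prompt for HumanEvalPack-style evaluation.

    Simplified sketch; the exact template lives in humanevalpack.py (linked above).
    """
    question = f"{context}\n{instruction}" if context else instruction
    return f"Question: {question}\n\nAnswer:\n"

print(build_instruct_prompt(
    instruction="Fix the bug in the following function.",
    context="def add(a, b):\n    return a - b",
))
```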

JiyangZhang commented 8 months ago

great, thank you!

JiyangZhang commented 8 months ago

Hi @huybery and @SivilTaram,

Sorry for bothering you again! I am excited about the line-diff SantaCoder model and was trying to reproduce the results in Table 11. I am wondering which dataset you used for fine-tuning SantaCoder. Based on the description in the paper, you used the (Java, JavaScript, Python) subset of COMMITPACKFT, which after processing it myself gives me ~129K rows. However, here there are 2.6M rows of examples.
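
A rough sketch of the per-language filtering I mean (assuming COMMITPACKFT exposes one config per language on the Hub; the exact config names are on the dataset card):

```python
from datasets import load_dataset, concatenate_datasets

# Sketch: collect the Java / JavaScript / Python portions of COMMITPACKFT.
# Config names are assumptions; check the dataset card on the Hub.
subsets = [
    load_dataset("bigcode/commitpackft", lang, split="train")
    for lang in ("java", "javascript", "python")
]
combined = concatenate_datasets(subsets)
print(len(combined))  # this gives me roughly 129K rows
```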

Could you point me to the finetuning dataset you used? I would really appreciate it!

Best.

SivilTaram commented 8 months ago

@JiyangZhang Hi Jiyang, based on my experimental logs (and the code released in this repo), it should have been fine-tuned on https://huggingface.co/datasets/bigcode/commits-pjj-2048. That dataset is a larger version compared to the final COMMITPACKFT. Sorry for the name confusion: bigcode/commits-pjj-diff is an early version with CarperAI's diff format, if I remember correctly. We do not release a separate line-diff-format dataset on the Hub.
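
For anyone reproducing this, loading that dataset with the datasets library is straightforward (the split and column names should be taken from the dataset card; this is just a sketch):

```python
from datasets import load_dataset

# Sketch: load the fine-tuning data referenced above; inspect the rows and
# columns against the dataset card on the Hub before training.
ds = load_dataset("bigcode/commits-pjj-2048", split="train")
print(ds)            # number of rows and column names
print(ds[0].keys())  # fields such as the before/after file contents and commit message
```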

JiyangZhang commented 8 months ago

Thanks for your response!