bigcode-project / octopack

🐙 OctoPack: Instruction Tuning Code Large Language Models
https://arxiv.org/abs/2308.07124
MIT License

Reproduce line-diff SantaCoder fine-tuned on CommitPackFT (Table 11) results #26

JiyangZhang closed 8 months ago

JiyangZhang commented 8 months ago

Hi,

I am excited about the line-diff SantaCoder model and am trying to reproduce the results in Table 11 in the appendix!

Which dataset did you use for fine-tuning SantaCoder? Based on the description in the paper, you used the Java, JS, and Python subset of COMMITPACKFT, which gives me ~129K rows when I process it myself. In the finetune.sh script in the repository, the dataset is named "bigcode/commits-pjj-2048".

Could you please share the bigcode/commits-pjj-2048 dataset? I would really appreciate it!

Best.

Muennighoff commented 8 months ago

Just made bigcode/commits-pjj-2048 public: https://huggingface.co/datasets/bigcode/commits-pjj-2048. It should be roughly the same as the Java, JS, and Python subset of CommitPackFT.

JiyangZhang commented 8 months ago

Thank you very much!

Actually, if I sum the number of examples for Java, JS, and Python in CommitPackFT, I get ~129K. This number can be computed from the dataset (https://huggingface.co/datasets/bigcode/commitpackft) and also from Appendix C in the paper. The dataset you shared contains ~1.8M examples, but CommitPackFT has only 702K examples in total.

Maybe something is off here?
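
For anyone who wants to redo this count, here is a minimal sketch of the per-language tally. It assumes each row exposes its language under a `lang` key and that the three languages are labelled "Java", "JavaScript", and "Python"; both are assumptions to check against the actual dataset. The rows below are toy placeholders, not real data.

```python
from collections import Counter
from typing import Iterable, Mapping, Sequence

# Assumed language labels (lower-cased); adjust to the dataset's real values.
TARGET_LANGS = ("java", "javascript", "python")


def count_by_language(rows: Iterable[Mapping[str, str]], lang_key: str = "lang") -> Counter:
    """Count rows per language, ignoring case in the language label."""
    return Counter(str(row[lang_key]).lower() for row in rows)


def subset_total(counts: Mapping[str, int], langs: Sequence[str] = TARGET_LANGS) -> int:
    """Sum the row counts of the selected languages."""
    return sum(counts.get(lang, 0) for lang in langs)


# Toy rows for illustration only; real rows would come from the dataset itself.
toy_rows = [
    {"lang": "Java"},
    {"lang": "JavaScript"},
    {"lang": "Python"},
    {"lang": "python"},
    {"lang": "Ruby"},
]

counts = count_by_language(toy_rows)
print(dict(counts))
print("Java + JS + Python rows:", subset_total(counts))
```

Running the same two functions over both datasets should show directly whether the gap between ~129K and ~1.8M comes from these three languages alone.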

Muennighoff commented 8 months ago

> Thank you very much!
>
> Actually, if I sum the number of examples of Java, Js and Py of CommitPackFT, it gives me ~129K. This number can be computed from the dataset: https://huggingface.co/datasets/bigcode/commitpackft and Appendix C in the paper as well. (The dataset you shared contains ~1.8M examples but CommitPackFT only has 702K examples.)
>
> Maybe something is off here?

I think we used an earlier version of CommitPackFT with less strict filters for that one. So it corresponds to the current filtering with a few filters removed from https://github.com/bigcode-project/octopack/blob/main/dataset/commitpackft/commitpackft_filters1.py and https://github.com/bigcode-project/octopack/blob/main/dataset/commitpackft/commitpackft_filters2.py. One can probably figure out which ones were removed by looking at the samples present.
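
A rough sketch of how one might check this: run each candidate filter over the released samples and count how many rows it would reject. A filter that rejects rows still present in the dataset was probably not applied to that version. The two predicates and the `message` field below are hypothetical stand-ins, not the actual filters from the scripts linked above.

```python
from typing import Callable, Dict, Iterable, List, Mapping

Row = Mapping[str, str]
RowFilter = Callable[[Row], bool]  # returns True when the row passes the filter


def failure_counts(rows: Iterable[Row], filters: Mapping[str, RowFilter]) -> Dict[str, int]:
    """Count, per filter, how many rows would have been rejected."""
    fails = {name: 0 for name in filters}
    for row in rows:
        for name, passes in filters.items():
            if not passes(row):
                fails[name] += 1
    return fails


def likely_removed(rows: Iterable[Row], filters: Mapping[str, RowFilter]) -> List[str]:
    """Names of filters that reject at least one row present in the data."""
    return sorted(name for name, n in failure_counts(rows, filters).items() if n > 0)


# Hypothetical filters for illustration; swap in the real ones from the repo.
candidate_filters: Dict[str, RowFilter] = {
    "min_three_words": lambda row: len(row["message"].split()) >= 3,
    "non_empty_message": lambda row: bool(row["message"].strip()),
}

# Toy samples, not real dataset rows.
toy_samples = [
    {"message": "Add parser tests"},
    {"message": "fix"},
    {"message": "Update README for new API"},
]

print(failure_counts(toy_samples, candidate_filters))
print(likely_removed(toy_samples, candidate_filters))
```

A filter with zero rejections is not proven to have been applied; it only means the samples do not contradict it.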

Also cc @SivilTaram