Closed JiyangZhang closed 8 months ago
Just made bigcode/commits-pjj-2048 public: https://huggingface.co/datasets/bigcode/commits-pjj-2048
But yeah, it should be ~the same as the Java, JS, and Python subsets of CommitPackFT.
Thank you very much!
Actually, if I sum the number of examples for Java, JS, and Python in CommitPackFT, I get ~129K. This number can be computed from the dataset (https://huggingface.co/datasets/bigcode/commitpackft) and from Appendix C of the paper. The dataset you shared contains ~1.8M examples, but CommitPackFT only has 702K examples.
Maybe something is off here?
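For anyone who wants to double-check the per-language count, here is a minimal sketch of the tally. It assumes each row is a dict with a `lang` field (an assumption about the row schema), and the rows below are toy data rather than real CommitPackFT samples; loading the actual dataset from the Hub is left out.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_language(rows: Iterable[Mapping[str, str]],
                      languages: Iterable[str]) -> Counter:
    """Count rows whose `lang` field matches one of `languages` (case-insensitive)."""
    wanted = {lang.lower() for lang in languages}
    counts: Counter = Counter()
    for row in rows:
        # `lang` is an assumed field name; adjust it to the real schema.
        lang = str(row.get("lang", "")).lower()
        if lang in wanted:
            counts[lang] += 1
    return counts


# Toy rows for illustration only, not real dataset content.
toy_rows = [
    {"lang": "Python"},
    {"lang": "Java"},
    {"lang": "JavaScript"},
    {"lang": "Rust"},
]

counts = count_by_language(toy_rows, ["Java", "JavaScript", "Python"])
total = sum(counts.values())
print(counts, total)
```

Running the same tally over both datasets should show whether the gap comes from extra languages or from extra rows within the same three languages.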
I think we used an earlier version of CommitPackFT with less strict filters for that one. So it corresponds to a few filters removed from https://github.com/bigcode-project/octopack/blob/main/dataset/commitpackft/commitpackft_filters1.py and https://github.com/bigcode-project/octopack/blob/main/dataset/commitpackft/commitpackft_filters2.py. One can probably figure out which ones were removed by looking at the samples present.
Also cc @SivilTaram
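A rough sketch of that "look at the samples present" idea: run each candidate filter over the samples and count how many fail it. Any filter that real samples fail was presumably not applied to that dataset version. The filter names and thresholds below are hypothetical placeholders, not the actual checks in the filter scripts, and the `message` field name and toy samples are assumptions too.

```python
from typing import Callable, Dict, Iterable, Mapping

# A filter returns True when the sample would be kept.
Filter = Callable[[Mapping[str, str]], bool]


def count_filter_failures(samples: Iterable[Mapping[str, str]],
                          filters: Mapping[str, Filter]) -> Dict[str, int]:
    """For each filter, count how many samples it would have rejected."""
    failures = {name: 0 for name in filters}
    for sample in samples:
        for name, keep in filters.items():
            if not keep(sample):
                failures[name] += 1
    return failures


# Hypothetical filters for illustration; swap in the real predicates.
hypothetical_filters: Dict[str, Filter] = {
    "message_not_empty": lambda s: len(s.get("message", "").strip()) > 0,
    "message_max_10_words": lambda s: len(s.get("message", "").split()) <= 10,
}

# Toy samples, not real dataset rows.
toy_samples = [
    {"message": "Fix typo in README"},
    {"message": "Refactor the parser so that it handles nested brackets "
                "and trailing commas correctly"},
]

failures = count_filter_failures(toy_samples, hypothetical_filters)
# Filters with a nonzero failure count are candidates for having been removed.
removed_candidates = [name for name, n in failures.items() if n > 0]
print(failures, removed_candidates)
```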
Hi,
I am excited about the line-diff SantaCoder model and am trying to reproduce the results in Table 11 of the appendix!
I am wondering which dataset you used for finetuning SantaCoder. Based on the description in the paper, you used the (Java, JS, Python) subset of CommitPackFT, which gives me ~129K rows after I process it myself. The dataset name in the finetune.sh script in the repository is "bigcode/commits-pjj-2048".
Could you please share the bigcode/commits-pjj-2048 dataset? I would really appreciate it!
Best.