bigcode-project / octopack

🐙 OctoPack: Instruction Tuning Code Large Language Models
https://arxiv.org/abs/2308.07124
MIT License

Commit message from CommitPack unused? #23

Closed SeanHeelan closed 9 months ago

SeanHeelan commented 9 months ago

Hey folks,

Nice work! Something came to mind as I browsed your code: it looks like you only used the commit subject from CommitPack when training OctoCoder. Is that correct? (I'm inferring this from what you said in #9 about the format, and from the contents of this repository.)

Did you guys experiment with using the commit message as well? Or was there a reason you decided not to use it?

Thanks!

Muennighoff commented 9 months ago

Yes, that's correct. The reason is that the message is usually exactly the same as the subject. When it differs, it often includes external references, which we don't want.

You can browse samples here: https://huggingface.co/datasets/bigcode/commitpackft/viewer/kotlin?row=1
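
For anyone who wants to check the "usually exactly the same" observation themselves, here is a minimal sketch. It assumes each record is a dict with `subject` and `message` string fields, as shown in the dataset viewer. The helper name `subject_message_stats` and the sample rows are hypothetical and only there for illustration; they are not real dataset rows.

```python
def subject_message_stats(records):
    """Count records whose full message equals the subject after trimming whitespace.

    Assumes every record has string fields "subject" and "message".
    Returns a (same, total) tuple.
    """
    same = 0
    total = 0
    for rec in records:
        total += 1
        if rec["message"].strip() == rec["subject"].strip():
            same += 1
    return same, total


# Hypothetical records for illustration only, not taken from the dataset.
sample = [
    {"subject": "Fix typo in README", "message": "Fix typo in README"},
    {"subject": "Add null check", "message": "Add null check\n\nFixes #123"},
    {"subject": "Bump version", "message": "Bump version\n"},
]

same, total = subject_message_stats(sample)
print(f"{same}/{total} messages are identical to their subject")
```

To run this on real data, the same function could be given rows loaded with the Hugging Face `datasets` library, for example from the `kotlin` subset of `bigcode/commitpackft` linked above. That part is not shown here because it needs network access.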