bigcode-project / starcoder2

Home of StarCoder2!
Apache License 2.0
Can starcoder2 be trained with a different language like TCL or lisp? #11

Open cmosguy opened 6 months ago

cmosguy commented 6 months ago

Hello @loubnabnl, is it possible to get starcoder2 to learn TCL?

It was not part of the 30 languages, so I was curious whether it's worth pursuing with SFT.

Also, is there a FIM script you used for this version of starcoder2?

loubnabnl commented 6 months ago

Hi, the 15B model was trained on 600+ programming languages, including TCL. Here's the full list of languages: https://huggingface.co/datasets/bigcode/the-stack-v2/blob/main/language_stats.csv

The 7B and 3B models, though, were only trained on the 17 languages listed in the paper.

For FIM, it's similar to StarCoder: you can use this code with the right tokens (they're different from SantaCoder's; we use underscores instead of dashes).
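
To illustrate the token point, here is a minimal sketch of prefix-suffix-middle (PSM) formatting. It is not the script referred to above. The token strings `<fim_prefix>`, `<fim_suffix>`, and `<fim_middle>` are assumed from the underscore convention mentioned here, so check them against the tokenizer's special tokens before relying on them. The helper names `build_fim_prompt` and `fim_transform` are hypothetical.

```python
import random

# Assumed token strings (underscore variants); verify against the tokenizer.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Inference-time PSM prompt; the model is expected to generate the middle."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


def fim_transform(text: str, rng: random.Random) -> str:
    """Training-time sketch: cut the text at two random character offsets
    and reorder the pieces as prefix, suffix, middle."""
    lo, hi = sorted(rng.randint(0, len(text)) for _ in range(2))
    prefix, middle, suffix = text[:lo], text[lo:hi], text[hi:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


# Example: ask for the body of a small TCL proc.
prompt = build_fim_prompt("proc add {a b} {\n    return ", "\n}\n")
print(prompt)
```

For SFT on TCL, something like `fim_transform` would be applied to a fraction of the training samples, with the rest left as plain left-to-right text. The exact rate and whether splitting happens at the character or token level are choices this sketch does not settle.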