bigcode-project / starcoder

Home of StarCoder: fine-tuning & inference!

Dataset Preparation for Fine Tuning #74

Open ruchaa0112 opened 1 year ago

ruchaa0112 commented 1 year ago

Hi @loubnabnl and @ArmelRandy

Thank you for your work on StarCoder. I am interested in fine-tuning the StarCoder LLM on a Python library that the model has not been exposed to yet. While preparing the dataset for the fine-tuning job, I had a few questions:

  1. Does my data need to be in {"prompt": , "completion": } format?
    • Is there a limit on how many characters each prompt-completion pair can have?
  2. Is it possible to concatenate all of the code into a single file, with a separator between files, and use that as training data?
    • Do I need to slice the code into 1024-character chunks as part of data preprocessing?

Thank you

loubnabnl commented 1 year ago

You can find code for this type of data preparation with sequence packing (concatenation of files separated by end tokens) for finetuning here. You can change the sequence length to 8192 since StarCoder supports it.
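For reference, a minimal sketch of what that packing step looks like, assuming the Hugging Face transformers tokenizer for bigcode/starcoder and using tokenizer.eos_token_id as the file separator; `code_files` is a placeholder for your own raw source strings:

```python
from transformers import AutoTokenizer

# Minimal sketch of sequence packing: tokenize each file, append an
# end-of-text separator, concatenate everything into one token stream,
# then slice the stream into fixed-length training sequences.
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
seq_length = 8192  # StarCoder's context length


def pack_files(code_files):
    """`code_files` is a placeholder list of raw source-code strings."""
    token_stream = []
    for text in code_files:
        ids = tokenizer(text, truncation=False)["input_ids"]
        token_stream.extend(ids)
        token_stream.append(tokenizer.eos_token_id)

    # Emit complete sequences only; the leftover tail is dropped.
    for start in range(0, len(token_stream) - seq_length + 1, seq_length):
        yield {"input_ids": token_stream[start : start + seq_length]}
```

Each yielded chunk is already a fixed-length training sequence, so no separate character-level slicing is needed.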

ruchaa0112 commented 1 year ago

Thank you so much for your response @loubnabnl, and for pointing to the data-preparation link. I had a related question about data preparation. I am fine-tuning StarCoder on a proprietary Python library. Based on your experience, do you think adding 'command - explanation' examples would help?

For example, consider the Python library numpy. My dataset would contain:

prompt = "Return a sorted copy of an array."
completion = "numpy.sort(a, axis=-1, kind=None, order=None)"

Along with the above mapping, it may or may not also contain examples like:

prompt = "Sort array [[1,4],[3,1]] along the last axis."
completion = "a = np.array([[1,4],[3,1]]) \n np.sort(a)"
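For illustration, one way such pairs could be written out before packing is a JSON-lines file; this layout is an assumption made for the sketch, not something the StarCoder fine-tuning scripts require:

```python
import json

# Hypothetical prompt/completion pairs in the style described above;
# the JSONL layout is an illustration, not a StarCoder requirement.
examples = [
    {
        "prompt": "Return a sorted copy of an array.",
        "completion": "numpy.sort(a, axis=-1, kind=None, order=None)",
    },
    {
        "prompt": "Sort array [[1,4],[3,1]] along the last axis.",
        "completion": "a = np.array([[1,4],[3,1]])\nnp.sort(a)",
    },
]

with open("train.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```

Each line could then be joined into a single prompt-plus-completion text before tokenization and packing.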