Open ruchaa0112 opened 1 year ago
You can find code for this type of data preparation with sequence packing (concatenation of files separated by end tokens) for finetuning here. You can change the sequence length to 8192 since StarCoder supports it.
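The packing described above (concatenating tokenized files separated by end tokens, then slicing into fixed-length sequences) can be sketched roughly like this. This is a minimal illustration, not the repo's actual code: `EOS_TOKEN_ID` is a placeholder, and `seq_length` would be 8192 for StarCoder (kept small here so the output is readable).

```python
# Minimal sketch of sequence packing for fine-tuning data preparation.
# EOS_TOKEN_ID is a hypothetical end-of-sequence token id; real tokenizers
# expose their own (e.g. tokenizer.eos_token_id).
EOS_TOKEN_ID = 0

def pack_sequences(tokenized_files, seq_length):
    """Concatenate tokenized files, separated by an EOS token,
    then split the stream into fixed-length training sequences."""
    stream = []
    for tokens in tokenized_files:
        stream.extend(tokens)
        stream.append(EOS_TOKEN_ID)
    # Keep only full-length chunks; the trailing remainder is dropped.
    n_full = len(stream) // seq_length
    return [stream[i * seq_length:(i + 1) * seq_length] for i in range(n_full)]

files = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
packed = pack_sequences(files, seq_length=4)
# packed == [[5, 6, 7, 0], [8, 9, 0, 10], [11, 12, 13, 0]]
```

Note that packing means a training sequence can span file boundaries; the EOS separator is what lets the model learn where one file ends and the next begins.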
Thank you so much for your response @loubnabnl, and also for pointing to the data-preparation link. I had a related question regarding the data preparation. I am fine-tuning StarCoder on a proprietary Python library. Based on your experience, do you think adding 'command - explanation' examples would help?
For example, consider the Python library numpy. My dataset would contain pairs like:

**prompt** = "Return a sorted copy of an array."
**completion** = "numpy.sort(a, axis=-1, kind=None, order=None)"

Along with the mapping above, it may or may not also contain concrete usage examples like:

**prompt** = "Sort array [[1,4],[3,1]] along the last axis."
**completion** = "a = np.array([[1,4],[3,1]]) \n np.sort(a)"
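One way to picture how such pairs could be serialized into plain-text training examples before packing is sketched below. The `Question:`/`Answer:` template is an assumption for illustration only, not a format StarCoder requires:

```python
# Hedged sketch: turning prompt/completion pairs into single text records.
# The "Question:/Answer:" template is a hypothetical choice, not a fixed API.
examples = [
    {"prompt": "Return a sorted copy of an array.",
     "completion": "numpy.sort(a, axis=-1, kind=None, order=None)"},
    {"prompt": "Sort array [[1,4],[3,1]] along the last axis.",
     "completion": "a = np.array([[1,4],[3,1]])\nnp.sort(a)"},
]

def to_text(example):
    # One record per pair; records would later be tokenized and packed.
    return f"Question: {example['prompt']}\nAnswer: {example['completion']}"

texts = [to_text(e) for e in examples]
```

Whatever template is chosen, keeping it consistent across the whole dataset matters more than the specific wording, since the model learns the template itself.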
Hi @loubnabnl and @ArmelRandy
Thank you for your work on StarCoder. I am interested in fine-tuning the StarCoder LLM on a Python library that the model hasn't been exposed to yet. I was preparing the dataset for the fine-tuning job and had a few questions regarding it -
Thank you