Add Unsupported Languages to Base Model

Yesterday, I was talking to @andreaskoepf on Discord about how to add a new language to a base LLM.

Today I saw this comment from @somerandomguyontheweb:

Hi @pourmand1376, sorry for a slightly off-topic question: could you please share any details on how your friend managed to fine-tune LLaMA on a text-only dataset, without instructions? I'm interested in doing the same thing with Belarusian Wikipedia, but so far I've only seen tutorials on how to instruct-tune LLaMA, and Wikipedia articles as such don't contain clearly delimited prompts and responses. Could you please briefly describe the approach? Thanks in advance for any comments.

It seems that there are others like me who would like to fine-tune LLMs for unsupported languages like Persian.

This issue can be the place to discuss it. As for the question above, I only know that he used this repository as the base and changed a lot of things to make it work. I will ask him for further details.

However, I think this repo could potentially also be used for training base LLMs.

I think we need a clear guide for people like me on how to do this. From what I've seen so far, the Open-Assistant team has done a great job with SFT (supervised fine-tuning). But there seems to be no code for fine-tuning base LLMs for other languages.
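
To make the question concrete, here is a minimal sketch of the usual way plain text (such as Wikipedia articles) is prepared for causal language modeling when there are no prompts or responses: tokenize every document, join them into one token stream with an end-of-document marker, and cut the stream into fixed-length blocks where the labels are simply a copy of the inputs. This is not my friend's code and not code from this repo. The whitespace "tokenizer", `EOS_ID`, `build_vocab`, and `pack_into_blocks` are all hypothetical names made up for illustration; a real run would use the model's own tokenizer and a much larger block size.

```python
from typing import Dict, Iterable, List

EOS_ID = 0  # hypothetical id for an end-of-document marker (assumption)


def build_vocab(texts: Iterable[str]) -> Dict[str, int]:
    """Toy whitespace vocabulary; stands in for a real subword tokenizer."""
    vocab = {"<eos>": EOS_ID}
    for text in texts:
        for word in text.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def pack_into_blocks(
    texts: Iterable[str], vocab: Dict[str, int], block_size: int
) -> List[Dict[str, List[int]]]:
    """Concatenate tokenized documents and split them into equal blocks."""
    stream: List[int] = []
    for text in texts:
        stream.extend(vocab[word] for word in text.split())
        stream.append(EOS_ID)  # mark where one article ends

    # Drop the incomplete tail so every example has the same length.
    usable = (len(stream) // block_size) * block_size

    examples = []
    for start in range(0, usable, block_size):
        block = stream[start:start + block_size]
        # No prompt/response split: the labels are a copy of the inputs,
        # and the next-token shift is left to the training code.
        examples.append({"input_ids": block, "labels": list(block)})
    return examples


articles = [
    "Minsk is the capital of Belarus",
    "Persian is spoken in Iran",
]
vocab = build_vocab(articles)
examples = pack_into_blocks(articles, vocab, block_size=4)
for example in examples:
    print(example)
```

The point is only that the objective is next-token prediction over raw text, so no instruction formatting is needed. Whether this matches what was actually done on top of this repo is something I still have to confirm.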

LAION-AI / Open-Assistant, issue #3636