OpenLLM-Ro / llama-recipes


Publish datasets on HuggingFace #2

Closed flaviusburca closed 1 month ago

flaviusburca commented 4 months ago

Are the actual datasets open-source? Will they be published on HuggingFace?

trebedea commented 4 months ago

We will publish the datasets as well in the next 2-3 weeks. Right now the focus is on releasing Ro-Mistral-7B-Instruct with a substantial improvement over Ro-Llama2-7b, using the same recipe and datasets.

rennokki commented 4 months ago

@trebedea Are the training datasets mainly a RO sub-split of the multilingual datasets that Llama-2 was trained on, plus some others? If so, any idea whether a Llama-3 version is on the roadmap?

trebedea commented 4 months ago

We do open research, so the technical report, including the datasets used for training and finetuning, is public: https://arxiv.org/abs/2405.07703

Our main aim is to identify a "recipe" that allows improving any generic LLM, including Mistral / Mixtral or Llama-3. Right now we have promising results showing that Mistral-7B can be improved using the current "recipe".

The current recipe is (in short; detailed in the paper / technical report):

Hope this makes sense.

rennokki commented 4 months ago

@trebedea Yes, it does. Thanks! 👍🏼

MihaiMasala commented 1 month ago

The translated datasets are now available on HuggingFace.