Open Instruction Generalist (OIG) Dataset is intended to train assistants that are part of the LAION-AI's family of assistants. OIG Assistants will be trained on the OIG dataset, a massive synthetic instructions with the goal of performing many diverse types of tasks.
We will have several versions of the OIG Assistant dataset ranging from a small (less than 1M) high quality synthetic dataset, to a massive synthetic instruction dataset. The research goal of OIG Assistant is to create high performing bots by using simple finetuning instead of RLHF.
We will create ever larger instruction datasets with the goal to generate eventually 1T medium quality tokens of instructions. The receipe for training is to do additional pretrain on some subset of the larger instruction sets, followed by a finetune on OIG-small or some other high quality small dataset.
We have also created a small subset of safety data to tag instructions for moderation. This dataset was created by volunteers and also curated and augmented from public datasets (see https://huggingface.co/datasets/ontocord/OIG-moderation)
The community has trained several models based on a subset of the OIG datasets including:
Available on huggingface.co.