Curate SFT-9 dataset mixes

olliestanley commented 1 year ago

Iterate on the SFT-8 dataset mixes to create pretraining and final SFT mixes for SFT-9. This requires investigating the quality and usefulness of the datasets. Community input welcome below. See the sft8_training branch for the code state corresponding to the below SFT-8 configs.

SFT-8 pretraining mix

``` datasets: - gpteacher_roleplay: val_split: 0.05 - red_pajama: fraction: 0.25 max_val_set: 1000 - wizardlm_70k: val_split: 0.05 max_val_set: 500 - joke: val_split: 0.05 - poem_instructions: val_split: 0.025 - oa_stackexchange: val_split: 0.05 fraction: 0.1 max_val_set: 1000 - tell_a_joke: val_split: 0.05 max_val_set: 250 - webgpt: val_split: 0.05 max_val_set: 250 - gpt4all: val_split: 0.01 max_val_set: 1000 - alpaca_gpt4: val_split: 0.025 max_val_set: 250 - code_alpaca: val_split: 0.05 max_val_set: 250 - vicuna: max_val_set: 250 - oig_file: source_url: https://huggingface.co/datasets/laion/OIG/resolve/main/unified_chip2.jsonl max_count: 10000 min_length: 250 val_split: 0.05 max_val_set: 250 - minimath: val_split: 0.05 - humaneval_mbpp_codegen_qa: val_split: 0.05 - humaneval_mbpp_testgen_qa: val_split: 0.05 - grade_school_math_instructions: val_split: 0.05 - recipes: val_split: 0.05 - cmu_wiki_qa: val_split: 0.05 - oa_wiki_qa_bart_10000row: val_split: 0.05 max_val_set: 250 - prosocial_dialogue: fraction: 0.1 max_val_set: 250 - explain_prosocial: fraction: 0.075 max_val_set: 250 - soda: fraction: 0.25 max_val_set: 1000 - oa_leet10k: val_split: 0.05 max_val_set: 250 - dolly15k: val_split: 0.05 max_val_set: 300 ```

SFT-8 final SFT mix

``` datasets: - oasst_export: lang: "bg,ca,cs,da,de,en,es,fr,hr,hu,it,nl,pl,pt,ro,ru,sl,sr,sv,uk" input_file_path: 2023-05-06_OASST_labels.jsonl.gz val_split: 0.05 - vicuna: val_split: 0.05 max_val_set: 800 fraction: 0.4 - dolly15k: val_split: 0.05 max_val_set: 300 - grade_school_math_instructions: val_split: 0.05 - code_alpaca: val_split: 0.05 max_val_set: 250 - red_pajama: fraction: 0.05 max_val_set: 1000 - wizardlm_70k: val_split: 0.05 max_val_set: 500 fraction: 0.4 - poem_instructions: fraction: 0.5 val_split: 0.025 ```

Leading on this: @0x22almostEvil

Some initial requests from community include removal or reduction/filtering of prosocial_dialogue and explain_prosocial datasets from pretraining.

echo0x22 commented 1 year ago

I'm here!

r7l commented 1 year ago

Would it be possible to use the Starcoder dataset?

olliestanley commented 1 year ago

Would it be possible to use the Starcoder dataset?

Currently we include some RedPajama data with a language modelling objective during SFT to try to prevent catastrophic forgetting of pretraining knowledge. Maybe it would be possible to do something similar with StarCoder data. But I don't think we could train on the whole dataset, that would just be hugely expensive and more in the realms of foundation model pretraining than assistant finetuning.

r7l commented 1 year ago

Understandable. I'd assume that a large portion of current OA users are coders. So it might be reasonable for the start to have a good understanding for coding. It's pretty ok for SFT-7 already but there will always be room for improvement.

echo0x22 commented 1 year ago

Yeah, we might collect some good datasets for coding as well.

I'm currently doing a work with collecting reasoning, logic and semantics ones, as I noticed there are some problems in this field.

marcelklehr commented 1 year ago

Is it possible to train a model without any datasets that are legally questionable such as the code_alpaca and gpt4all which are trained on the OpenAI API AFAIK, which doesn't allow training models with its output? A fully open source model like this would be very helpful.

olliestanley commented 1 year ago

Since the results of the Guanaco paper I think it is clear SFT-9 should use a much smaller set of finetuning data and focus on high quality. I suggest we try a run dropping synthetic datasets, with maybe exceptions for those synthetic datasets which are clearly high-quality

andreaskoepf commented 1 year ago

While we definitely should use QLoRA (a groundbreaking result for the whole ML community) and only try a super-high quality final fine-tuning run (like OA top-1 threads, i.e. as it was done for Guanaco) I think the total situation is not that simple.

We already followed a 2-stage training approach. Guanaco of course goes a step further and trained only on the highest quality OA-data. When we decided to use the full OA set (I implemented a top-k thread filter which was not used) the idea was to create a diverse SFT output as input to the RL stage. Also probably we were a bit afraid of overfitting and with a too small dataset (and we saw that naive Dropout for the larger LLaMA models didn't work as well as for pythia .. Lima showed a better approach). And since QLoRA allows much faster iterations they could try a lot of different configurations in a short amount of time (rapid feedback loop is extremely beneficial if you have the right eval metrics).

In the fine-print of his twitter mega-thread Tim Dettmers writes:

"Our main finding here: (1) instruction tuning datasets are good for instruction following but bad for chatbot performance; (2) you can create a 99.3% of ChatGPT performance level chatbot with QLoRA in just 24 hours of fine-tuning!" (tweet) -> i.e instruction following and "fluffy" chat are two different things
"its really bad at math" (tweet)

What we clearly see is that the style of the model output can be greatly modified already with 1k (Lima) or 10k (QLoRA) examples. Whether additional "pre-training" is beneficial for capabilities or not was IMO not analyzed. We observed that pre-training has clearly an influence (e.g. negative with prosocial and positive with grade-school-math). Also we know that our SFT7e3 model although it most of the time fails to generate rhyming poems is our best model for following instructions and handling plugin-requests. The larger LLaMA models were pre-trained on 1.4T tokens .. the question is of course whether adding further data sets like synthetic instructions improve the desired model behavior or if they have detrimental effects. For pro-social "safety" datasets we concluded that their effect is overall negative and that they should be removed from future run but for others it is less clear and needs further analysis.

I see two obvious solutions/approaches for chat vs. plugins:

use something "mode" in system-prompt to specify whether we want instruction following or fluffy-talk mode
use multiple specialized models, e.g. one for chat and another one for instruction following

olliestanley commented 1 year ago

I agree that there is a clear distinction between datasets useful for chat-tuning vs instruction-following, but have a few points here.

We already followed a 2-stage training approach. Guanaco of course goes a step further and trained only on the highest quality OA-data. When we decided to use the full OA set (I implemented a top-k thread filter which was not used) the idea was to create a diverse SFT output as input to the RL stage.

This makes sense, but it seems to me that even if we continue the 2-stage approach we can most likely get sufficiently diverse outputs for RL even with a highly filtered OA set.

In the fine-print of his twitter mega-thread Tim Dettmers writes:

"Our main finding here: (1) instruction tuning datasets are good for instruction following but bad for chatbot performance; (2) you can create a 99.3% of ChatGPT performance level chatbot with QLoRA in just 24 hours of fine-tuning!" (tweet) -> i.e instruction following and "fluffy" chat are two different things

"its really bad at math" (tweet)

Yes, my suggestion would be to retain these high-quality instruction-following datasets (e.g. math instructions, poetry instructions, coding instructions, and I believe Dragan is building a plugin-instruction dataset) but remove the synthetic chat datasets. It seems like we do not need Alpaca, Vicuna, WizardLM, prosocial, roleplay, etc datasets which are chat-focused and likely to be lower quality than filtered OA and Dolly data, perhaps with the exception of alpaca_gpt4?

There are some datasets I haven't looked at so am less sure about (OIG file, soda, webgpt, recipes, wiki QA datasets)

I see two obvious solutions/approaches for chat vs. plugins:

use something "mode" in system-prompt to specify whether we want instruction following or fluffy-talk mode

use multiple specialized models, e.g. one for chat and another one for instruction following

I personally prefer the idea of having a single model which can do both - it aligns much better with the OA vision of running on consumer hardware. So the system prompt idea seems like a good starting point, imo

draganjovanovich commented 1 year ago

I am for single mode/model approach. And completely removing instruction datasets seams a bit too much. But I am all in for leaving only top quality samples/datasets.

LAION-AI / Open-Assistant

Curate SFT-9 dataset mixes #3144