
SMIT: A Simple Modality Integration Tool

Mitigate catastrophic forgetting #16

Open Thytu opened 5 months ago

Thytu commented 5 months ago

The current workflow leads to a certain amount of catastrophic forgetting: the base model, abacaj/phi-2-super, reaches an average of $62.13$ on the open_llm_leaderboard, while the resulting model, Thytu/phi-2-audio-super, falls to $35.79$.

| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|---|---|---|---|---|---|---|
| abacaj/phi-2-super | 62.13 | 61.86 | 76.6 | 58.41 | 48.37 | 73.01 | 54.51 |
| Thytu/phi-2-audio-super | 35.79 | 33.96 | 43.17 | 28.67 | 50.91 | 58.01 | 0 |
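
For reference, the gap can be checked locally with EleutherAI's lm-evaluation-harness. A minimal sketch, assuming the v0.4 `simple_evaluate` API and the default `hf` backend; the exact task names and per-task few-shot counts used by the Open LLM Leaderboard may differ from what is shown here:

```python
# Hypothetical local re-run of the benchmarks above with lm-evaluation-harness.
# Task names and few-shot settings are assumptions, not the leaderboard's exact configs.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=Thytu/phi-2-audio-super,dtype=bfloat16",
    tasks=["arc_challenge", "hellaswag", "mmlu", "truthfulqa_mc2", "winogrande", "gsm8k"],
    batch_size=8,
)

# Print the per-task metrics to compare against the base model's scores.
for task, metrics in results["results"].items():
    print(task, metrics)
```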

While some degradation is expected for a 2B-parameter model, the resulting model shouldn't fall to such a low average.

One interesting result is that when training the model on text-only data (i.e., without training it to become multimodal), the average still drops considerably:

| Model | Average | ARC | HellaSwag | MMLU | TruthfulQA | Winogrande | GSM8K |
|---|---|---|---|---|---|---|---|
| Multimodal | 35.79 | 33.96 | 43.17 | 28.67 | 50.91 | 58.01 | 0 |
| Text only | 35.36 | 35.92 | 45.33 | 24.58 | 46.21 | 59.98 | 0.15 |

This can mean either: