praxis

as above, so below
https://src.eco
MIT License

We need to split the data better #10

Closed by Vectorrent 4 weeks ago

Vectorrent commented 1 month ago

Currently, every example in a batch comes from a single dataset. Meaning, with a batch size of 3, the three examples might be drawn from 3 different datasets; however, within each example, ALL of the data comes from the same dataset. This is sub-optimal.

Ideally, we would split ALL data on the eos_token, so that a single example in a batch could contain data from multiple sources. Not only would this add more variety to the training (which is always a good thing), it would allow multiple datasets to coexist in a single example, which would greatly help with generalization.
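To make the idea concrete, here is a minimal sketch of document-level packing: every corpus is split on the eos_token, the resulting documents are pooled and shuffled across sources, and fixed-size examples are refilled from that mixed stream. All names below (the eos_token string, block size, whitespace "tokenization") are assumptions for illustration, not the repo's actual API.

```python
import random

EOS = "<|endoftext|>"  # assumed eos_token


def split_on_eos(text):
    """Split a raw corpus into individual documents at the eos_token."""
    return [doc for doc in text.split(EOS) if doc.strip()]


def pack_examples(datasets, block_size=32):
    """Interleave documents from all datasets into fixed-size examples.

    Because documents from every source share one shuffled pool, a single
    example can span multiple datasets, separated by the eos_token.
    """
    docs = []
    for corpus in datasets:
        docs.extend(split_on_eos(corpus))
    random.shuffle(docs)  # mix sources before packing

    # Toy "tokenization" by whitespace; a real pipeline would use a tokenizer.
    stream = f" {EOS} ".join(docs).split()
    return [
        stream[i:i + block_size]
        for i in range(0, len(stream) - block_size + 1, block_size)
    ]
```

With this layout, a chat transcript and a research paper can sit in the same training example, so the model never learns the spurious rule that one format excludes the other.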

To put this into perspective, the way things work today, it is essentially as if we are telling the model:

"If we are viewing 'chat data' (in your prompt), then you may NEVER use the 'research paper' format."

It's hard to explain, but trust me: this needs fixing.

Vectorrent commented 4 weeks ago

I implemented a builder.py class, which handles data preparation in a much more elegant way. It has other problems, but the code is much cleaner, more interpretable, and less entangled with the training code, and it will make a good foundation for further development. Closing this issue for now.
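The actual builder.py is not shown in this thread, but a decoupled data-preparation class in this spirit might look like the following sketch. Every name here (`DataBuilder`, its methods, the default tokens) is a hypothetical illustration of the design, assuming the eos-splitting approach described above, not the repo's real implementation.

```python
class DataBuilder:
    """Prepares mixed-source training examples, independent of the trainer."""

    def __init__(self, eos_token="<|endoftext|>", block_size=32):
        self.eos_token = eos_token
        self.block_size = block_size
        self.documents = []

    def add_corpus(self, text):
        """Split a corpus on the eos_token and pool its documents."""
        self.documents.extend(
            doc for doc in text.split(self.eos_token) if doc.strip()
        )
        return self  # allow chaining: builder.add_corpus(a).add_corpus(b)

    def build(self):
        """Pack all pooled documents into fixed-size examples."""
        # Toy whitespace "tokenization"; a real builder would tokenize properly.
        stream = f" {self.eos_token} ".join(self.documents).split()
        n, b = len(stream), self.block_size
        return [stream[i:i + b] for i in range(0, n - b + 1, b)]
```

Keeping preparation behind one class like this is what makes the pipeline "less entangled": the training loop only ever sees finished examples and never needs to know how many source datasets fed them.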