Closed Vectorrent closed 4 weeks ago
I implemented a builder.py class, which handles data preparation in a much more elegant way. It has other problems, but it's much cleaner code: more interpretable, less entangled with the training code, and a good foundation for further development. Will close this issue for now.
Currently, every example in a batch comes from a single dataset. Meaning, with a batch size of 3, we might get examples from 3 different datasets, but within each example, ALL of the data comes from the same dataset. This is sub-optimal.
Ideally, we would split ALL data on the eos_token, such that a single example in a batch could contain data from multiple sources. Not only would this add more variety to the training (which is always a good thing), it would allow multiple datasets to coexist in a single example - which would greatly help with generalization. To put this into perspective, the way things work today, it is essentially as if we are telling the model:
"If we are viewing 'chat data' (in your prompt), then you may NEVER use the 'research paper' format."
It's hard to explain, but trust me: this needs fixing.
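To make the idea concrete, here's a minimal sketch of the kind of eos-delimited packing described above. All names (pack_example, the toy datasets, the eos string) are illustrative, not the actual builder.py code:

```python
import random

# Illustrative eos marker; a real tokenizer would supply its own eos_token.
EOS = "<eos>"

def pack_example(datasets, block_size, rng=random.Random(0)):
    """Fill one fixed-length training example with eos-delimited
    documents, drawing each document from a randomly chosen dataset,
    so a single example can mix sources."""
    tokens = []
    while len(tokens) < block_size:
        dataset = rng.choice(datasets)   # may switch sources mid-example
        doc = rng.choice(dataset)        # one document's tokens
        tokens.extend(doc)
        tokens.append(EOS)               # boundary between documents
    return tokens[:block_size]           # truncate the last doc to fit

# Toy "datasets": each is a list of pre-tokenized documents.
chat = [["hi", "there"], ["how", "are", "you"]]
papers = [["abstract", "we", "propose"], ["results", "show"]]
example = pack_example([chat, papers], block_size=8)
```

With this scheme, the boundary between sources is just another eos_token, so a single example can legitimately contain both chat data and research-paper data.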