praxis

as above, so below
https://src.eco
MIT License

We need to split the data better #10

Closed by Vectorrent 4 weeks ago

Vectorrent commented 1 month ago

Currently, every example in a batch comes from a single dataset. Meaning, with a batch size of 3, the three examples might be drawn from 3 different datasets; however, within each example, ALL of the data comes from the same dataset. This is sub-optimal.

Ideally, we would split ALL data on the eos_token, so that a single example in a batch could contain data from multiple sources. Not only would this add more variety to the training (which is always a good thing), it would allow multiple datasets to coexist in a single example, which would greatly help with generalization.
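To make the idea concrete, here is a minimal sketch of document-level packing: every corpus is split on the eos_token, the resulting documents are pooled and shuffled across sources, and fixed-size examples are refilled from that mixed stream. All names below (the eos_token string, block size, whitespace "tokenization") are assumptions for illustration, not the repo's actual API.

```python
import random

EOS = "<|endoftext|>"  # assumed eos_token


def split_on_eos(text):
    """Split a raw corpus into individual documents at the eos_token."""
    return [doc for doc in text.split(EOS) if doc.strip()]


def pack_examples(datasets, block_size=32):
    """Interleave documents from all datasets into fixed-size examples.

    Because documents from every source share one shuffled pool, a single
    example can span multiple datasets, separated by the eos_token.
    """
    docs = []
    for corpus in datasets:
        docs.extend(split_on_eos(corpus))
    random.shuffle(docs)  # mix sources before packing

    # Toy "tokenization" by whitespace; a real pipeline would use a tokenizer.
    stream = f" {EOS} ".join(docs).split()
    return [
        stream[i:i + block_size]
        for i in range(0, len(stream) - block_size + 1, block_size)
    ]
```

With this layout, a chat transcript and a research paper can sit in the same training example, so the model never learns the spurious rule that one format excludes the other.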

To put this into perspective, the way things work today, it is essentially as if we are telling the model:

"If we are viewing 'chat data' (in your prompt), then you may NEVER use the 'research paper' format."

It's hard to explain, but trust me: this needs fixing.

Vectorrent commented 4 weeks ago

I implemented a builder.py class, which handles data preparation in a much more elegant way. It has other problems, but the code is much cleaner, more interpretable, and less entangled with the training code, and it will make a good foundation for further development. Closing this issue for now.
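The actual builder.py is not shown in this thread, but a decoupled data-preparation class in this spirit might look like the following sketch. Every name here (`DataBuilder`, its methods, the default tokens) is a hypothetical illustration of the design, assuming the eos-splitting approach described above, not the repo's real implementation.

```python
class DataBuilder:
    """Prepares mixed-source training examples, independent of the trainer."""

    def __init__(self, eos_token="<|endoftext|>", block_size=32):
        self.eos_token = eos_token
        self.block_size = block_size
        self.documents = []

    def add_corpus(self, text):
        """Split a corpus on the eos_token and pool its documents."""
        self.documents.extend(
            doc for doc in text.split(self.eos_token) if doc.strip()
        )
        return self  # allow chaining: builder.add_corpus(a).add_corpus(b)

    def build(self):
        """Pack all pooled documents into fixed-size examples."""
        # Toy whitespace "tokenization"; a real builder would tokenize properly.
        stream = f" {self.eos_token} ".join(self.documents).split()
        n, b = len(stream), self.block_size
        return [stream[i:i + b] for i in range(0, n - b + 1, b)]
```

Keeping preparation behind one class like this is what makes the pipeline "less entangled": the training loop only ever sees finished examples and never needs to know how many source datasets fed them.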