Modalities / modalities

A framework for training multimodal foundation models.
MIT License

Fix/dataset index: Index values were faulty when indexing the original samples instead of blocks. #164

Closed le1nux closed 1 week ago

le1nux commented 2 weeks ago

What does this PR do?

The index values in the pbin files were wrong. They started with an offset, and on top of that we added another offset of HEADER size when reading from the file buffer. See here for the initial offset during pbin index creation: https://github.com/Modalities/modalities/blob/4aa2e88efe13c3eaab4c6b425fdb82caf0d2a443/src/modalities/dataloader/create_packed_data.py#L145

and the additional offset that is used when reading from the memmap during training:

https://github.com/Modalities/modalities/blob/4aa2e88efe13c3eaab4c6b425fdb82caf0d2a443/src/modalities/dataloader/create_packed_data.py#L262

This PR fixes this issue and makes the index always start at byte 0, only applying the offset once when reading from the memmap file.
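
To illustrate the intended behaviour, here is a minimal sketch and not the actual Modalities code. The names `HEADER_SIZE`, `build_packed` and `read_sample`, the 8-byte header, and the `(offset, length)` index layout are all assumptions made for the example. The index is built relative to the data section, so it starts at byte 0, and the header offset is added exactly once, at read time.

```python
HEADER_SIZE = 8  # hypothetical header length in bytes, chosen for this sketch


def build_packed(samples: list[bytes]) -> tuple[bytes, list[tuple[int, int]]]:
    """Concatenate samples behind a fixed-size header and index them from byte 0."""
    index = []
    offset = 0  # relative to the data section, so the first entry starts at 0
    for sample in samples:
        index.append((offset, len(sample)))
        offset += len(sample)
    data = b"".join(samples)
    # The header here only stores the data section length (an assumption).
    header = len(data).to_bytes(HEADER_SIZE, byteorder="little")
    return header + data, index


def read_sample(buffer: bytes, index: list[tuple[int, int]], i: int) -> bytes:
    """Read sample i, applying the header offset exactly once."""
    offset, length = index[i]
    start = HEADER_SIZE + offset
    return bytes(buffer[start:start + length])


samples = [b"abc", b"de", b"fghi"]
buffer, index = build_packed(samples)
second = read_sample(buffer, index, 1)
```

If the index had been created with the header offset already included, `read_sample` would skip `HEADER_SIZE` bytes twice and return data from the wrong position, which is the kind of fault described above.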

le1nux commented 2 weeks ago

fixes #163

le1nux commented 2 weeks ago

Yes, the inheritance structure can be improved. I suggest we do this in a separate PR, together with improving the "packing" terminology in cases where no actual packing happens.

I opened issue https://github.com/Modalities/modalities/issues/167 to address this.