Modalities / modalities

A framework for training multimodal foundation models.
MIT License
38 stars 3 forks source link

Bug: Dataset implementation does not #163

Closed le1nux closed 2 days ago

le1nux commented 2 weeks ago

Since we were only using packing in practice this issue has not been observed so far.

While fixing this, we should also use byte-wise indices globally and then when reading from the memmap file as part of the buffer recompute the index position to token position in the buffer.

https://github.com/Modalities/modalities/blob/4aa2e88efe13c3eaab4c6b425fdb82caf0d2a443/src/modalities/dataloader/dataset.py#L161C1-L169C60

The documentation also need the respective updates.