Since we have only been using packing in practice, this issue has not been observed so far.
While fixing this, we should also use byte-wise indices globally and, when reading from the memmap file into the buffer, convert each byte-wise index position into the corresponding token position in the buffer.
https://github.com/Modalities/modalities/blob/4aa2e88efe13c3eaab4c6b425fdb82caf0d2a443/src/modalities/dataloader/dataset.py#L161C1-L169C60
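
A rough sketch of the byte-to-token conversion proposed above, to make the idea concrete. This is not the current Modalities implementation: the helper `byte_span_to_token_span` and the names `buffer_byte_start` and `token_size_in_bytes` are hypothetical, and it assumes a fixed token width and byte-wise index entries of the form `(offset, length)`.

```python
import numpy as np


def byte_span_to_token_span(byte_offset, byte_length, buffer_byte_start, token_size_in_bytes):
    """Map a global byte-wise (offset, length) entry to a [start, end) token slice
    relative to a buffer that starts at `buffer_byte_start` in the file.
    Hypothetical helper, for illustration only."""
    relative_offset = byte_offset - buffer_byte_start
    if relative_offset < 0:
        raise ValueError("span starts before the beginning of the buffer")
    if relative_offset % token_size_in_bytes or byte_length % token_size_in_bytes:
        raise ValueError("span is not aligned to token boundaries")
    start = relative_offset // token_size_in_bytes
    return start, start + byte_length // token_size_in_bytes


# Stand-in for the tokenized data section of the memmap file: 10 tokens, 2 bytes each.
token_size_in_bytes = 2
raw = np.arange(10, dtype=np.uint16).tobytes()

# Assume the buffer was read starting at byte 4 of the data section (i.e. at token 2).
buffer_byte_start = 4
buffer = np.frombuffer(raw[buffer_byte_start:], dtype=np.uint16)

# Byte-wise index entry: the sample starts at byte 8 and is 6 bytes (3 tokens) long.
start, end = byte_span_to_token_span(8, 6, buffer_byte_start, token_size_in_bytes)
sample = buffer[start:end]
```

With this split, the index stays byte-wise everywhere and the division by the token size happens in exactly one place, at read time.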
The documentation also needs the respective updates.