Description:
When initializing a TokenBlockDataset object, the sizes of the data blocks (_sizes) are cast to np.uint16 or np.uint32 based on block_size. However, this casting leads to problems later in the code:
Note: Although this is an edge case that requires a very long sentence, it is indeed possible, and the training process does not detect that anything is wrong, leading to a CUDA OOM exception or faulty training.

Incorrect Size Handling: Casting to np.uint16 truncates the sizes, so self.sizes stores values that do not correspond to the actual sizes of the data blocks. For an example, see these lines: https://github.com/facebookresearch/fairseq/blob/d9a627082fd03ec72a27a31a4e56289bfcb2e4e4/fairseq/data/token_block_dataset.py#L136-L140

Filtering Issue: During filtering in filter_indices_by_size, the incorrect values in self.sizes can cause sentences with more tokens than max_tokens to be retained, bypassing the intended filtering logic (a standalone demonstration follows below). See this line: https://github.com/facebookresearch/fairseq/blob/d9a627082fd03ec72a27a31a4e56289bfcb2e4e4/fairseq/data/fairseq_dataset.py#L174
indices = indices[self.sizes[indices] <= max_sizes]
OOM in CUDA: This issue can propagate during data iteration and model training, potentially causing CUDA Out-of-Memory (OOM) errors when processing data samples with an extreme number of tokens that were not properly filtered out.
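For illustration, here is a minimal standalone reproduction of the truncation and the resulting filter bypass. This is plain NumPy with made-up sizes, not fairseq code; it only mimics the comparison used by filter_indices_by_size:

```python
import numpy as np

# Made-up block sizes: the middle sample has more than 65535 tokens.
actual_sizes = np.array([1200, 70000, 880], dtype=np.int64)

# Casting to uint16 silently wraps values above 65535: 70000 -> 70000 % 65536 = 4464.
stored_sizes = actual_sizes.astype(np.uint16)
print(stored_sizes)  # [1200 4464  880]

# The same comparison as in filter_indices_by_size now sees 4464 instead of
# 70000, so the oversized sample slips past a max_sizes threshold of 8192.
indices = np.arange(len(stored_sizes))
max_sizes = 8192
print(indices[stored_sizes[indices] <= max_sizes])  # [0 1 2] -- index 1 should have been dropped
```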
Steps to Reproduce:
Train a model with --max-tokens below 65535 and include a data sample with more than 65535 tokens.
Watch _sizes at the index of that data sample get truncated to a smaller value.
Watch the index of that data sample pass through the filtering instead of being dropped.
Expected Behavior:
self.sizes should accurately reflect the sizes of the data blocks, irrespective of the casting to np.uint16 or np.uint32.
Proposed Solution:
Adjust the handling of _sizes and self.sizes so that casting to np.uint16 or np.uint32 does not compromise the integrity of the size information (see the sketch after this list).
Assert that there is no mismatch between _sizes and the actual sizes.
Implement a verification mechanism or adjust the filtering logic in filter_indices_by_size to handle sizes correctly.
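A minimal sketch of what the first two points could look like. The function and variable names are illustrative assumptions, not TokenBlockDataset's actual internals; the idea is to pick the dtype from the data itself rather than from block_size, and to assert that the packed values round-trip:

```python
import numpy as np

def pack_sizes(sizes):
    """Illustrative sketch: store block sizes in the smallest unsigned dtype
    that can actually hold them, instead of deriving the dtype from block_size."""
    sizes = np.asarray(sizes, dtype=np.int64)
    max_size = int(sizes.max()) if sizes.size else 0
    if max_size <= np.iinfo(np.uint16).max:
        dtype = np.uint16
    elif max_size <= np.iinfo(np.uint32).max:
        dtype = np.uint32
    else:
        dtype = np.uint64
    packed = sizes.astype(dtype)
    # Assert no mismatch between the packed sizes and the actual sizes,
    # so any future overflow fails loudly instead of corrupting filtering.
    assert np.array_equal(packed.astype(np.int64), sizes), "size dtype too narrow"
    return packed
```

Alternatively, simply keeping the sizes in a wider dtype such as np.int64 would avoid the truncation entirely, at the cost of some memory.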