Description:
When initializing a TokenBlockDataset object, the sizes of the data blocks (_sizes) are cast to np.uint16 or np.uint32 based on block_size. However, this casting leads to problems later in the code:
Note: Although this is an edge case that requires a very long sentence, it is indeed possible, and the training process does not detect that anything is wrong, leading to a CUDA OOM exception or faulty training.

Incorrect Size Handling: Casting to np.uint16 truncates the sizes, so self.sizes stores values that do not correspond to the actual sizes of the data blocks. For an example, see these lines: https://github.com/facebookresearch/fairseq/blob/d9a627082fd03ec72a27a31a4e56289bfcb2e4e4/fairseq/data/token_block_dataset.py#L136-L140

Filtering Issue: During filtering in filter_indices_by_size, the incorrect values in self.sizes can cause sentences with more tokens than max_tokens to be retained, bypassing the intended filtering logic (a standalone demonstration follows below). See this line: https://github.com/facebookresearch/fairseq/blob/d9a627082fd03ec72a27a31a4e56289bfcb2e4e4/fairseq/data/fairseq_dataset.py#L174
indices = indices[self.sizes[indices] <= max_sizes]
OOM in CUDA: This issue can propagate during data iteration and model training, potentially causing CUDA Out-of-Memory (OOM) errors when processing data samples with an extreme number of tokens that were not properly filtered out.
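For illustration, here is a minimal standalone reproduction of the truncation and the resulting filter bypass. This is plain NumPy with made-up sizes, not fairseq code; it only mimics the comparison used by filter_indices_by_size:

```python
import numpy as np

# Made-up block sizes: the middle sample has more than 65535 tokens.
actual_sizes = np.array([1200, 70000, 880], dtype=np.int64)

# Casting to uint16 silently wraps values above 65535: 70000 -> 70000 % 65536 = 4464.
stored_sizes = actual_sizes.astype(np.uint16)
print(stored_sizes)  # [1200 4464  880]

# The same comparison as in filter_indices_by_size now sees 4464 instead of
# 70000, so the oversized sample slips past a max_sizes threshold of 8192.
indices = np.arange(len(stored_sizes))
max_sizes = 8192
print(indices[stored_sizes[indices] <= max_sizes])  # [0 1 2] -- index 1 should have been dropped
```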
Steps to Reproduce:
Train a model with --max-tokens below 65535 and include a data sample with more than 65535 tokens.
Watch _sizes at the index of that data sample get truncated to a smaller value.
Watch the index of that data sample pass through the filtering instead of being dropped.
Expected Behavior:
self.sizes should accurately reflect the sizes of the data blocks, irrespective of the casting to np.uint16 or np.uint32.
Proposed Solution:
Adjust the handling of _sizes and self.sizes so that casting to np.uint16 or np.uint32 does not compromise the integrity of the size information (see the sketch after this list).
Assert that there is no mismatch between _sizes and the actual sizes.
Implement a verification mechanism or adjust the filtering logic in filter_indices_by_size to handle sizes correctly.
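A minimal sketch of what the first two points could look like. The function and variable names are illustrative assumptions, not TokenBlockDataset's actual internals; the idea is to pick the dtype from the data itself rather than from block_size, and to assert that the packed values round-trip:

```python
import numpy as np

def pack_sizes(sizes):
    """Illustrative sketch: store block sizes in the smallest unsigned dtype
    that can actually hold them, instead of deriving the dtype from block_size."""
    sizes = np.asarray(sizes, dtype=np.int64)
    max_size = int(sizes.max()) if sizes.size else 0
    if max_size <= np.iinfo(np.uint16).max:
        dtype = np.uint16
    elif max_size <= np.iinfo(np.uint32).max:
        dtype = np.uint32
    else:
        dtype = np.uint64
    packed = sizes.astype(dtype)
    # Assert no mismatch between the packed sizes and the actual sizes,
    # so any future overflow fails loudly instead of corrupting filtering.
    assert np.array_equal(packed.astype(np.int64), sizes), "size dtype too narrow"
    return packed
```

Alternatively, simply keeping the sizes in a wider dtype such as np.int64 would avoid the truncation entirely, at the cost of some memory.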