facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

Bug in TokenBlockDataset causing potential CUDA OOM due to incorrect size casting #5527

Open YuvalRingel opened 4 months ago

YuvalRingel commented 4 months ago

Description:

When initializing an object of TokenBlockDataset, the sizes of data blocks (_sizes) are cast to np.uint16 or np.uint32 based on block_size. However, this casting leads to potential issues later in the code:

Note: Though this is an edge case that requires a very long sentence, it is indeed possible, and the training process does not detect the problem, leading either to a CUDA OOM exception or to faulty training.

  1. Incorrect Size Handling: Casting to np.uint16 truncates any size above 65535, so self.sizes stores values that do not correspond to the actual sizes of the data blocks. See these lines: https://github.com/facebookresearch/fairseq/blob/d9a627082fd03ec72a27a31a4e56289bfcb2e4e4/fairseq/data/token_block_dataset.py#L136-L140

    size_dtype = np.uint16 if block_size < 65535 else np.uint32
    ...
    _sizes = _sizes.astype(size_dtype)

    For example: (screenshot attached in the original issue)

  2. Filtering Issue: During filtering in filter_indices_by_size, the incorrect sizes in self.sizes can cause sentences with more tokens than max_tokens to be incorrectly retained, bypassing intended filtering logic. See this line: https://github.com/facebookresearch/fairseq/blob/d9a627082fd03ec72a27a31a4e56289bfcb2e4e4/fairseq/data/fairseq_dataset.py#L174

    indices = indices[self.sizes[indices] <= max_sizes]
  3. OOM in CUDA: This issue can propagate during data iteration and model training, potentially causing CUDA Out-of-Memory (OOM) errors when processing data samples with an extreme number of tokens that were not properly filtered out.

Steps to Reproduce:

  1. Train a model with --max-tokens below 65535 on data that contains a sample with more than 65535 tokens.
  2. Watch _sizes at that sample's index get truncated.
  3. Watch that sample's index pass through the filtering without being removed.

Expected Behavior: self.sizes should accurately reflect the sizes of data blocks, irrespective of the casting to np.uint16 or np.uint32.

Proposed Solution:

  1. Adjust the handling of _sizes and self.sizes to ensure that casting to np.uint16 or np.uint32 does not compromise the integrity of size information.
  2. Assert that there is no mismatch between _sizes and the actual sizes.
  3. Implement a verification mechanism or adjust the filtering logic in filter_indices_by_size to correctly handle sizes.