facebookresearch / fairseq

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
MIT License

ValueError: offset must be non-negative and no greater than buffer length #5543

Open LiYixuan727 opened 1 month ago

LiYixuan727 commented 1 month ago

Hi, I'm training fairseq with the following script and getting the error ValueError: offset must be non-negative and no greater than buffer length.

fairseq-train data-bin --arch transformer \
    --max-epoch 10 \
    --max-tokens 2048 \
    --num-workers 20 \
    --max-sentences 5000 \
    --fp16 \
    --optimizer adam --lr-scheduler inverse_sqrt --lr 0.0007 \
    --criterion label_smoothed_cross_entropy

LiYixuan727 commented 1 month ago

And here is the whole traceback:

2024-09-23 14:53:13 | INFO | fairseq_cli.train | task: TranslationTask
2024-09-23 14:53:13 | INFO | fairseq_cli.train | model: TransformerModel
2024-09-23 14:53:13 | INFO | fairseq_cli.train | criterion: LabelSmoothedCrossEntropyCriterion
2024-09-23 14:53:13 | INFO | fairseq_cli.train | num. shared model params: 22,480,862,208 (num. trained: 22,480,862,208)
2024-09-23 14:53:13 | INFO | fairseq_cli.train | num. expert model params: 0 (num. trained: 0)
2024-09-23 14:53:13 | INFO | fairseq.data.data_utils | loaded 51,352 examples from: data-bin/valid.en-es.en
2024-09-23 14:53:13 | INFO | fairseq.data.data_utils | loaded 51,352 examples from: data-bin/valid.en-es.es
2024-09-23 14:53:13 | INFO | fairseq.tasks.translation | data-bin valid en-es 51352 examples
2024-09-23 14:53:45 | INFO | fairseq.utils | CUDA enviroments for all 1 workers
2024-09-23 14:53:45 | INFO | fairseq.utils | rank 0: capabilities = 8.6 ; total memory = 47.431 GB ; name = NVIDIA RTX A6000
2024-09-23 14:53:45 | INFO | fairseq.utils | CUDA enviroments for all 1 workers
2024-09-23 14:53:45 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2024-09-23 14:53:45 | INFO | fairseq_cli.train | max tokens per device = 4096 and max sentences per device = 5000
2024-09-23 14:53:45 | INFO | fairseq.trainer | Preparing to load checkpoint checkpoints/checkpoint_last.pt
2024-09-23 14:53:45 | INFO | fairseq.trainer | No existing checkpoint found checkpoints/checkpoint_last.pt
2024-09-23 14:53:45 | INFO | fairseq.trainer | loading train data for epoch 1
2024-09-23 14:53:49 | INFO | fairseq.data.data_utils | loaded 51,249,574 examples from: data-bin/train.en-es.en
2024-09-23 14:53:53 | INFO | fairseq.data.data_utils | loaded 51,249,574 examples from: data-bin/train.en-es.es
2024-09-23 14:53:53 | INFO | fairseq.tasks.translation | data-bin train en-es 51249574 examples
Traceback (most recent call last):
  File "/home/ag/.local/bin/fairseq-train", line 8, in <module>
    sys.exit(cli_main())
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq_cli/train.py", line 557, in cli_main
    distributed_utils.call_main(cfg, main)
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq/distributed/utils.py", line 369, in call_main
    main(cfg, **kwargs)
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq_cli/train.py", line 164, in main
    extra_state, epoch_itr = checkpoint_utils.load_checkpoint(
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq/checkpoint_utils.py", line 272, in load_checkpoint
    epoch_itr = trainer.get_train_iterator(
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq/trainer.py", line 719, in get_train_iterator
    self.reset_dummy_batch(batch_iterator.first_batch)
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq/data/iterators.py", line 368, in first_batch
    return self.collate_fn([self.dataset[i] for i in self.frozen_batches[0]])
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq/data/iterators.py", line 368, in <listcomp>
    return self.collate_fn([self.dataset[i] for i in self.frozen_batches[0]])
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq/data/language_pair_dataset.py", line 305, in __getitem__
    tgt_item = self.tgt[index] if self.tgt is not None else None
  File "/home/ag/.local/lib/python3.10/site-packages/fairseq/data/indexed_dataset.py", line 523, in __getitem__
    np_array = np.frombuffer(
ValueError: offset must be non-negative and no greater than buffer length (6711936916)
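The final frame shows np.frombuffer being handed a byte offset that falls outside the memory-mapped .bin file, which is consistent with the integer-overflow explanation later in this thread. A minimal stand-alone reproduction of the same NumPy error (purely illustrative, not using fairseq's data files):

import numpy as np

buf = bytes(10)  # a 10-byte buffer standing in for the memory-mapped .bin file

# An offset beyond the end of the buffer (or a negative one, as an overflowed
# 32-bit pointer would yield) raises the same ValueError as in the traceback.
np.frombuffer(buf, dtype=np.uint8, offset=100)
# ValueError: offset must be non-negative and no greater than buffer length (10)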

Herostomo commented 1 month ago

I wanted to offer my assistance regarding the ValueError: offset must be non-negative and no greater than buffer length error you encountered while training with Fairseq.

Summary of the issue: the error occurs while the first training batch is being assembled, when the code tries to read an example at a byte offset that lies outside the memory-mapped data buffer. This typically points to a problem with how the dataset was binarized or indexed.

Approach:
- Verify dataset integrity (see the check sketched after this list)
- Check data loading and indexing
- Check consistency between the source and target datasets
- Adjust the worker count
- Check configuration parameters
- Inspect data paths
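
As a concrete starting point for the first two items, a script along the lines of the sketch below tries to read every binarized example back and reports the first one that fails. The path is illustrative, and it assumes the data were binarized with the default mmap dataset implementation; adjust the data-bin prefix to your own setup.

from fairseq.data import indexed_dataset

ds = indexed_dataset.make_dataset("data-bin/train.en-es.en", impl="mmap")
print(f"loaded {len(ds)} examples")

for i in range(len(ds)):
    try:
        ds[i]  # forces the np.frombuffer read for this example
    except ValueError as e:
        print(f"example {i} cannot be read: {e}")
        break
else:
    print("all examples readable")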

dtamayo-nlp commented 1 month ago

Hi!

In my case this problem appeared because of an integer-precision issue when binarizing very long files in the corpus. It can be solved by adding the following lines here:

sizes = [np.int64(el) for el in sizes]
address = np.int64(0)

Then reprocess the corpus with fairseq-preprocess.
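
For readers who cannot follow the original "here" link: the two lines above are typically applied inside the index-pointer computation in fairseq/data/indexed_dataset.py. A minimal sketch of what the patched helper might look like, assuming that is the intended location (adapted for illustration, not a verbatim copy of the library code):

import numpy as np

def _get_pointers(sizes, dtype_size):
    # Cast the per-example sizes and the running byte address to int64 so the
    # offsets of a very large .bin file cannot overflow 32-bit integers.
    sizes = [np.int64(el) for el in sizes]
    address = np.int64(0)
    pointers = []
    for size in sizes:
        pointers.append(address)
        address += size * dtype_size
    return pointers

With the cast in place, re-running fairseq-preprocess rebuilds the index with correctly computed offsets.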

You could also avoid this problem by splitting your big files into smaller ones.