LiYixuan727 opened 1 month ago

Hi, I'm training fairseq with the following script and get the error ValueError: offset must be non-negative and no greater than buffer length.

```
fairseq-train data-bin --arch transformer \
    --max-epoch 10 \
    --max-tokens 2048 \
    --num-workers 20 \
    --max-sentences 5000 \
    --fp16 \
    --optimizer adam --lr-scheduler inverse_sqrt --lr 0.0007 \
    --criterion label_smoothed_cross_entropy
```

And here is the whole traceback:
2024-09-23 14:53:13 | INFO | fairseq_cli.train | task: TranslationTask
2024-09-23 14:53:13 | INFO | fairseq_cli.train | model: TransformerModel
2024-09-23 14:53:13 | INFO | fairseq_cli.train | criterion: LabelSmoothedCrossEntropyCriterion
2024-09-23 14:53:13 | INFO | fairseq_cli.train | num. shared model params: 22,480,862,208 (num. trained: 22,480,862,208)
2024-09-23 14:53:13 | INFO | fairseq_cli.train | num. expert model params: 0 (num. trained: 0)
2024-09-23 14:53:13 | INFO | fairseq.data.data_utils | loaded 51,352 examples from: data-bin/valid.en-es.en
2024-09-23 14:53:13 | INFO | fairseq.data.data_utils | loaded 51,352 examples from: data-bin/valid.en-es.es
2024-09-23 14:53:13 | INFO | fairseq.tasks.translation | data-bin valid en-es 51352 examples
2024-09-23 14:53:45 | INFO | fairseq.utils | CUDA enviroments for all 1 workers
2024-09-23 14:53:45 | INFO | fairseq.utils | rank 0: capabilities = 8.6 ; total memory = 47.431 GB ; name = NVIDIA RTX A6000
2024-09-23 14:53:45 | INFO | fairseq.utils | CUDA enviroments for all 1 workers
2024-09-23 14:53:45 | INFO | fairseq_cli.train | training on 1 devices (GPUs/TPUs)
2024-09-23 14:53:45 | INFO | fairseq_cli.train | max tokens per device = 4096 and max sentences per device = 5000
2024-09-23 14:53:45 | INFO | fairseq.trainer | Preparing to load checkpoint checkpoints/checkpoint_last.pt
2024-09-23 14:53:45 | INFO | fairseq.trainer | No existing checkpoint found checkpoints/checkpoint_last.pt
2024-09-23 14:53:45 | INFO | fairseq.trainer | loading train data for epoch 1
2024-09-23 14:53:49 | INFO | fairseq.data.data_utils | loaded 51,249,574 examples from: data-bin/train.en-es.en
2024-09-23 14:53:53 | INFO | fairseq.data.data_utils | loaded 51,249,574 examples from: data-bin/train.en-es.es
2024-09-23 14:53:53 | INFO | fairseq.tasks.translation | data-bin train en-es 51249574 examples
Traceback (most recent call last):
File "/home/ag/.local/bin/fairseq-train", line 8, in
I wanted to offer my assistance regarding the ValueError: offset must be non-negative and no greater than buffer length error you encountered while training with fairseq.

Summary of the issue: the error occurs during training, specifically when the code attempts to access an index in the dataset that is out of range. This typically indicates a problem with the dataset's formatting or indexing.

Approach (the sketch below covers the first two checks):
- Verify dataset integrity
- Check data loading and indexing
- Check consistency between the source and target datasets
- Adjust the worker count
- Check configuration parameters
- Inspect data paths
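For the first two checks, here is a minimal sketch using fairseq's public data API (Dictionary.load and data_utils.load_indexed_dataset are real fairseq calls; the dictionary and dataset paths are assumptions based on the standard fairseq-preprocess layout and the log above):

```python
# Minimal integrity check for a fairseq binarized dataset (sketch).
# Paths assume the standard fairseq-preprocess output layout.
from fairseq.data import Dictionary, data_utils

src_dict = Dictionary.load("data-bin/dict.en.txt")
tgt_dict = Dictionary.load("data-bin/dict.es.txt")

src = data_utils.load_indexed_dataset("data-bin/train.en-es.en", src_dict)
tgt = data_utils.load_indexed_dataset("data-bin/train.en-es.es", tgt_dict)

# Source and target sides must have the same number of examples.
assert len(src) == len(tgt), f"size mismatch: {len(src)} vs {len(tgt)}"

# Touch every example; a corrupt or overflowed pointer in the index
# raises the same ValueError seen in the traceback.
for i in range(len(src)):
    _ = src[i]
    _ = tgt[i]
```

On a 51M-sentence corpus the full pass is slow; sampling a few million random indices is usually enough to surface a bad offset.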
Hi!
In my case this problem appeared because of an integer-precision problem when processing long files during binarization of the corpus. It can be solved by adding the following lines here:

```python
sizes = [np.int64(el) for el in sizes]
address = np.int64(0)
```

and then processing the corpus again with fairseq-preprocess.
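For context: the "here" above was a link that did not survive this copy. Assuming it pointed at the pointer computation in the mmap index writer in fairseq/data/indexed_dataset.py, the patched helper would look roughly like this (a sketch, not the verbatim upstream code):

```python
import numpy as np

# Sketch of the index writer's pointer computation with the fix applied.
# Pointers are cumulative byte offsets into the data file; if the
# accumulator degrades to a 32-bit integer, it silently wraps past
# 2**31 - 1 on very large corpora, producing the negative offsets
# behind the ValueError at load time.
def _get_pointers(sizes, dtype_size):
    sizes = [np.int64(el) for el in sizes]  # suggested fix: force 64-bit sizes
    address = np.int64(0)                   # suggested fix: 64-bit accumulator
    pointers = []
    for size in sizes:
        pointers.append(address)
        address += size * dtype_size
    return pointers
```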
You could also avoid this problem by splitting your big files into smaller ones, as in the sketch below.
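If you go the splitting route, keep the source and target files aligned shard by shard; a sketch (the shard size is an arbitrary choice, and the helper name is ours):

```python
# Split a parallel corpus into aligned shards so that no single file
# needs offsets beyond the 32-bit range at binarization time.
def split_parallel(src_path, tgt_path, shard_size=10_000_000):
    with open(src_path) as src, open(tgt_path) as tgt:
        shard = 0
        out_src = open(f"{src_path}.{shard}", "w")
        out_tgt = open(f"{tgt_path}.{shard}", "w")
        for i, (s, t) in enumerate(zip(src, tgt)):
            if i and i % shard_size == 0:  # start a new shard pair
                out_src.close()
                out_tgt.close()
                shard += 1
                out_src = open(f"{src_path}.{shard}", "w")
                out_tgt = open(f"{tgt_path}.{shard}", "w")
            out_src.write(s)
            out_tgt.write(t)
        out_src.close()
        out_tgt.close()

split_parallel("train.en-es.en", "train.en-es.es")
```

Each shard pair can then be run through fairseq-preprocess separately.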