FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

BGE-M3 fine-tuning raises pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays #955

Status: Open. MarcusEddie opened this issue 3 months ago.

MarcusEddie commented 3 months ago

Scenario: fine-tuning BGE-M3. The .jsonl training file contains 158,000 records; each record has one query, a pos list of length 1, and a neg list of length 15. The run fails with the exception shown in the log below.


```
WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
07/11/2024 12:50:54 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: True
07/11/2024 12:50:54 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: True
07/11/2024 12:50:54 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
07/11/2024 12:50:54 - INFO - __main__ - Training/evaluation parameters RetrieverTrainingArguments(
  _n_gpu=1, adafactor=False, adam_beta1=0.9, adam_beta2=0.999, adam_epsilon=1e-08, auto_find_batch_size=False,
  bf16=False, bf16_full_eval=False, colbert_dim=-1, data_seed=None, dataloader_drop_last=True,
  dataloader_num_workers=0, dataloader_persistent_workers=False, dataloader_pin_memory=True, ddp_backend=None,
  ddp_broadcast_buffers=None, ddp_bucket_cap_mb=None, ddp_find_unused_parameters=None, ddp_timeout=1800,
  debug=[], deepspeed=None, disable_tqdm=False, dispatch_batches=None, do_eval=False, do_predict=False,
  do_train=False, enable_sub_batch=True, eval_accumulation_steps=None, eval_delay=0, eval_steps=None,
  evaluation_strategy=no, fix_encoder=False, fix_position_embedding=False, fp16=True, fp16_backend=auto,
  fp16_full_eval=False, fp16_opt_level=O1, fsdp=[], fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
  fsdp_min_num_params=0, fsdp_transformer_layer_cls_to_wrap=None, full_determinism=False,
  gradient_accumulation_steps=1, gradient_checkpointing=False, gradient_checkpointing_kwargs=None,
  greater_is_better=None, group_by_length=False, half_precision_backend=auto, hub_always_push=False,
  hub_model_id=None, hub_private_repo=False, hub_strategy=every_save, hub_token=<HUB_TOKEN>,
  ignore_data_skip=False, include_inputs_for_metrics=False, include_num_input_tokens_seen=False,
  include_tokens_per_second=False, jit_mode_eval=False, label_names=None, label_smoothing_factor=0.0,
  learning_rate=1e-05, length_column_name=length, load_best_model_at_end=False, local_rank=0,
  log_level=passive, log_level_replica=warning, log_on_each_node=True, logging_first_step=False,
  logging_nan_inf_filter=True, logging_steps=10, logging_strategy=steps, lr_scheduler_kwargs={},
  lr_scheduler_type=linear, max_grad_norm=1.0, max_steps=-1, metric_for_best_model=None, mp_parameters=,
  neftune_noise_alpha=None, negatives_cross_device=True, no_cuda=False, normlized=True, num_train_epochs=5.0,
  optim=adamw_torch, optim_args=None, output_dir=./tunedModel/bge-m3/Full_E5, overwrite_output_dir=False,
  past_index=-1, per_device_eval_batch_size=8, per_device_train_batch_size=4, prediction_loss_only=False,
  push_to_hub=False, push_to_hub_model_id=None, push_to_hub_organization=None, push_to_hub_token=<PUSH_TO_HUB_TOKEN>,
  ray_scope=last, remove_unused_columns=True, report_to=[], resume_from_checkpoint=None,
  run_name=./tunedModel/bge-m3/Full_E5, save_on_each_node=False, save_only_model=False, save_safetensors=True,
  save_steps=30000, save_strategy=steps, save_total_limit=None, seed=42, self_distill_start_step=-1,
  sentence_pooling_method=cls, skip_memory_metrics=True, split_batches=False, temperature=0.02, tf32=None,
  torch_compile=False, torch_compile_backend=None, torch_compile_mode=None, torchdynamo=None,
  tpu_metrics_debug=False, tpu_num_cores=None, unified_finetuning=True, use_cpu=False, use_ipex=False,
  use_legacy_prediction_loop=False, use_mps_device=False, use_self_distill=False, warmup_ratio=0.0,
  warmup_steps=0, weight_decay=0.0,
)
07/11/2024 12:50:54 - INFO - __main__ - Model parameters ModelArguments(model_name_or_path='./models/bge-m3', config_name=None, tokenizer_name=None, cache_dir=None)
07/11/2024 12:50:54 - INFO - __main__ - Data parameters DataArguments(knowledge_distillation=False, train_data=['./data/train/Full'], cache_path=None, train_group_size=5, query_max_len=128, passage_max_len=128, max_example_num_per_dataset=None, query_instruction_for_retrieval=None, passage_instruction_for_retrieval=None, same_task_within_batch=True, shuffle_ratio=0.0, small_threshold=0, drop_threshold=0)
07/11/2024 12:50:54 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: True
07/11/2024 12:50:55 - INFO - __main__ - Config: XLMRobertaConfig {
  "_name_or_path": "./models/bge-m3",
  "architectures": ["XLMRobertaModel"],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 1024,
  "id2label": {"0": "LABEL_0"},
  "initializer_range": 0.02,
  "intermediate_size": 4096,
  "label2id": {"LABEL_0": 0},
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 8194,
  "model_type": "xlm-roberta",
  "num_attention_heads": 16,
  "num_hidden_layers": 24,
  "output_past": true,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "torch_dtype": "float32",
  "transformers_version": "4.37.2",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 250002
}
07/11/2024 12:50:59 - INFO - BGE_M3.modeling - loading existing colbert_linear and sparse_linear---------
Batch Size Dict: ['0-500: 4', '500-1000: 4', '1000-2000: 4', '2000-3000: 4', '3000-4000: 4', '4000-5000: 4', '5000-6000: 4', '6000-7000: 4', '7000-inf: 4']
loading data from ./data/train/Full/train_Full.jsonl ...
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 11125.47it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 1671.70it/s]
Generating train split: 0 examples [00:20, ? examples/s]
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 4593.98it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 771.01it/s]
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 6944.21it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 1103.18it/s]
Generating train split: 0 examples [00:21, ? examples/s]
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 3916.25it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 715.63it/s]
Generating train split: 0 examples [00:23, ? examples/s]
Traceback (most recent call last):
  File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 1940, in _prepare_split_single
    writer.write_table(table)
  File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/arrow_writer.py", line 571, in write_table
    pa_table = pa_table.combine_chunks()
  File "pyarrow/table.pxi", line 3633, in pyarrow.lib.Table.combine_chunks
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "./FlagEmbedding/BGE_M3/data.py", line 63, in __init__
    temp_dataset = datasets.load_dataset('json', data_files=file_path, split='train', cache_dir=args.cache_path, features=context_feat)
  File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/load.py", line 2153, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 1813, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 1958, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 1940, in _prepare_split_single
    writer.write_table(table)
  File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/arrow_writer.py", line 571, in write_table
    pa_table = pa_table.combine_chunks()
  File "pyarrow/table.pxi", line 3633, in pyarrow.lib.Table.combine_chunks
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "./FlagEmbedding/BGE_M3/run.py", line 155, in <module>
    main()
  File "./FlagEmbedding/BGE_M3/run.py", line 115, in main
    train_dataset = SameDatasetTrainDataset(args=data_args,
  File "./FlagEmbedding/BGE_M3/data.py", line 65, in __init__
    temp_dataset = datasets.load_dataset('json', data_files=file_path, split='train', cache_dir=args.cache_path, features=context_feat_kd)
  File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/load.py", line 2153, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 954, in download_and_prepare
    self._download_and_prepare(
  File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 1813, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 1958, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 4433.73it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 812.85it/s]
Generating train split: 0 examples [00:00, ? examples/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 720682 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 720683 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 720684 closing signal SIGTERM
```

staoxiao commented 3 months ago

This looks like a data-format problem; please confirm the file format is correct. If the file format is fine, try switching to a different version of the datasets library.