Scenario: fine-tuning BGE-M3. The .jsonl data file contains 158,000 records; each record holds one query, a pos list of length 1, and a neg list of length 15.

Error log:

WARNING:torch.distributed.run:
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
07/11/2024 12:50:54 - WARNING - __main__ - Process rank: 3, device: cuda:3, n_gpu: 1, distributed training: True, 16-bits training: True
07/11/2024 12:50:54 - WARNING - __main__ - Process rank: 1, device: cuda:1, n_gpu: 1, distributed training: True, 16-bits training: True
07/11/2024 12:50:54 - WARNING - __main__ - Process rank: 0, device: cuda:0, n_gpu: 1, distributed training: True, 16-bits training: True
07/11/2024 12:50:54 - INFO - __main__ - Training/evaluation parameters RetrieverTrainingArguments(
_n_gpu=1,
adafactor=False,
adam_beta1=0.9,
adam_beta2=0.999,
adam_epsilon=1e-08,
auto_find_batch_size=False,
bf16=False,
bf16_full_eval=False,
colbert_dim=-1,
data_seed=None,
dataloader_drop_last=True,
dataloader_num_workers=0,
dataloader_persistent_workers=False,
dataloader_pin_memory=True,
ddp_backend=None,
ddp_broadcast_buffers=None,
ddp_bucket_cap_mb=None,
ddp_find_unused_parameters=None,
ddp_timeout=1800,
debug=[],
deepspeed=None,
disable_tqdm=False,
dispatch_batches=None,
do_eval=False,
do_predict=False,
do_train=False,
enable_sub_batch=True,
eval_accumulation_steps=None,
eval_delay=0,
eval_steps=None,
evaluation_strategy=no,
fix_encoder=False,
fix_position_embedding=False,
fp16=True,
fp16_backend=auto,
fp16_full_eval=False,
fp16_opt_level=O1,
fsdp=[],
fsdp_config={'min_num_params': 0, 'xla': False, 'xla_fsdp_grad_ckpt': False},
fsdp_min_num_params=0,
fsdp_transformer_layer_cls_to_wrap=None,
full_determinism=False,
gradient_accumulation_steps=1,
gradient_checkpointing=False,
gradient_checkpointing_kwargs=None,
greater_is_better=None,
group_by_length=False,
half_precision_backend=auto,
hub_always_push=False,
hub_model_id=None,
hub_private_repo=False,
hub_strategy=every_save,
hub_token=,
ignore_data_skip=False,
include_inputs_for_metrics=False,
include_num_input_tokens_seen=False,
include_tokens_per_second=False,
jit_mode_eval=False,
label_names=None,
label_smoothing_factor=0.0,
learning_rate=1e-05,
length_column_name=length,
load_best_model_at_end=False,
local_rank=0,
log_level=passive,
log_level_replica=warning,
log_on_each_node=True,
logging_first_step=False,
logging_nan_inf_filter=True,
logging_steps=10,
logging_strategy=steps,
lr_scheduler_kwargs={},
lr_scheduler_type=linear,
max_grad_norm=1.0,
max_steps=-1,
metric_for_best_model=None,
mp_parameters=,
neftune_noise_alpha=None,
negatives_cross_device=True,
no_cuda=False,
normlized=True,
num_train_epochs=5.0,
optim=adamw_torch,
optim_args=None,
output_dir=./tunedModel/bge-m3/Full_E5,
overwrite_output_dir=False,
past_index=-1,
per_device_eval_batch_size=8,
per_device_train_batch_size=4,
prediction_loss_only=False,
push_to_hub=False,
push_to_hub_model_id=None,
push_to_hub_organization=None,
push_to_hub_token=,
ray_scope=last,
remove_unused_columns=True,
report_to=[],
resume_from_checkpoint=None,
run_name=./tunedModel/bge-m3/Full_E5,
save_on_each_node=False,
save_only_model=False,
save_safetensors=True,
save_steps=30000,
save_strategy=steps,
save_total_limit=None,
seed=42,
self_distill_start_step=-1,
sentence_pooling_method=cls,
skip_memory_metrics=True,
split_batches=False,
temperature=0.02,
tf32=None,
torch_compile=False,
torch_compile_backend=None,
torch_compile_mode=None,
torchdynamo=None,
tpu_metrics_debug=False,
tpu_num_cores=None,
unified_finetuning=True,
use_cpu=False,
use_ipex=False,
use_legacy_prediction_loop=False,
use_mps_device=False,
use_self_distill=False,
warmup_ratio=0.0,
warmup_steps=0,
weight_decay=0.0,
)
07/11/2024 12:50:54 - INFO - __main__ - Model parameters ModelArguments(model_name_or_path='./models/bge-m3', config_name=None, tokenizer_name=None, cache_dir=None)
07/11/2024 12:50:54 - INFO - __main__ - Data parameters DataArguments(knowledge_distillation=False, train_data=['./data/train/Full'], cache_path=None, train_group_size=5, query_max_len=128, passage_max_len=128, max_example_num_per_dataset=None, query_instruction_for_retrieval=None, passage_instruction_for_retrieval=None, same_task_within_batch=True, shuffle_ratio=0.0, small_threshold=0, drop_threshold=0)
07/11/2024 12:50:54 - WARNING - __main__ - Process rank: 2, device: cuda:2, n_gpu: 1, distributed training: True, 16-bits training: True
07/11/2024 12:50:55 - INFO - __main__ - Config: XLMRobertaConfig {
"_name_or_path": "./models/bge-m3",
"architectures": [
"XLMRobertaModel"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"classifier_dropout": null,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"id2label": {
"0": "LABEL_0"
},
"initializer_range": 0.02,
"intermediate_size": 4096,
"label2id": {
"LABEL_0": 0
},
"layer_norm_eps": 1e-05,
"max_position_embeddings": 8194,
"model_type": "xlm-roberta",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"output_past": true,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"torch_dtype": "float32",
"transformers_version": "4.37.2",
"type_vocab_size": 1,
"use_cache": true,
"vocab_size": 250002
}
07/11/2024 12:50:59 - INFO - BGE_M3.modeling - loading existing colbert_linear and sparse_linear---------
Batch Size Dict: ['0-500: 4', '500-1000: 4', '1000-2000: 4', '2000-3000: 4', '3000-4000: 4', '4000-5000: 4', '5000-6000: 4', '6000-7000: 4', '7000-inf: 4']
loading data from ./data/train/Full/train_Full.jsonl ...
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 11125.47it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 1671.70it/s]
Generating train split: 0 examples [00:20, ? examples/s]
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 4593.98it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 771.01it/s]
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 6944.21it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 1103.18it/s]
Generating train split: 0 examples [00:21, ? examples/s]
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 3916.25it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 715.63it/s]
Generating train split: 0 examples [00:23, ? examples/s]
Traceback (most recent call last):
File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 1940, in _prepare_split_single
writer.write_table(table)
File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/arrow_writer.py", line 571, in write_table
pa_table = pa_table.combine_chunks()
File "pyarrow/table.pxi", line 3633, in pyarrow.lib.Table.combine_chunks
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "./FlagEmbedding/BGE_M3/data.py", line 63, in __init__
temp_dataset = datasets.load_dataset('json', data_files=file_path, split='train', cache_dir=args.cache_path, features=context_feat)
File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/load.py", line 2153, in load_dataset
builder_instance.download_and_prepare(
File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 954, in download_and_prepare
self._download_and_prepare(
File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 1813, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 1958, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 1940, in _prepare_split_single
writer.write_table(table)
File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/arrow_writer.py", line 571, in write_table
pa_table = pa_table.combine_chunks()
File "pyarrow/table.pxi", line 3633, in pyarrow.lib.Table.combine_chunks
File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "./FlagEmbedding/BGE_M3/run.py", line 155, in <module>
main()
File "./FlagEmbedding/BGE_M3/run.py", line 115, in main
train_dataset = SameDatasetTrainDataset(args=data_args,
File "./FlagEmbedding/BGE_M3/data.py", line 65, in __init__
temp_dataset = datasets.load_dataset('json', data_files=file_path, split='train', cache_dir=args.cache_path, features=context_feat_kd)
File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/load.py", line 2153, in load_dataset
builder_instance.download_and_prepare(
File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 954, in download_and_prepare
self._download_and_prepare(
File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 1049, in _download_and_prepare
self._prepare_split(split_generator, **prepare_split_kwargs)
File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 1813, in _prepare_split
for job_id, done, content in self._prepare_split_single(
File "/home/leeyl/anaconda3/envs/codesearch3/lib/python3.10/site-packages/datasets/builder.py", line 1958, in _prepare_split_single
raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.builder.DatasetGenerationError: An error occurred while generating the dataset
Downloading data files: 100%|██████████| 1/1 [00:00<00:00, 4433.73it/s]
Extracting data files: 100%|██████████| 1/1 [00:00<00:00, 812.85it/s]
Generating train split: 0 examples [00:00, ? examples/s]
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 720682 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 720683 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 720684 closing signal SIGTERM
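For context, each line of train_Full.jsonl follows the query/pos/neg shape described in the scenario above. A minimal sketch (the field names match the FlagEmbedding fine-tuning convention; the text content below is hypothetical, only the list lengths match the real data):

```python
# Minimal sketch of one training record in train_Full.jsonl.
# Field names follow the FlagEmbedding fine-tuning convention; the
# actual text content here is hypothetical.
import json

record = {
    "query": "how to reverse a linked list",             # one query per line
    "pos": ["Reverse a singly linked list by ..."],       # len(pos) == 1
    "neg": [f"unrelated passage {i}" for i in range(15)]  # len(neg) == 15
}

line = json.dumps(record, ensure_ascii=False)  # one JSON object per line
parsed = json.loads(line)
print(len(parsed["pos"]), len(parsed["neg"]))  # 1 15
```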
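The ArrowInvalid "offset overflow while concatenating arrays" usually indicates that the 32-bit string offsets of a single Arrow chunk overflowed (roughly 2 GB of text in one chunk). One possible mitigation, sketched here as an assumption and not verified against this setup, is to split the large .jsonl into smaller shards and point load_dataset at the shard list instead of the single file:

```python
# Hypothetical workaround sketch: split a large .jsonl into shards so that
# no single Arrow chunk grows past the 32-bit offset limit that triggers
# "offset overflow while concatenating arrays". Paths and the shard size
# are illustrative assumptions.
from pathlib import Path

def split_jsonl(src: str, out_dir: str, lines_per_shard: int = 20000) -> list:
    """Write src into shard files of at most lines_per_shard lines each."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shards, buf, idx = [], [], 0

    def flush():
        # Write the buffered lines as one shard file and reset the buffer.
        nonlocal idx
        path = out / f"shard_{idx:04d}.jsonl"
        path.write_text("".join(buf), encoding="utf-8")
        shards.append(str(path))
        buf.clear()
        idx += 1

    with open(src, encoding="utf-8") as f:
        for line in f:
            buf.append(line)
            if len(buf) == lines_per_shard:
                flush()
    if buf:
        flush()
    return shards

# The shard list could then replace the single large file, e.g.:
# datasets.load_dataset('json', data_files=split_jsonl(file_path, './shards'),
#                       split='train', cache_dir=args.cache_path)
```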