NLPJCL / RAG-Retrieval

Unify Efficient Fine-tuning of RAG Retrieval, including Embedding, ColBERT, ReRanker.

Is it not possible to change the max_len parameter when fine-tuning the reranker? #51

Open lxlx2084 opened 2 days ago

lxlx2084 commented 2 days ago

I changed it from 512 to 1024 and got the following error.

```
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
train args: {'model_name_or_path': '/data/liuxiang/58/10_llm/6_姣旇禌/bge-reranker-large', 'model_type': 'cross_encoder', 'dataset': '../../../example_data/train_9.jsonl', 'output_dir': './output/train_9_ep15_bs32_lr2', 'save_on_epoch_end': 1, 'num_max_checkpoints': 5, 'max_len': 1024, 'epochs': 15, 'lr': 2e-05, 'batch_size': 32, 'seed': 666, 'warmup_proportion': 0.1, 'loss_type': 'classfication', 'log_with': 'wandb', 'mixed_precision': 'fp16', 'gradient_accumulation_steps': 3, 'num_labels': 1}

0it [00:00, ?it/s] 275it [00:00, 2748.69it/s] 1332it [00:00, 7346.62it/s] 1912it [00:00, 8260.15it/s]
Detected kernel version 5.4.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
Start training for 15 epochs

  0%|          | 0/713 [00:00<?, ?it/s]
[rank0]: Traceback (most recent call last):
[rank0]:   File "train_reranker.py", line 167, in <module>
[rank0]:     main()
[rank0]:   File "train_reranker.py", line 154, in main
[rank0]:     trainer.train()
[rank0]:   File "/data/liuxiang/58/10_llm/6_姣旇禌/RAG-Retrieval/rag_retrieval/train/reranker/trainer.py", line 68, in train
[rank0]:     batch_output = self.model(batch[0], batch[1])
[rank0]:   File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/torch/distributed/fsdp/fully_sharded_data_parallel.py", line 863, in forward
[rank0]:     output = self._fsdp_wrapped_module(*args, **kwargs)
[rank0]:   File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/accelerate/utils/operations.py", line 820, in forward
[rank0]:     return model_forward(*args, **kwargs)
[rank0]:   File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/accelerate/utils/operations.py", line 808, in __call__
[rank0]:     return convert_to_fp32(self.model_forward(*args, **kwargs))
[rank0]:   File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
[rank0]:     return func(*args, **kwargs)
[rank0]:   File "/data/liuxiang/58/10_llm/6_姣旇禌/RAG-Retrieval/rag_retrieval/train/reranker/model_bert.py", line 25, in forward
[rank0]:     output = self.model(**batch, labels=labels)
[rank0]:   File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 1327, in forward
[rank0]:     outputs = self.roberta(
[rank0]:   File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]:     return self._call_impl(*args, **kwargs)
[rank0]:   File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]:     return forward_call(*args, **kwargs)
[rank0]:   File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/transformers/models/xlm_roberta/modeling_xlm_roberta.py", line 908, in forward
[rank0]:     buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
[rank0]: RuntimeError: The expanded size of the tensor (1024) must match the existing size (514) at non-singleton dimension 1. Target sizes: [32, 1024]. Tensor sizes: [1, 514]

  0%|          | 0/713 [00:00<?, ?it/s]
E1202 21:48:31.724127 140026173690496 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 1690989) of binary: /data/liuxiang/anaconda3/envs/rag/bin/python
Traceback (most recent call last):
  File "/data/liuxiang/anaconda3/envs/rag/bin/accelerate", line 8, in <module>
    sys.exit(main())
  File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
    args.func(args)
  File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/accelerate/commands/launch.py", line 1155, in launch_command
    multi_gpu_launcher(args)
  File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/accelerate/commands/launch.py", line 793, in multi_gpu_launcher
    distrib_run.run(args)
  File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/liuxiang/anaconda3/envs/rag/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
```

NLPJCL commented 2 days ago

Because the maximum length of a standard BERT-style model is 512 (limited by its position embeddings).
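For reference, you can verify this limit directly from the model's config: bge-reranker-large is XLM-RoBERTa based, and its position embedding table has 514 rows (512 usable positions plus the 2 offset slots RoBERTa-style models reserve), which is exactly the 514 in the error above. A minimal check, assuming the transformers library and the public BAAI/bge-reranker-large checkpoint (substitute your local path):

```python
from transformers import AutoConfig, AutoTokenizer

# Hub ID used here for illustration; point this at your local bge-reranker-large copy instead.
model_name = "BAAI/bge-reranker-large"

config = AutoConfig.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# For XLM-RoBERTa-based models this prints 514: 512 real positions plus 2 offset slots.
print("max_position_embeddings:", config.max_position_embeddings)
# The tokenizer's advertised limit (512); any max_len above this overflows the
# position / token_type embedding buffers and fails as in the traceback above.
print("model_max_length:", tokenizer.model_max_length)
```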


lxlx2084 commented 2 days ago


But in my training data, the text of each positive/negative sample is much longer than 512 tokens, so the model only sees the beginning of each passage.
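To quantify the truncation: with max_len capped at 512, the tokenizer simply drops everything past the limit for each query/passage pair. A rough sketch for counting how many pairs are affected, assuming a query/pos/neg JSONL layout for the training file (the field names are an assumption; adjust them to whatever train_9.jsonl actually uses):

```python
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("BAAI/bge-reranker-large")

over_limit, total = 0, 0
with open("../../../example_data/train_9.jsonl") as f:
    for line in f:
        item = json.loads(line)
        # Hypothetical field names: "query", "pos", "neg".
        for passage in item.get("pos", []) + item.get("neg", []):
            # Length of the query+passage pair as the cross-encoder would see it.
            n_tokens = len(tokenizer(item["query"], passage)["input_ids"])
            total += 1
            if n_tokens > 512:
                over_limit += 1

print(f"{over_limit}/{total} pairs exceed 512 tokens and would be truncated")
```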

NLPJCL commented 2 days ago

Search for how to extend BERT's maximum length; there are many methods.
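One common approach (not something this repo does for you) is to enlarge the position embedding table before fine-tuning, e.g. by tiling or interpolating the pretrained 512 positions up to the target length, and then train with the larger max_len so the new positions get adapted. Below is a rough sketch of the tiling variant for an XLM-RoBERTa-based reranker, assuming a recent transformers version; treat it as illustrative rather than a drop-in patch for this repo's training scripts:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "BAAI/bge-reranker-large"   # XLM-RoBERTa-based cross-encoder
new_max_len = 1024                       # target usable sequence length
pad_offset = 2                           # RoBERTa reserves position ids 0 and 1

model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(model_name)
embeddings = model.roberta.embeddings

old_pos = embeddings.position_embeddings          # Embedding(514, hidden)
old_num, hidden = old_pos.weight.shape
new_num = new_max_len + pad_offset                # 1026 rows

# Build the enlarged table: keep the original 514 rows, then tile the 512
# learned positions until the table is full.
new_weight = old_pos.weight.data.new_empty(new_num, hidden)
new_weight[:old_num] = old_pos.weight.data
pos = old_num
while pos < new_num:
    chunk = min(old_num - pad_offset, new_num - pos)
    new_weight[pos:pos + chunk] = old_pos.weight.data[pad_offset:pad_offset + chunk]
    pos += chunk

new_pos = torch.nn.Embedding(new_num, hidden, padding_idx=old_pos.padding_idx)
new_pos.weight.data = new_weight
embeddings.position_embeddings = new_pos

# Also enlarge the cached buffers; the 514-sized token_type_ids buffer is what
# raised the RuntimeError in the traceback above.
embeddings.register_buffer("position_ids", torch.arange(new_num).unsqueeze(0), persistent=False)
embeddings.register_buffer("token_type_ids", torch.zeros(1, new_num, dtype=torch.long), persistent=False)

model.config.max_position_embeddings = new_num
tokenizer.model_max_length = new_max_len

# Save, then point model_name_or_path at this directory and train with max_len up to 1024.
model.save_pretrained("./bge-reranker-large-1024")
tokenizer.save_pretrained("./bge-reranker-large-1024")
```

Interpolating the old positions instead of tiling them, or switching to a reranker that natively supports longer inputs, are common alternatives.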
