FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

Fine-tuning command problem!!! #836

Closed · lower01 closed 3 months ago

lower01 commented 3 months ago

How exactly am I supposed to run this fine-tuning command? I have tried countless times in the command line, but the torchrun command cannot be found or recognized. My torch version is 1.8, which I thought was fine. I have read a lot of material: torchrun is apparently used for distributed training, and I would also need to install the NCCL library from the NVIDIA website, but installing NCCL locally on Windows is a real hassle, and I am not even sure that is the actual problem. Can I just skip distributed training? I only have one GPU, and the torchrun command simply is not recognized.

Could someone please point me in the right direction? How exactly should I run this command? (screenshot attached)

staoxiao commented 3 months ago

Your torch version is too old. You can use python xxx instead of torchrun xxx.
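
For a single-GPU run the distributed launcher can be skipped entirely and the training module started with the plain Python interpreter. A minimal sketch of the substitution (torchrun only ships with torch 1.10 and later, which is why torch 1.8 cannot find it; <args> stands in for the full argument list shown later in this thread):

# Distributed launcher; fails on torch 1.8 because torchrun does not exist yet:
# torchrun --nproc_per_node 1 -m FlagEmbedding.baai_general_embedding.finetune.run <args>

# Single-GPU equivalent; no launcher, no NCCL, same script and arguments:
python -m FlagEmbedding.baai_general_embedding.finetune.run <args>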

lower01 commented 3 months ago

Your torch version is too old. You can use python xxx instead of torchrun xxx.

Thank you. I have one more question, about the error ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 24584) of binary: F:\anaconda3\envs\RAG\python.exe. What causes it, and how can I fix it? I am on Windows with a single GPU, so I do not actually need distributed training. The relevant part of the error output is as follows:

  File "F:\anaconda3\envs\RAG\lib\site-packages\transformers\hf_argparser.py", line 339, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 134, in __init__
  File "F:\anaconda3\envs\RAG\lib\site-packages\transformers\training_args.py", line 1641, in __post_init__
    and (self.device.type == "cpu" and not is_torch_greater_or_equal_than_2_3)
  File "F:\anaconda3\envs\RAG\lib\site-packages\transformers\training_args.py", line 2149, in device
    return self._setup_devices
  File "F:\anaconda3\envs\RAG\lib\site-packages\transformers\utils\generic.py", line 59, in __get__
    cached = self.fget(obj)
  File "F:\anaconda3\envs\RAG\lib\site-packages\transformers\training_args.py", line 2081, in _setup_devices
    self.distributed_state = PartialState(
  File "F:\anaconda3\envs\RAG\lib\site-packages\accelerate\state.py", line 192, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "F:\anaconda3\envs\RAG\lib\site-packages\torch\distributed\distributed_c10d.py", line 583, in init_process_group
    default_pg = _new_process_group_helper(
  File "F:\anaconda3\envs\RAG\lib\site-packages\torch\distributed\distributed_c10d.py", line 708, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 24584) of binary: F:\anaconda3\envs\RAG\python.exe
Traceback (most recent call last):
  File "F:\anaconda3\envs\RAG\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "F:\anaconda3\envs\RAG\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "F:\anaconda3\envs\RAG\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "F:\anaconda3\envs\RAG\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "F:\anaconda3\envs\RAG\lib\site-packages\torch\distributed\run.py", line 719, in main
    run(args)
  File "F:\anaconda3\envs\RAG\lib\site-packages\torch\distributed\run.py", line 710, in run
    elastic_launch(
  File "F:\anaconda3\envs\RAG\lib\site-packages\torch\distributed\launcher\api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "F:\anaconda3\envs\RAG\lib\site-packages\torch\distributed\launcher\api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

FlagEmbedding.baai_general_embedding.finetune.run FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-30_15:07:39
  host      : MSI
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 24584)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I hope you can help me with this.
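
The decisive line is RuntimeError: Distributed package doesn't have NCCL built in. Windows builds of PyTorch are generally compiled without NCCL (gloo is the usual Windows backend), so any launcher that tries to create an NCCL process group fails there. A quick check of which backends the local build actually supports, assuming the RAG environment is active:

# Print which torch.distributed backends this PyTorch build supports;
# on Windows, nccl is normally False and gloo is True.
python -c "import torch.distributed as dist; print('nccl:', dist.is_nccl_available(), 'gloo:', dist.is_gloo_available())"
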
staoxiao commented 3 months ago

@lower01 can you show me the command you used?

lower01 commented 3 months ago

@lower01 can you show me the command you used?

Sure, here is the command I used:

torchrun --nproc_per_node 1 -m FlagEmbedding.baai_general_embedding.finetune.run --output_dir bge-large-zh-finetune --model_name_or_path bge-large-zh --train_data examples/finetune/toy_finetune_data.jsonl --learning_rate 1e-5 --fp16 --num_train_epochs 1 --per_device_train_batch_size 1 --dataloader_drop_last True --normlized True --temperature 0.02 --query_max_len 64 --passage_max_len 256 --train_group_size 2 --negatives_cross_device --logging_steps 10 --save_steps 1000 --query_instruction_for_retrieval ""

lower01 commented 3 months ago

@lower01 can you show me the command you used?

Hello, I sent you my command yesterday. Is there anything wrong with it? As for the error mentioned yesterday, RuntimeError: Distributed package doesn't have NCCL built in, followed by ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 24584) of binary: F:\anaconda3\envs\RAG\python.exe, is it caused by NCCL not being installed on my machine? It seems NCCL cannot be installed on Windows. Or is it some other problem?

staoxiao commented 3 months ago

Is there still an error when using python?

python -m FlagEmbedding.baai_general_embedding.finetune.run \
--output_dir bge-large-zh-finetune \
--model_name_or_path bge-large-zh \
--train_data examples/finetune/toy_finetune_data.jsonl \
--learning_rate 1e-5 \
--fp16 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--dataloader_drop_last True \
--normlized True \
--temperature 0.02 \
--query_max_len 64 \
--passage_max_len 256 \
--train_group_size 2 \
--negatives_cross_device \
--logging_steps 10 \
--save_steps 1000 \
--query_instruction_for_retrieval ""

lower01 commented 3 months ago

python -m FlagEmbedding.baai_general_embedding.finetune.run --output_dir bge-large-zh-finetune --model_name_or_path bge-large-zh --train_data examples/finetune/toy_finetune_data.jsonl --learning_rate 1e-5 --fp16 --num_train_epochs 1 --per_device_train_batch_size 1 --dataloader_drop_last True --normlized True --temperature 0.02 --query_max_len 64 --passage_max_len 256 --train_group_size 2 --negatives_cross_device --logging_steps 10 --save_steps 1000 --query_instruction_for_retrieval ""

After switching to python, there is indeed no longer a problem executing the command, but I still get an error related to distributed training. (screenshot attached)

staoxiao commented 3 months ago

You can remove --negatives_cross_device, which requires initializing a distributed environment.

python -m FlagEmbedding.baai_general_embedding.finetune.run \
--output_dir bge-large-zh-finetune \
--model_name_or_path bge-large-zh \
--train_data examples/finetune/toy_finetune_data.jsonl \
--learning_rate 1e-5 \
--fp16 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--dataloader_drop_last True \
--normlized True \
--temperature 0.02 \
--query_max_len 64 \
--passage_max_len 256 \
--train_group_size 2 \
--logging_steps 10 \
--save_steps 1000 \
--query_instruction_for_retrieval ""

Besides, I think per_device_train_batch_size is too small, which might hurt performance.
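
If GPU memory is what keeps per_device_train_batch_size at 1, gradient accumulation is a common workaround; the --gradient_accumulation_steps flag comes from the HuggingFace TrainingArguments that this training script builds on, not from this thread, and the value below is only an illustration:

# Sketch: accumulate gradients over 8 steps so the optimizer sees an
# effective batch of 1 x 8 = 8 without needing extra GPU memory.
python -m FlagEmbedding.baai_general_embedding.finetune.run \
  --output_dir bge-large-zh-finetune \
  --model_name_or_path bge-large-zh \
  --train_data examples/finetune/toy_finetune_data.jsonl \
  --learning_rate 1e-5 --fp16 --num_train_epochs 1 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --dataloader_drop_last True --normlized True --temperature 0.02 \
  --query_max_len 64 --passage_max_len 256 --train_group_size 2 \
  --logging_steps 10 --save_steps 1000 \
  --query_instruction_for_retrieval ""

Note that accumulation only enlarges the optimizer batch; with contrastive training the in-batch negatives still come from each micro-batch, so it is not a full substitute for a genuinely larger per_device_train_batch_size.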

lower01 commented 3 months ago

You can remove --negatives_cross_device, which requires initializing a distributed environment.

python -m FlagEmbedding.baai_general_embedding.finetune.run \
--output_dir bge-large-zh-finetune \
--model_name_or_path bge-large-zh \
--train_data examples/finetune/toy_finetune_data.jsonl \
--learning_rate 1e-5 \
--fp16 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--dataloader_drop_last True \
--normlized True \
--temperature 0.02 \
--query_max_len 64 \
--passage_max_len 256 \
--train_group_size 2 \
--logging_steps 10 \
--save_steps 1000 \
--query_instruction_for_retrieval ""

Besides, I think per_device_train_batch_size is too small, which might hurt performance.

OK, thank you. The corresponding data now loads correctly, but I get a TypeError as follows: (screenshot attached)

I cannot really pinpoint the problem.

staoxiao commented 3 months ago

You can try changing the version of the transformers package, e.g. to 4.37: pip install transformers==4.37
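
To confirm which version the RAG environment actually imports before and after the change:

# Print the currently installed transformers version in the active environment.
python -c "import transformers; print(transformers.__version__)"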

lower01 commented 3 months ago

You can try changing the version of the transformers package, e.g. to 4.37: pip install transformers==4.37

I downgraded transformers from 4.41 to 4.37, and now three errors appear when running: (screenshots attached)

Regarding the third error, my local model weight files should be fine, because I previously tested loading the model to embed a sentence. I looked up some material, but the explanations vary and I have not found a good solution.

lower01 commented 3 months ago

You can try changing the version of the transformers package, e.g. to 4.37: pip install transformers==4.37

Hello, I found the problem: transformers should be changed to 4.37.2. After making that change and trying again, neither the three errors above nor the earlier error appears. Thank you.
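
For anyone landing on the same errors, the exact pin reported to work above can be installed like this:

# Pin the patch release reported above to resolve both sets of errors.
pip install transformers==4.37.2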

lower01 commented 3 months ago

You can try changing the version of the transformers package, e.g. to 4.37: pip install transformers==4.37

Hello, one more question: how much GPU memory is normally needed to fine-tune bge-large-zh? With --per_device_train_batch_size 1, my 4 GB of VRAM cannot support the fine-tuning.
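
As a hedged sketch only, not something confirmed in this thread: bge-large-zh is a BERT-large-scale model (roughly 330M parameters), so full fine-tuning usually needs well over 4 GB even at batch size 1. The usual levers with a HuggingFace-Trainer-based script like this one are gradient checkpointing and shorter sequence lengths; the --gradient_checkpointing flag belongs to transformers' TrainingArguments and is an assumption here, and all values below are illustrative:

# Hypothetical lower-memory variant: trade compute for memory with gradient
# checkpointing and shorter sequences; values are examples, not recommendations.
python -m FlagEmbedding.baai_general_embedding.finetune.run \
  --output_dir bge-large-zh-finetune \
  --model_name_or_path bge-large-zh \
  --train_data examples/finetune/toy_finetune_data.jsonl \
  --learning_rate 1e-5 --fp16 --num_train_epochs 1 \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing \
  --dataloader_drop_last True --normlized True --temperature 0.02 \
  --query_max_len 32 --passage_max_len 128 --train_group_size 2 \
  --logging_steps 10 --save_steps 1000 \
  --query_instruction_for_retrieval ""

If 4 GB is still not enough, switching to the smaller bge-base-zh or bge-small-zh checkpoints is the more reliable option.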