FlagOpen / FlagEmbedding

Retrieval and Retrieval-augmented LLMs
MIT License

Fine-tuning command problem!!! #836

Closed · lower01 closed 3 months ago

lower01 commented 3 months ago

How exactly am I supposed to run this fine-tuning command? I have tried countless times in the command line, but the torchrun command cannot be found or recognized. My torch version is 1.8, which I thought was fine. I have read a lot of material: torchrun is apparently used for distributed training, and I would also need to install the NCCL library from the NVIDIA website, but installing NCCL locally on Windows is a real hassle, and I am not even sure that is the actual problem. Can I just skip distributed training? I only have one GPU, and the torchrun command simply is not recognized.

Could someone please point me in the right direction? How exactly should I run this command? (screenshot attached)

staoxiao commented 3 months ago

Your torch version is too old. You can use python xxx instead of torchrun xxx.
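
For a single-GPU run the distributed launcher can be skipped entirely and the training module started with the plain Python interpreter. A minimal sketch of the substitution (torchrun only ships with torch 1.10 and later, which is why torch 1.8 cannot find it; <args> stands in for the full argument list shown later in this thread):

# Distributed launcher; fails on torch 1.8 because torchrun does not exist yet:
# torchrun --nproc_per_node 1 -m FlagEmbedding.baai_general_embedding.finetune.run <args>

# Single-GPU equivalent; no launcher, no NCCL, same script and arguments:
python -m FlagEmbedding.baai_general_embedding.finetune.run <args>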

lower01 commented 3 months ago

Your torch version is too old. You can use python xxx instead of torchrun xxx.

Thank you. I have one more question, about the error ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 24584) of binary: F:\anaconda3\envs\RAG\python.exe. What causes it, and how can I fix it? I am on Windows with a single GPU, so I do not actually need distributed training. The relevant part of the error output is as follows:

  File "F:\anaconda3\envs\RAG\lib\site-packages\transformers\hf_argparser.py", line 339, in parse_args_into_dataclasses
    obj = dtype(**inputs)
  File "<string>", line 134, in __init__
  File "F:\anaconda3\envs\RAG\lib\site-packages\transformers\training_args.py", line 1641, in __post_init__
    and (self.device.type == "cpu" and not is_torch_greater_or_equal_than_2_3)
  File "F:\anaconda3\envs\RAG\lib\site-packages\transformers\training_args.py", line 2149, in device
    return self._setup_devices
  File "F:\anaconda3\envs\RAG\lib\site-packages\transformers\utils\generic.py", line 59, in __get__
    cached = self.fget(obj)
  File "F:\anaconda3\envs\RAG\lib\site-packages\transformers\training_args.py", line 2081, in _setup_devices
    self.distributed_state = PartialState(
  File "F:\anaconda3\envs\RAG\lib\site-packages\accelerate\state.py", line 192, in __init__
    torch.distributed.init_process_group(backend=self.backend, **kwargs)
  File "F:\anaconda3\envs\RAG\lib\site-packages\torch\distributed\distributed_c10d.py", line 583, in init_process_group
    default_pg = _new_process_group_helper(
  File "F:\anaconda3\envs\RAG\lib\site-packages\torch\distributed\distributed_c10d.py", line 708, in _new_process_group_helper
    raise RuntimeError("Distributed package doesn't have NCCL " "built in")
RuntimeError: Distributed package doesn't have NCCL built in
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 24584) of binary: F:\anaconda3\envs\RAG\python.exe
Traceback (most recent call last):
  File "F:\anaconda3\envs\RAG\lib\runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "F:\anaconda3\envs\RAG\lib\runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "F:\anaconda3\envs\RAG\Scripts\torchrun.exe\__main__.py", line 7, in <module>
  File "F:\anaconda3\envs\RAG\lib\site-packages\torch\distributed\elastic\multiprocessing\errors\__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "F:\anaconda3\envs\RAG\lib\site-packages\torch\distributed\run.py", line 719, in main
    run(args)
  File "F:\anaconda3\envs\RAG\lib\site-packages\torch\distributed\run.py", line 710, in run
    elastic_launch(
  File "F:\anaconda3\envs\RAG\lib\site-packages\torch\distributed\launcher\api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "F:\anaconda3\envs\RAG\lib\site-packages\torch\distributed\launcher\api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

FlagEmbedding.baai_general_embedding.finetune.run FAILED

Failures:

------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-05-30_15:07:39
  host      : MSI
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 24584)
  error_file:
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

I hope you can help me with this.
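
The decisive line is RuntimeError: Distributed package doesn't have NCCL built in. Windows builds of PyTorch are generally compiled without NCCL (gloo is the usual Windows backend), so any launcher that tries to create an NCCL process group fails there. A quick check of which backends the local build actually supports, assuming the RAG environment is active:

# Print which torch.distributed backends this PyTorch build supports;
# on Windows, nccl is normally False and gloo is True.
python -c "import torch.distributed as dist; print('nccl:', dist.is_nccl_available(), 'gloo:', dist.is_gloo_available())"
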
staoxiao commented 3 months ago

@lower01 can you show me the command you used?

lower01 commented 3 months ago

@lower01 can you show me the command you used?

Sure, here is the command I used:

torchrun --nproc_per_node 1 -m FlagEmbedding.baai_general_embedding.finetune.run --output_dir bge-large-zh-finetune --model_name_or_path bge-large-zh --train_data examples/finetune/toy_finetune_data.jsonl --learning_rate 1e-5 --fp16 --num_train_epochs 1 --per_device_train_batch_size 1 --dataloader_drop_last True --normlized True --temperature 0.02 --query_max_len 64 --passage_max_len 256 --train_group_size 2 --negatives_cross_device --logging_steps 10 --save_steps 1000 --query_instruction_for_retrieval ""

lower01 commented 3 months ago

@lower01 can you show me the command you used?

Hello, I sent you my command yesterday. Is there anything wrong with it? As for the error mentioned yesterday, RuntimeError: Distributed package doesn't have NCCL built in, followed by ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 24584) of binary: F:\anaconda3\envs\RAG\python.exe, is it caused by NCCL not being installed on my machine? It seems NCCL cannot be installed on Windows. Or is it some other problem?

staoxiao commented 3 months ago

Is there still an error when using python?

python -m FlagEmbedding.baai_general_embedding.finetune.run \
--output_dir bge-large-zh-finetune \
--model_name_or_path bge-large-zh \
--train_data examples/finetune/toy_finetune_data.jsonl \
--learning_rate 1e-5 \
--fp16 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--dataloader_drop_last True \
--normlized True \
--temperature 0.02 \
--query_max_len 64 \
--passage_max_len 256 \
--train_group_size 2 \
--negatives_cross_device \
--logging_steps 10 \
--save_steps 1000 \
--query_instruction_for_retrieval ""

lower01 commented 3 months ago

python -m FlagEmbedding.baai_general_embedding.finetune.run --output_dir bge-large-zh-finetune --model_name_or_path bge-large-zh --train_data examples/finetune/toy_finetune_data.jsonl --learning_rate 1e-5 --fp16 --num_train_epochs 1 --per_device_train_batch_size 1 --dataloader_drop_last True --normlized True --temperature 0.02 --query_max_len 64 --passage_max_len 256 --train_group_size 2 --negatives_cross_device --logging_steps 10 --save_steps 1000 --query_instruction_for_retrieval ""

After switching to python, there is indeed no longer a problem executing the command, but I still get an error related to distributed training. (screenshot attached)

staoxiao commented 3 months ago

You can remove --negatives_cross_device, which requires initializing a distributed environment.

python -m FlagEmbedding.baai_general_embedding.finetune.run \
--output_dir bge-large-zh-finetune \
--model_name_or_path bge-large-zh \
--train_data examples/finetune/toy_finetune_data.jsonl \
--learning_rate 1e-5 \
--fp16 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--dataloader_drop_last True \
--normlized True \
--temperature 0.02 \
--query_max_len 64 \
--passage_max_len 256 \
--train_group_size 2 \
--logging_steps 10 \
--save_steps 1000 \
--query_instruction_for_retrieval ""

Besides, I think per_device_train_batch_size is too small, which might hurt performance.
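
If GPU memory is what keeps per_device_train_batch_size at 1, gradient accumulation is a common workaround; the --gradient_accumulation_steps flag comes from the HuggingFace TrainingArguments that this training script builds on, not from this thread, and the value below is only an illustration:

# Sketch: accumulate gradients over 8 steps so the optimizer sees an
# effective batch of 1 x 8 = 8 without needing extra GPU memory.
python -m FlagEmbedding.baai_general_embedding.finetune.run \
  --output_dir bge-large-zh-finetune \
  --model_name_or_path bge-large-zh \
  --train_data examples/finetune/toy_finetune_data.jsonl \
  --learning_rate 1e-5 --fp16 --num_train_epochs 1 \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 8 \
  --dataloader_drop_last True --normlized True --temperature 0.02 \
  --query_max_len 64 --passage_max_len 256 --train_group_size 2 \
  --logging_steps 10 --save_steps 1000 \
  --query_instruction_for_retrieval ""

Note that accumulation only enlarges the optimizer batch; with contrastive training the in-batch negatives still come from each micro-batch, so it is not a full substitute for a genuinely larger per_device_train_batch_size.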

lower01 commented 3 months ago

You can remove --negatives_cross_device, which requires initializing a distributed environment.

python -m FlagEmbedding.baai_general_embedding.finetune.run \
--output_dir bge-large-zh-finetune \
--model_name_or_path bge-large-zh \
--train_data examples/finetune/toy_finetune_data.jsonl \
--learning_rate 1e-5 \
--fp16 \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--dataloader_drop_last True \
--normlized True \
--temperature 0.02 \
--query_max_len 64 \
--passage_max_len 256 \
--train_group_size 2 \
--logging_steps 10 \
--save_steps 1000 \
--query_instruction_for_retrieval ""

Besides, I think per_device_train_batch_size is too small, which might hurt performance.

OK, thank you. The corresponding data now loads correctly, but I get a TypeError as follows: (screenshot attached)

I cannot really pinpoint the problem.

staoxiao commented 3 months ago

You can try changing the version of the transformers package, e.g. to 4.37: pip install transformers==4.37
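
To confirm which version the RAG environment actually imports before and after the change:

# Print the currently installed transformers version in the active environment.
python -c "import transformers; print(transformers.__version__)"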

lower01 commented 3 months ago

You can try changing the version of the transformers package, e.g. to 4.37: pip install transformers==4.37

I downgraded transformers from 4.41 to 4.37, and now three errors appear when running: (screenshots attached)

Regarding the third error, my local model weight files should be fine, because I previously tested loading the model to embed a sentence. I looked up some material, but the explanations vary and I have not found a good solution.

lower01 commented 3 months ago

You can try changing the version of the transformers package, e.g. to 4.37: pip install transformers==4.37

Hello, I found the problem: transformers should be changed to 4.37.2. After making that change and trying again, neither the three errors above nor the earlier error appears. Thank you.
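
For anyone landing on the same errors, the exact pin reported to work above can be installed like this:

# Pin the patch release reported above to resolve both sets of errors.
pip install transformers==4.37.2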

lower01 commented 3 months ago

You can try changing the version of the transformers package, e.g. to 4.37: pip install transformers==4.37

Hello, one more question: how much GPU memory is normally needed to fine-tune bge-large-zh? With --per_device_train_batch_size 1, my 4 GB of VRAM cannot support the fine-tuning.
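
As a hedged sketch only, not something confirmed in this thread: bge-large-zh is a BERT-large-scale model (roughly 330M parameters), so full fine-tuning usually needs well over 4 GB even at batch size 1. The usual levers with a HuggingFace-Trainer-based script like this one are gradient checkpointing and shorter sequence lengths; the --gradient_checkpointing flag belongs to transformers' TrainingArguments and is an assumption here, and all values below are illustrative:

# Hypothetical lower-memory variant: trade compute for memory with gradient
# checkpointing and shorter sequences; values are examples, not recommendations.
python -m FlagEmbedding.baai_general_embedding.finetune.run \
  --output_dir bge-large-zh-finetune \
  --model_name_or_path bge-large-zh \
  --train_data examples/finetune/toy_finetune_data.jsonl \
  --learning_rate 1e-5 --fp16 --num_train_epochs 1 \
  --per_device_train_batch_size 1 \
  --gradient_checkpointing \
  --dataloader_drop_last True --normlized True --temperature 0.02 \
  --query_max_len 32 --passage_max_len 128 --train_group_size 2 \
  --logging_steps 10 --save_steps 1000 \
  --query_instruction_for_retrieval ""

If 4 GB is still not enough, switching to the smaller bge-base-zh or bge-small-zh checkpoints is the more reliable option.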