The version of your torch is too low. You can use python xxx instead of torchrun xxx.
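For reference, the torchrun entry point only ships with newer torch releases (it replaced torch.distributed.launch around torch 1.10, if I recall the release right), so on torch 1.8 the command simply does not exist. You can check your version with:

python -c "import torch; print(torch.__version__)"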
@lower01 can you show me the command you used?
OK, here is the command I used:
torchrun --nproc_per_node 1 \
    -m FlagEmbedding.baai_general_embedding.finetune.run \
    --output_dir bge-large-zh-finetune \
    --model_name_or_path bge-large-zh \
    --train_data examples/finetune/toy_finetune_data.jsonl \
    --learning_rate 1e-5 \
    --fp16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --dataloader_drop_last True \
    --normlized True \
    --temperature 0.02 \
    --query_max_len 64 \
    --passage_max_len 256 \
    --train_group_size 2 \
    --negatives_cross_device \
    --logging_steps 10 \
    --save_steps 1000 \
    --query_instruction_for_retrieval ""
Hello, I sent you the command yesterday; is there anything wrong with it? As for the error I mentioned yesterday, RuntimeError: Distributed package doesn't have NCCL built in, followed by ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 24584) of binary: F:\anaconda3\envs\RAG\python.exe: is it because NCCL is not installed on my machine? It seems NCCL cannot be installed on Windows, though. Or is it some other problem?
Is there still an error when using python?
python -m FlagEmbedding.baai_general_embedding.finetune.run \
    --output_dir bge-large-zh-finetune \
    --model_name_or_path bge-large-zh \
    --train_data examples/finetune/toy_finetune_data.jsonl \
    --learning_rate 1e-5 \
    --fp16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --dataloader_drop_last True \
    --normlized True \
    --temperature 0.02 \
    --query_max_len 64 \
    --passage_max_len 256 \
    --train_group_size 2 \
    --negatives_cross_device \
    --logging_steps 10 \
    --save_steps 1000 \
    --query_instruction_for_retrieval ""
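On the NCCL question: NCCL ships only with Linux builds of torch, so a Windows build can never use it, and a single-GPU, single-process run needs no distributed backend at all. You can check which backends your build supports:

import torch.distributed as dist

# NCCL is typically unavailable in Windows builds of torch;
# gloo is the usual fallback backend there.
print("nccl:", dist.is_nccl_available())
print("gloo:", dist.is_gloo_available())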
After switching to python, there indeed no longer seem to be any problems running the command, but there are still errors related to distributed training.
You can delete negatives_cross_device, which needs to initialize a distributed environment.
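For context, that flag collects in-batch negatives from every process with an all_gather, which only works after init_process_group has been called; in a plain single-process python run it fails in exactly this way. A simplified sketch of the pattern (not the exact FlagEmbedding code):

import torch
import torch.distributed as dist

def gather_negatives(embeddings: torch.Tensor) -> torch.Tensor:
    # Valid only after dist.init_process_group(...) has run, which is
    # why --negatives_cross_device breaks in a single-process run.
    gathered = [torch.empty_like(embeddings) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, embeddings)
    return torch.cat(gathered, dim=0)

Without the flag, the command becomes: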
python -m FlagEmbedding.baai_general_embedding.finetune.run \
    --output_dir bge-large-zh-finetune \
    --model_name_or_path bge-large-zh \
    --train_data examples/finetune/toy_finetune_data.jsonl \
    --learning_rate 1e-5 \
    --fp16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --dataloader_drop_last True \
    --normlized True \
    --temperature 0.02 \
    --query_max_len 64 \
    --passage_max_len 256 \
    --train_group_size 2 \
    --logging_steps 10 \
    --save_steps 1000 \
    --query_instruction_for_retrieval ""
Besides, I think the per_device_train_batch_size is too small, which might hurt performance.
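If memory is what forces the small batch, gradient accumulation is the usual workaround. The finetune script's arguments extend the Hugging Face TrainingArguments, so it should also accept --gradient_accumulation_steps (an assumption worth verifying against your FlagEmbedding version); for an effective batch size of 8:

python -m FlagEmbedding.baai_general_embedding.finetune.run \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    ... (remaining flags as above)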
OK, thank you. The corresponding data can now be loaded, but I am getting a TypeError as follows:
I cannot pinpoint the problem.
You can try changing the version of the transformers package, e.g., to 4.37:
pip install transformers==4.37
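To confirm which version is actually active in your environment afterwards:

python -c "import transformers; print(transformers.__version__)"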
I changed my transformers from 4.41 to 4.37, and running it now produces 3 errors:
As for the third one, my local model weight files should be fine, because I previously loaded this model to embed a sentence without problems. I looked it up, but the explanations vary a lot and none of them really solved it.
Hello, I found the problem: transformers should be changed to 4.37.2. After that change, neither the 3 errors above nor the earlier error appears. Thank you!
Hello, one more question: how much GPU memory does fine-tuning bge-large-zh normally need? Even with --per_device_train_batch_size 1, my 4 GB of VRAM cannot handle the fine-tuning.
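A rough back-of-envelope, assuming mixed-precision AdamW (fp16 weights and gradients, an fp32 master copy, and two fp32 Adam moments, roughly 16 bytes per parameter) and taking bge-large-zh at about 326M parameters:

# rough weight + optimizer memory for mixed-precision AdamW fine-tuning
params = 326e6                       # bge-large-zh parameter count (assumption)
bytes_per_param = 2 + 2 + 4 + 4 + 4  # fp16 weights + fp16 grads + fp32 master + 2 Adam moments
print(f"{params * bytes_per_param / 2**30:.1f} GiB")  # ~4.9 GiB before activations

So the model plus optimizer state alone already exceeds 4 GB before any activations, which is consistent with what you are seeing; in practice bge-large is usually fine-tuned on GPUs with considerably more memory, or one drops down to bge-base or bge-small.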
How is this fine-tuning command actually supposed to be run? I have tried countless times on the command line, and the torchrun command is never recognized or found. My torch version is 1.8, which should be fine. I have read a lot of material: torchrun is for distributed training, and supposedly you need to install the NCCL library from the NVIDIA website, but installing NCCL locally on Windows is a real hassle, and I am not even sure that is the problem. Can I just skip distributed training? I only have one GPU, and that whole torchrun command simply cannot be recognized.
Can anyone please explain how to actually run this command?