OpenMOSS / CoLLiE

Collaborative Training of Large Language Models in an Efficient Way
https://openlmlab-collie.readthedocs.io
Apache License 2.0

Add support for the Qwen model #153

Closed Jieni05 closed 8 months ago

Jieni05 commented 8 months ago

Thank you very much for your work. If I want to fine-tune the Qwen-14B model, how should I do it? Does the framework currently have any support for Qwen models?

KaiLv69 commented 8 months ago

Hi, the current version does not yet include its own Qwen implementation; support is planned for later. For now you can directly use the model from transformers together with ZeRO-3; see this part of the tutorial: https://openlmlab-collie.readthedocs.io/zh-cn/latest/tutorials/collie-tutorial-2-feature.html#2.2-%E2%80%82-CoLLiE-%E7%9A%84-%E5%85%BC%E5%AE%B9%E6%98%93%E7%94%A8
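To make the "transformers model + ZeRO-3" route concrete, a minimal ZeRO stage-3 DeepSpeed config could look like the following. The values are illustrative assumptions, not the tutorial's exact settings; the keys follow DeepSpeed's documented JSON schema.

```python
# Minimal ZeRO stage-3 DeepSpeed config (illustrative values; tune batch size,
# precision, and accumulation steps for your own hardware and model).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,            # shard parameters, gradients, and optimizer states
        "overlap_comm": True,  # overlap communication with computation
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
```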

Jieni05 commented 8 months ago

Got it. If I load the model via transformers, can I still train with optimizers like LOMO from CollieTrainer? And is the main difference between the transformers model and CoLLiE's rewritten version that TP and PP parallelism are not supported?

KaiLv69 commented 8 months ago

LOMO works; and yes, that's the main difference.
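For readers unfamiliar with LOMO: its core idea is to fuse gradient computation with the parameter update, so the full set of gradients is never materialized at once. The sketch below is purely conceptual (plain Python, not CoLLiE's actual implementation, and a plain SGD-style rule rather than the real algorithm); it only illustrates the "update each parameter as soon as its gradient arrives" pattern.

```python
def lomo_style_step(params, grad_fn, lr=0.01):
    """Update each parameter in place as soon as its gradient is computed.

    params  : list of floats standing in for parameter tensors
    grad_fn : callable returning the gradient for one parameter index
    """
    for i in range(len(params)):
        g = grad_fn(i)       # gradient for this parameter only
        params[i] -= lr * g  # fused update; g can be discarded immediately

# With gradient equal to the parameter value, each parameter shrinks by (1 - lr).
params = [1.0, 2.0, 3.0]
lomo_style_step(params, grad_fn=lambda i: params[i], lr=0.1)
```

Because only one gradient is alive at a time, memory for optimizer state stays near zero, which is what makes LOMO attractive for fine-tuning large models on limited GPUs.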

Jieni05 commented 8 months ago

> LOMO works; and yes, that's the main difference.

While trying to fine-tune the Qwen-14B model with AdaLomo (model loaded via transformers), I ran into the following problem: AttributeError: 'Parameter' object has no attribute 'ds_shape'.

How should I adjust the code to fix this?

KaiLv69 commented 8 months ago

Hi, this code forgot to handle the case where ZeRO-3 is not used. I've pushed a fix on the dev branch:

https://github.com/OpenLMLab/collie/commit/9fbe590ae9368c407bc9ac36e209dfe84fae16be

Cheung-Z commented 6 months ago

> While trying to fine-tune the Qwen-14B model with AdaLomo (model loaded via transformers), I ran into the following problem: AttributeError: 'Parameter' object has no attribute 'ds_shape'.
>
> How should I adjust the code to fix this?

Did you manage to get it running on the Qwen model?

KaiLv69 commented 6 months ago

@ShawnChang-ei Hi, did you run into a new problem?

Jieni05 commented 6 months ago

> Did you manage to get it running on the Qwen model?

I got it running initially, but there are still some other unresolved issues, and I haven't had time to keep trying. If you want to fine-tune with LOMO/AdaLomo, you could also look at their recent work on integrating it into transformers: https://github.com/huggingface/transformers/issues/29649

Cheung-Z commented 6 months ago

> @ShawnChang-ei Hi, did you run into a new problem?

Thanks for the reply!

The modified code:

```python
config = CollieConfig.from_pretrained(pretrained_model, trust_remote_code=True)
setup_distribution(config)

tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model,
    model_max_length=512,
    padding_side="right",
    use_fast=False,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(pretrained_model, device_map=None)
```

Launch command:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=1 --nproc_per_node=4 finetune_qwen.py
```

Error:

```
Traceback (most recent call last):
  File "/workspace/zhangxu/CoLLiE/finetune_qwen.py", line 99, in <module>
    trainer = Trainer(
  File "/workspace/zhangxu/CoLLiE/collie/controller/trainer.py", line 192, in __init__
    self.setup_parallel_model()
  File "/workspace/zhangxu/CoLLiE/collie/controller/trainer.py", line 284, in setup_parallel_model
    self.engine, _, _, _ = setup_ds_engine(
  File "/workspace/zhangxu/CoLLiE/collie/utils/dist_utils.py", line 120, in setup_ds_engine
    assert isinstance(
AssertionError: Currently pipeline or tensor parallelism only supports Collie models.
```
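The assertion that fires here is a model-type guard in setup_ds_engine. The sketch below is a hypothetical reconstruction of that check (the class and function names are stand-ins modeled on the traceback, not the actual code in collie/utils/dist_utils.py): TP or PP greater than 1 requires a CoLLiE-rewritten model, while an HF AutoModel may only use data parallelism plus ZeRO.

```python
class CollieModelForCausalLM:
    """Stand-in for CoLLiE's own model base class (hypothetical sketch)."""
    pass

def check_parallel_support(model, tp_size=1, pp_size=1):
    """Raise if tensor/pipeline parallelism is requested for a non-CoLLiE model.

    Illustrates the guard behind the AssertionError above: HF AutoModel
    instances pass only when tp_size == pp_size == 1.
    """
    if tp_size > 1 or pp_size > 1:
        assert isinstance(model, CollieModelForCausalLM), (
            "Currently pipeline or tensor parallelism only supports Collie models."
        )
```

So with four GPUs and a transformers-loaded Qwen model, the config must keep tp_size and pp_size at 1 for this check to pass.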

Cheung-Z commented 6 months ago

> @ShawnChang-ei Hi, did you run into a new problem?

Also, there are some warnings. Which PyTorch version did you actually use?

```
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
```

KaiLv69 commented 6 months ago

> AssertionError: Currently pipeline or tensor parallelism only supports Collie models.

The AutoModel from Hugging Face does not support TP or PP yet; you can only use DP + ZeRO.
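Concretely, a working setup for an HF AutoModel keeps TP and PP at 1 and scales out with DP + ZeRO-3. The sketch below shows the relevant settings; the field names mirror what CollieConfig commonly exposes and should be treated as assumptions to verify against the tutorial.

```python
# Parallelism settings that work with an HF AutoModel in CoLLiE (sketch; the
# field names should be checked against the CollieConfig documentation).
parallel_settings = {
    "dp_size": 4,  # data parallel across the 4 visible GPUs
    "tp_size": 1,  # tensor parallelism needs a CoLLiE-rewritten model
    "pp_size": 1,  # pipeline parallelism needs a CoLLiE-rewritten model
    "ds_config": {"zero_optimization": {"stage": 3}},  # ZeRO-3 shards the model
}
```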