OpenMOSS / CoLLiE

Collaborative Training of Large Language Models in an Efficient Way
https://openlmlab-collie.readthedocs.io
Apache License 2.0

Add support for the Qwen model #153

Closed Jieni05 closed 8 months ago

Jieni05 commented 8 months ago

Thank you very much for your work. If I want to fine-tune the Qwen-14B model, how should I do it? Does the framework currently have any support for Qwen models?

KaiLv69 commented 8 months ago

Hi, the current version does not yet include its own Qwen implementation; support is planned for later. For now you can directly use the model from transformers together with ZeRO-3; see this part of the tutorial: https://openlmlab-collie.readthedocs.io/zh-cn/latest/tutorials/collie-tutorial-2-feature.html#2.2-%E2%80%82-CoLLiE-%E7%9A%84-%E5%85%BC%E5%AE%B9%E6%98%93%E7%94%A8
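To make the "transformers model + ZeRO-3" route concrete, a minimal ZeRO stage-3 DeepSpeed config could look like the following. The values are illustrative assumptions, not the tutorial's exact settings; the keys follow DeepSpeed's documented JSON schema.

```python
# Minimal ZeRO stage-3 DeepSpeed config (illustrative values; tune batch size,
# precision, and accumulation steps for your own hardware and model).
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 3,            # shard parameters, gradients, and optimizer states
        "overlap_comm": True,  # overlap communication with computation
        "stage3_gather_16bit_weights_on_model_save": True,
    },
}
```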

Jieni05 commented 8 months ago

Got it. If I load the model via transformers, can I still train with optimizers like LOMO from CollieTrainer? And is the main difference between the transformers model and CoLLiE's rewritten version that TP and PP parallelism are not supported?

KaiLv69 commented 8 months ago

LOMO works; and yes, that's the main difference.
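For readers unfamiliar with LOMO: its core idea is to fuse gradient computation with the parameter update, so the full set of gradients is never materialized at once. The sketch below is purely conceptual (plain Python, not CoLLiE's actual implementation, and a plain SGD-style rule rather than the real algorithm); it only illustrates the "update each parameter as soon as its gradient arrives" pattern.

```python
def lomo_style_step(params, grad_fn, lr=0.01):
    """Update each parameter in place as soon as its gradient is computed.

    params  : list of floats standing in for parameter tensors
    grad_fn : callable returning the gradient for one parameter index
    """
    for i in range(len(params)):
        g = grad_fn(i)       # gradient for this parameter only
        params[i] -= lr * g  # fused update; g can be discarded immediately

# With gradient equal to the parameter value, each parameter shrinks by (1 - lr).
params = [1.0, 2.0, 3.0]
lomo_style_step(params, grad_fn=lambda i: params[i], lr=0.1)
```

Because only one gradient is alive at a time, memory for optimizer state stays near zero, which is what makes LOMO attractive for fine-tuning large models on limited GPUs.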

Jieni05 commented 8 months ago

> LOMO works; and yes, that's the main difference.

While trying to fine-tune the Qwen-14B model with AdaLomo (model loaded via transformers), I ran into the following problem: AttributeError: 'Parameter' object has no attribute 'ds_shape'.

How should I adjust the code to fix this?

KaiLv69 commented 8 months ago

Hi, this code forgot to handle the case where ZeRO-3 is not used. I've pushed a fix on the dev branch:

https://github.com/OpenLMLab/collie/commit/9fbe590ae9368c407bc9ac36e209dfe84fae16be

Cheung-Z commented 6 months ago

> While trying to fine-tune the Qwen-14B model with AdaLomo (model loaded via transformers), I ran into the following problem: AttributeError: 'Parameter' object has no attribute 'ds_shape'.
>
> How should I adjust the code to fix this?

Did you manage to get it running on the Qwen model?

KaiLv69 commented 6 months ago

@ShawnChang-ei Hi, did you run into a new problem?

Jieni05 commented 6 months ago

> Did you manage to get it running on the Qwen model?

I got it running initially, but there are still some other unresolved issues, and I haven't had time to keep trying. If you want to fine-tune with LOMO/AdaLomo, you could also look at their recent work on integrating it into transformers: https://github.com/huggingface/transformers/issues/29649

Cheung-Z commented 6 months ago

> @ShawnChang-ei Hi, did you run into a new problem?

Thanks for the reply!

The modified code:

```python
config = CollieConfig.from_pretrained(pretrained_model, trust_remote_code=True)
setup_distribution(config)

tokenizer = AutoTokenizer.from_pretrained(
    pretrained_model,
    model_max_length=512,
    padding_side="right",
    use_fast=False,
    trust_remote_code=True,
)
model = AutoModelForCausalLM.from_pretrained(pretrained_model, device_map=None)
```

Launch command:

```
CUDA_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=1 --nproc_per_node=4 finetune_qwen.py
```

Error:

```
Traceback (most recent call last):
  File "/workspace/zhangxu/CoLLiE/finetune_qwen.py", line 99, in <module>
    trainer = Trainer(
  File "/workspace/zhangxu/CoLLiE/collie/controller/trainer.py", line 192, in __init__
    self.setup_parallel_model()
  File "/workspace/zhangxu/CoLLiE/collie/controller/trainer.py", line 284, in setup_parallel_model
    self.engine, _, _, _ = setup_ds_engine(
  File "/workspace/zhangxu/CoLLiE/collie/utils/dist_utils.py", line 120, in setup_ds_engine
    assert isinstance(
AssertionError: Currently pipeline or tensor parallelism only supports Collie models.
```
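The assertion that fires here is a model-type guard in setup_ds_engine. The sketch below is a hypothetical reconstruction of that check (the class and function names are stand-ins modeled on the traceback, not the actual code in collie/utils/dist_utils.py): TP or PP greater than 1 requires a CoLLiE-rewritten model, while an HF AutoModel may only use data parallelism plus ZeRO.

```python
class CollieModelForCausalLM:
    """Stand-in for CoLLiE's own model base class (hypothetical sketch)."""
    pass

def check_parallel_support(model, tp_size=1, pp_size=1):
    """Raise if tensor/pipeline parallelism is requested for a non-CoLLiE model.

    Illustrates the guard behind the AssertionError above: HF AutoModel
    instances pass only when tp_size == pp_size == 1.
    """
    if tp_size > 1 or pp_size > 1:
        assert isinstance(model, CollieModelForCausalLM), (
            "Currently pipeline or tensor parallelism only supports Collie models."
        )
```

So with four GPUs and a transformers-loaded Qwen model, the config must keep tp_size and pp_size at 1 for this check to pass.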

Cheung-Z commented 6 months ago

> @ShawnChang-ei Hi, did you run into a new problem?

Also, there are some warnings. Which PyTorch version did you actually use?

```
[WARNING] async_io: please install the libaio-dev package with apt
[WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
[WARNING] Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
[WARNING] async_io requires the dev libaio .so object and headers but these were not found.
[WARNING] NVIDIA Inference is only supported on Ampere and newer architectures
[WARNING] sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.2
[WARNING] using untested triton version (2.2.0), only 1.0.0 is known to be compatible
```

KaiLv69 commented 6 months ago

> AssertionError: Currently pipeline or tensor parallelism only supports Collie models.

The AutoModel from Hugging Face does not support TP or PP yet; you can only use DP + ZeRO.
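Concretely, a working setup for an HF AutoModel keeps TP and PP at 1 and scales out with DP + ZeRO-3. The sketch below shows the relevant settings; the field names mirror what CollieConfig commonly exposes and should be treated as assumptions to verify against the tutorial.

```python
# Parallelism settings that work with an HF AutoModel in CoLLiE (sketch; the
# field names should be checked against the CollieConfig documentation).
parallel_settings = {
    "dp_size": 4,  # data parallel across the 4 visible GPUs
    "tp_size": 1,  # tensor parallelism needs a CoLLiE-rewritten model
    "pp_size": 1,  # pipeline parallelism needs a CoLLiE-rewritten model
    "ds_config": {"zero_optimization": {"stage": 3}},  # ZeRO-3 shards the model
}
```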