THUDM / SwissArmyTransformer

SwissArmyTransformer is a flexible and powerful library to develop your own Transformer variants.
https://THUDM.github.io/SwissArmyTransformer
Apache License 2.0

TypeError: sat.model.transformer.BaseTransformer() got multiple values for keyword argument 'parallel_output' #179

Open · deep-practice opened this issue 1 month ago

deep-practice commented 1 month ago

I get this error when loading the VisualGLM model:

    For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
    Traceback (most recent call last):
      File "/root/TransGPT/multi_modal/hf_infer.py", line 3, in <module>
        model = AutoModel.from_pretrained("THUDM/visualglm-6b", trust_remote_code=True).half().cuda()
      File "/root/.conda/envs/demo/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py", line 558, in from_pretrained
        return model_class.from_pretrained(
      File "/root/.conda/envs/demo/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2966, in from_pretrained
        model = cls(config, *model_args, **model_kwargs)
      File "/root/.cache/huggingface/modules/transformers_modules/THUDM/visualglm-6b/f4f759acde0926fefcd35e2c626e08adb452eff8/modeling_chatglm.py", line 1345, in __init__
        self.image_encoder = BLIP2(config.eva_config, config.qformer_config)
      File "/root/.cache/huggingface/modules/transformers_modules/THUDM/visualglm-6b/f4f759acde0926fefcd35e2c626e08adb452eff8/visual.py", line 59, in __init__
        self.vit = EVAViT(EVAViT.get_args(**eva_args))
      File "/root/.cache/huggingface/modules/transformers_modules/THUDM/visualglm-6b/f4f759acde0926fefcd35e2c626e08adb452eff8/visual.py", line 20, in __init__
        super().__init__(args, transformer=transformer, parallel_output=parallel_output, **kwargs)
      File "/root/.conda/envs/demo/lib/python3.10/site-packages/sat/model/official/vit_model.py", line 111, in __init__
        super().__init__(args, transformer=transformer, **kwargs)
      File "/root/.conda/envs/demo/lib/python3.10/site-packages/sat/model/base_model.py", line 93, in __init__
        self.transformer = BaseTransformer(
    TypeError: sat.model.transformer.BaseTransformer() got multiple values for keyword argument 'parallel_output'
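For reference, here is a minimal standalone reproduction of this kind of TypeError (a toy sketch, not the sat code): the same keyword reaches the callee once explicitly and once through **kwargs.

    # Toy reproduction: 'parallel_output' is passed twice,
    # once explicitly and once inside **kwargs.
    def build_transformer(parallel_output=False, **kwargs):
        return parallel_output

    kwargs = {'parallel_output': True}      # an upstream caller already put it in kwargs
    try:
        build_transformer(parallel_output=True, **kwargs)   # and it is passed explicitly too
    except TypeError as e:
        print(e)   # ... got multiple values for keyword argument 'parallel_output'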

BeiZhangChen commented 1 month ago

Hey, I'm hitting the same issue. Did you figure out how to deal with it?

1049451037 commented 1 month ago

Update code to the latest main branch. As you can see, the parallel_output argument has been deleted in VisualGLM:

https://github.com/THUDM/VisualGLM-6B/blob/7a277433740276d7abc2a71646050c03062ea9e4/model/visualglm.py#L30-L31
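The shape of the fix, as a sketch based on the linked lines (not the complete file): the subclass no longer accepts or forwards an explicit parallel_output, so BaseModel receives it at most once.

    # Sketch of model/visualglm.py after the fix (per the linked lines):
    class VisualGLMModel(ChatGLMModel):
        def __init__(self, args, transformer=None, **kwargs):
            # no explicit parallel_output argument any more
            super().__init__(args, transformer=transformer, **kwargs)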

corkiyao commented 2 weeks ago

Update code to the latest main branch. As you can see, the parallel_output argument has been deleted in VisualGLM:

https://github.com/THUDM/VisualGLM-6B/blob/7a277433740276d7abc2a71646050c03062ea9e4/model/visualglm.py#L30-L31

Hi, I'm using SwissArmyTransformer 0.4.12 and git-cloned the VisualGLM code within the last couple of days, but I still hit TypeError: type object got multiple values for keyword argument 'parallel_output'. The error log is below (see the second-to-last line):

    [2024-08-23 10:19:12,971] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
    [WARNING] async_io requires the dev libaio .so object and headers but these were not found.
    [WARNING] async_io: please install the libaio-dev package with apt
    [WARNING] If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
    [2024-08-23 10:19:14,750] [INFO] [launch.py:139:main] 0 NCCL_IB_DISABLE=0
    [2024-08-23 10:19:14,750] [INFO] [launch.py:139:main] 0 NCCL_DEBUG=info
    [2024-08-23 10:19:14,750] [INFO] [launch.py:139:main] 0 NCCL_NET_GDR_LEVEL=2
    [2024-08-23 10:19:14,750] [INFO] [launch.py:146:main] WORLD INFO DICT: {'localhost': [0, 1]}
    [2024-08-23 10:19:14,750] [INFO] [launch.py:152:main] nnodes=1, num_local_procs=2, node_rank=0
    [2024-08-23 10:19:14,750] [INFO] [launch.py:163:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
    [2024-08-23 10:19:14,750] [INFO] [launch.py:164:main] dist_world_size=2
    [2024-08-23 10:19:14,750] [INFO] [launch.py:168:main] Setting CUDA_VISIBLE_DEVICES=0,1
    [2024-08-23 10:19:20,026] [INFO] [RANK 0] > initializing model parallel with size 1
    [2024-08-23 10:19:20,028] [INFO] [comm.py:637:init_distributed] cdb=None
    [2024-08-23 10:19:20,030] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
    [2024-08-23 10:19:20,032] [INFO] [checkpointing.py:1049:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
    [2024-08-23 10:19:20,032] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
    [2024-08-23 10:19:20,033] [INFO] [RANK 0] building FineTuneVisualGLMModel model ...
    [2024-08-23 10:19:20,034] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
    Traceback (most recent call last):
      File "finetune_visualglm.py", line 179, in <module>
        model, args = FineTuneVisualGLMModel.from_pretrained(model_type, args, overwrite_args={'model_parallel_size': 1})
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/base_model.py", line 222, in from_pretrained
        model, model_args = cls.from_pretrained_base(name, args=args, home_path=home_path, url=url, prefix=prefix, build_only=True, overwrite_args=overwrite_args, **kwargs)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/base_model.py", line 210, in from_pretrained_base
        model = get_model(args, cls, **kwargs)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/base_model.py", line 421, in get_model
        model = model_cls(args, params_dtype=params_dtype, **kwargs)
      File "finetune_visualglm.py", line 13, in __init__
        super().__init__(args, transformer=transformer, parallel_output=parallel_output, **kw_args)
      File "/home/data/yaoyunze/visualglm2/VisualGLM-6B/model/visualglm.py", line 32, in __init__
        super().__init__(args, transformer=transformer, **kwargs)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/official/chatglm_model.py", line 167, in __init__
        super(ChatGLMModel, self).__init__(args, transformer=transformer, activation_func=gelu, **kwargs)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/base_model.py", line 93, in __init__
        self.transformer = BaseTransformer(
    TypeError: type object got multiple values for keyword argument 'parallel_output'
    [2024-08-23 10:19:21,788] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1845345
    [2024-08-23 10:19:21,825] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 1845346

1049451037 commented 2 weeks ago

Thanks for pointing this out; fixed:

https://github.com/THUDM/VisualGLM-6B/blob/f07e547e39a75bb51b63d2a8b955c3b8ae5a5e0d/finetune_visualglm.py#L13

corkiyao commented 2 weeks ago

Thanks for pointing this out; fixed:

https://github.com/THUDM/VisualGLM-6B/blob/f07e547e39a75bb51b63d2a8b955c3b8ae5a5e0d/finetune_visualglm.py#L13

Thanks. But when I fine-tune with QLoRA I still hit the same problem; something still seems wrong:

    [2024-08-23 11:50:40,331] [INFO] using world size: 1 and model-parallel size: 1
    [2024-08-23 11:50:40,332] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
    [2024-08-23 11:50:40,333] [INFO] [RANK 0] > initializing model parallel with size 1
    [2024-08-23 11:50:40,334] [INFO] [comm.py:637:init_distributed] cdb=None
    [2024-08-23 11:50:40,335] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
    [2024-08-23 11:50:40,336] [INFO] [checkpointing.py:1049:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
    [2024-08-23 11:50:40,336] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
    [2024-08-23 11:50:40,336] [INFO] [RANK 0] building FineTuneVisualGLMModel model ...
    [2024-08-23 11:50:40,337] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
    [2024-08-23 11:50:45,032] [INFO] [checkpointing.py:229:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
    Traceback (most recent call last):
      File "finetune_visualglm.py", line 179, in <module>
        model, args = FineTuneVisualGLMModel.from_pretrained(model_type, args, overwrite_args={'model_parallel_size': 1})
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/base_model.py", line 221, in from_pretrained
        model, model_args = cls.from_pretrained_base(name, args=args, home_path=home_path, url=url, prefix=prefix, build_only=True, overwrite_args=overwrite_args, **kwargs)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/base_model.py", line 209, in from_pretrained_base
        model = get_model(args, cls, **kwargs)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/base_model.py", line 420, in get_model
        model = model_cls(args, params_dtype=params_dtype, **kwargs)
      File "finetune_visualglm.py", line 13, in __init__
        super().__init__(args, transformer=transformer, **kw_args)
      File "/home/data/yaoyunze/visualglm2/VisualGLM-6B/model/visualglm.py", line 34, in __init__
        self.add_mixin("eva", ImageMixin(args))
      File "/home/data/yaoyunze/visualglm2/VisualGLM-6B/model/visualglm.py", line 18, in __init__
        self.model = BLIP2(args.eva_args, args.qformer_args)
      File "/home/data/yaoyunze/visualglm2/VisualGLM-6B/model/blip2.py", line 56, in __init__
        self.vit = EVAViT(EVAViT.get_args(**eva_args))
      File "/home/data/yaoyunze/visualglm2/VisualGLM-6B/model/blip2.py", line 21, in __init__
        super().__init__(args, transformer=transformer, parallel_output=parallel_output, **kwargs)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/official/vit_model.py", line 111, in __init__
        super().__init__(args, transformer=transformer, **kwargs)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/base_model.py", line 93, in __init__
        self.transformer = BaseTransformer(
    TypeError: type object got multiple values for keyword argument 'parallel_output'   <------ the error
    [2024-08-23 11:50:46,232] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 17713

corkiyao commented 2 weeks ago

Here is mine:

    class FineTuneVisualGLMModel(VisualGLMModel):
        def __init__(self, args, transformer=None, **kw_args):
            super().__init__(args, transformer=transformer, **kw_args)
            if args.use_ptuning:
                self.add_mixin("ptuning", PTuningV2Mixin(args.num_layers, args.hidden_size // args.num_attention_heads, args.num_attention_heads, args.pre_seq_len))
            if args.use_lora:
                self.add_mixin("lora", LoraMixin(args.num_layers, args.lora_rank, layer_range=args.layer_range), reinit=True)
                self.get_mixin("eva").model.glm_proj = replace_linear_with_lora(self.get_mixin("eva").model.glm_proj, LoraLinear, args.lora_rank)
            elif args.use_qlora:
                self.add_mixin("lora", LoraMixin(args.num_layers, args.lora_rank, layer_range=args.layer_range, qlora=True), reinit=True)
            self.args = args

        @classmethod
        def add_model_specific_args(cls, parser):
            group = parser.add_argument_group('VisualGLM-finetune', 'VisualGLM finetune Configurations')
            group.add_argument('--pre_seq_len', type=int, default=8)
            group.add_argument('--lora_rank', type=int, default=10)
            group.add_argument('--use_ptuning', action="store_true")
            group.add_argument('--use_lora', action="store_true")
            group.add_argument('--use_qlora', action="store_true")
            group.add_argument('--layer_range', nargs='+', type=int, default=None)
            return super().add_model_specific_args(parser)
1049451037 commented 2 weeks ago

Changed; try again.

corkiyao commented 2 weeks ago

I mean that I made the change too. Here is my modified code (no parallel_output any more):

    class FineTuneVisualGLMModel(VisualGLMModel):
        def __init__(self, args, transformer=None, **kw_args):   # <---- no parallel_output any more
            super().__init__(args, transformer=transformer, **kw_args)
            if args.use_ptuning:
                self.add_mixin("ptuning", PTuningV2Mixin(args.num_layers, args.hidden_size // args.num_attention_heads, args.num_attention_heads, args.pre_seq_len))
            if args.use_lora:
                self.add_mixin("lora", LoraMixin(args.num_layers, args.lora_rank, layer_range=args.layer_range), reinit=True)
                self.get_mixin("eva").model.glm_proj = replace_linear_with_lora(self.get_mixin("eva").model.glm_proj, LoraLinear, args.lora_rank)
            elif args.use_qlora:
                self.add_mixin("lora", LoraMixin(args.num_layers, args.lora_rank, layer_range=args.layer_range, qlora=True), reinit=True)
            self.args = args
1049451037 commented 2 weeks ago

Pull the latest code first and try again; your changes are incomplete.

corkiyao commented 2 weeks ago

Pull the latest code first and try again; your changes are incomplete.

OK, I'll pull it.

corkiyao commented 2 weeks ago

Pull the latest code first and try again; your changes are incomplete.

I updated the VisualGLM-6B code and re-cloned the latest SwissArmyTransformer, but I still hit a problem. What does this one mean?

    [2024-08-23 12:56:55,745] [INFO] [RANK 0] replacing layer 0 attention with lora
    Traceback (most recent call last):
      File "finetune_visualglm.py", line 178, in <module>
        model, args = FineTuneVisualGLMModel.from_pretrained(model_type, args)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/base_model.py", line 217, in from_pretrained
        return cls.from_pretrained_base(name, args=args, home_path=home_path, url=url, prefix=prefix, build_only=build_only, overwrite_args=overwrite_args, **kwargs)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/base_model.py", line 209, in from_pretrained_base
        model = get_model(args, cls, **kwargs)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/base_model.py", line 420, in get_model
        model = model_cls(args, params_dtype=params_dtype, **kwargs)
      File "finetune_visualglm.py", line 20, in __init__
        self.add_mixin("lora", LoraMixin(args.num_layers, args.lora_rank, layer_range=args.layer_range, qlora=True), reinit=True)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/base_model.py", line 123, in add_mixin
        new_mixin.reinit(self)  # also pass current mixins
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/finetune/lora2.py", line 206, in reinit
        parent_model.transformer.layers[i].attention.dense = replace_linear_with_lora(parent_model.transformer.layers[i].attention.dense, 1, self.r, self.lora_alpha, self.lora_dropout, qlora=self.qlora, in_size=parent_model.transformer.hidden_size, out_size=None)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/finetune/lora2.py", line 154, in replace_linear_with_lora
        new_layer = LoraLinear(original_cls, partition, in_dim, out_dim, r, *args, **kw_args, original_obj=lin)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/finetune/lora2.py", line 108, in __init__
        self.matrix_A = HackParameterList([nn.Parameter(torch.empty((r, original_obj.weight.shape[1]), dtype=dtype)) for _ in range(partition)])
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/finetune/lora2.py", line 108, in <listcomp>
        self.matrix_A = HackParameterList([nn.Parameter(torch.empty((r, original_obj.weight.shape[1]), dtype=dtype)) for _ in range(partition)])
    NameError: free variable 'dtype' referenced before assignment in enclosing scope   <------ the problem is here
    [2024-08-23 12:56:57,576] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 66497

I put the sat weights in the satckpt folder; here is the code in finetune_visualglm.py. But this shouldn't be the root cause of the dtype problem, should it?

    if __name__ == '__main__':
        py_parser = argparse.ArgumentParser(add_help=False)
        py_parser.add_argument('--max_source_length', type=int)
        py_parser.add_argument('--max_target_length', type=int)
        py_parser.add_argument('--ignore_pad_token_for_loss', type=bool, default=True)
        py_parser.add_argument('--old_checkpoint', action="store_true")
        py_parser.add_argument('--source_prefix', type=str, default="")
        py_parser = FineTuneVisualGLMModel.add_model_specific_args(py_parser)
        known, args_list = py_parser.parse_known_args()
        args = get_args(args_list)
        args = argparse.Namespace(**vars(args), **vars(known))
        args.device = 'cpu'

        model_type = 'satckpt'              # <-------- my change is here
        model, args = FineTuneVisualGLMModel.from_pretrained(model_type, args)
        if torch.cuda.is_available():
            model = model.to('cuda')
        tokenizer = get_tokenizer(args)
        label_pad_token_id = -100 if args.ignore_pad_token_for_loss else tokenizer.pad_token_id
        def data_collator(examples):
            for example in examples:
                example['input_ids'] = torch.tensor(example['input_ids'], dtype=torch.long)
                example['labels'] = torch.tensor(example['labels'], dtype=torch.long)
            ret = {
                'input_ids': torch.stack([example['input_ids'] for example in examples]),
                'labels': torch.stack([example['labels'] for example in examples]),
                'image': torch.stack([example['image'] for example in examples]),
                'pre_image': example['pre_image']
            }
1049451037 commented 2 weeks ago

I've updated sat; can you try again?

corkiyao commented 2 weeks ago

I've updated sat; can you try again?

I tried again, and the earlier problems are all solved. But now I've hit another one: No backend type associated with device type cpu. Is my CPU RAM insufficient? I have 64 GB, which should be enough for loading. Or is the GPU too small? The README says QLoRA fits in 10 GB of VRAM, and my single GPU has 11 GB.

    amax:102429:103953 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
    amax:102429:103953 [0] NCCL INFO Failed to open libibverbs.so[.1]
    amax:102429:103953 [0] NCCL INFO NET/Socket : Using [0]enp129s0f0:192.168.1.25<0>
    amax:102429:103953 [0] NCCL INFO Using network Socket
    amax:102429:103953 [0] NCCL INFO comm 0xd74cb40 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 4000 commId 0x729870e4af3743cd - Init START
    amax:102429:103953 [0] NCCL INFO Setting affinity for GPU 0 to 0f,ff000fff
    amax:102429:103953 [0] NCCL INFO Channel 00/32 : 0
    ...
    amax:102429:103953 [0] NCCL INFO Channel 31/32 : 0
    amax:102429:103953 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 ... [31] -1/-1/-1->0->-1
    amax:102429:103953 [0] NCCL INFO P2P Chunksize set to 131072
    amax:102429:103953 [0] NCCL INFO Connected all rings
    amax:102429:103953 [0] NCCL INFO Connected all trees
    amax:102429:103953 [0] NCCL INFO 32 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
    amax:102429:103953 [0] NCCL INFO comm 0xd74cb40 rank 0 nranks 1 cudaDev 0 nvmlDev 0 busId 4000 commId 0x729870e4af3743cd - Init COMPLETE
    transformer.layers.0.attention.query_key_value.matrix_A.0
    transformer.layers.0.attention.query_key_value.matrix_A.1
    transformer.layers.0.attention.query_key_value.matrix_A.2
    transformer.layers.0.attention.query_key_value.matrix_B.0
    transformer.layers.0.attention.query_key_value.matrix_B.1
    transformer.layers.0.attention.query_key_value.matrix_B.2
    transformer.layers.0.attention.dense.matrix_A.0
    transformer.layers.0.attention.dense.matrix_B.0
    transformer.layers.14.attention.query_key_value.matrix_A.0
    transformer.layers.14.attention.query_key_value.matrix_A.1
    transformer.layers.14.attention.query_key_value.matrix_A.2
    transformer.layers.14.attention.query_key_value.matrix_B.0
    transformer.layers.14.attention.query_key_value.matrix_B.1
    transformer.layers.14.attention.query_key_value.matrix_B.2
    transformer.layers.14.attention.dense.matrix_A.0
    transformer.layers.14.attention.dense.matrix_B.0
    [2024-08-23 13:44:20,101] [INFO] [RANK 0] [<class 'sat.ops.layernorm.LayerNorm'>, <class 'torch.nn.modules.normalization.LayerNorm'>, <class 'sat.ops.layernorm.RMSNorm'>] is set to no_weight_decay
    [2024-08-23 13:44:20,103] [INFO] [RANK 0] Syncing initialized parameters...
    Traceback (most recent call last):
      File "finetune_visualglm.py", line 194, in <module>
        training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=create_dataset_function, collate_fn=data_collator)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/training/deepspeed_training.py", line 116, in training_main
        model, optimizer = setup_model_untrainable_params_and_optimizer(args, model)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/training/deepspeed_training.py", line 196, in setup_model_untrainable_params_and_optimizer
        dist.broadcast(
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
        return func(*args, **kwargs)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
        work = default_pg.broadcast([tensor], opts)
    RuntimeError: No backend type associated with device type cpu   <------ the problem is here
    [2024-08-23 13:44:22,594] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 102429

1049451037 commented 2 weeks ago

I suspect your machine doesn't have a GPU:

https://github.com/THUDM/VisualGLM-6B/blob/c468ec2e56e02564fcd46f507b32d522d72b8210/finetune_visualglm.py#L179-L180

As this code shows, if CUDA is available the model is put on CUDA, not on the CPU. You could add a breakpoint inside that if to confirm that .cuda() actually runs.
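As a quick check, something like this (an illustrative snippet based on the lines linked above; the print is just a marker):

    model, args = FineTuneVisualGLMModel.from_pretrained(model_type, args)
    if torch.cuda.is_available():
        print('cuda is available, moving model to GPU')   # breakpoint or marker here
        model = model.to('cuda')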

corkiyao commented 2 weeks ago

I suspect your machine doesn't have a GPU:

https://github.com/THUDM/VisualGLM-6B/blob/c468ec2e56e02564fcd46f507b32d522d72b8210/finetune_visualglm.py#L179-L180

As this code shows, if CUDA is available the model is put on CUDA, not on the CPU. You could add a breakpoint inside that if to confirm that .cuda() actually runs.

There are GPUs: it's a single machine with 8 cards, and I only want to use one. After the weights are loaded on the first line, I added print("111111111111111111111111111111") inside if torch.cuda.is_available(), and it did print. I also ran other programs on the GPUs yesterday and they worked fine. I'm fine-tuning on the official few-shot example data, and this problem still occurs.

    [2024-08-23 14:46:54,366] [INFO] [RANK 0] > successfully loaded satckpt/1/mp_rank_00_model_states.pt
    111111111111111111111111111111
    [2024-08-23 14:47:07,513] [INFO] [RANK 0] Try to load tokenizer from Huggingface transformers...
    [2024-08-23 14:47:12,541] [INFO] [RANK 0] > Set tokenizer as a /home/data/yaoyunze/visualglm2/VisualGLM-6B/chatckpt tokenizer! Now you can get_tokenizer() everywhere.
    amax:144840:144840 [0] NCCL INFO Bootstrap : Using enp129s0f0:192.168.1.25<0>
    amax:144840:144840 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
    amax:144840:144840 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
    amax:144840:144840 [0] NCCL INFO cudaDriverVersion 12020
    NCCL version 2.18.5+cuda11.8
    amax:144840:179898 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
    amax:144840:179898 [0] NCCL INFO Failed to open libibverbs.so[.1]
    amax:144840:179898 [0] NCCL INFO NET/Socket : Using [0]enp129s0f0:192.168.1.25<0>
    amax:144840:179898 [0] NCCL INFO Using network Socket
    amax:144840:179898 [0] NCCL INFO comm 0x90c7410 rank 0 nranks 1 cudaDev 0 nvmlDev 1 busId 5000 commId 0x2b39f7ea9901b6b4 - Init START
    amax:144840:179898 [0] NCCL INFO Setting affinity for GPU 1 to 0f,ff000fff
    amax:144840:179898 [0] NCCL INFO Channel 00/32 : 0
    ...
    amax:144840:179898 [0] NCCL INFO Channel 31/32 : 0
    amax:144840:179898 [0] NCCL INFO Trees [0] -1/-1/-1->0->-1 [1] -1/-1/-1->0->-1 ... [31] -1/-1/-1->0->-1
    amax:144840:179898 [0] NCCL INFO P2P Chunksize set to 131072
    amax:144840:179898 [0] NCCL INFO Connected all rings
    amax:144840:179898 [0] NCCL INFO Connected all trees
    amax:144840:179898 [0] NCCL INFO 32 coll channels, 0 nvls channels, 32 p2p channels, 32 p2p channels per peer
    amax:144840:179898 [0] NCCL INFO comm 0x90c7410 rank 0 nranks 1 cudaDev 0 nvmlDev 1 busId 5000 commId 0x2b39f7ea9901b6b4 - Init COMPLETE
    transformer.layers.0.attention.query_key_value.matrix_A.0
    transformer.layers.0.attention.query_key_value.matrix_A.1
    transformer.layers.0.attention.query_key_value.matrix_A.2
    transformer.layers.0.attention.query_key_value.matrix_B.0
    transformer.layers.0.attention.query_key_value.matrix_B.1
    transformer.layers.0.attention.query_key_value.matrix_B.2
    transformer.layers.0.attention.dense.matrix_A.0
    transformer.layers.0.attention.dense.matrix_B.0
    transformer.layers.14.attention.query_key_value.matrix_A.0
    transformer.layers.14.attention.query_key_value.matrix_A.1
    transformer.layers.14.attention.query_key_value.matrix_A.2
    transformer.layers.14.attention.query_key_value.matrix_B.0
    transformer.layers.14.attention.query_key_value.matrix_B.1
    transformer.layers.14.attention.query_key_value.matrix_B.2
    transformer.layers.14.attention.dense.matrix_A.0
    transformer.layers.14.attention.dense.matrix_B.0
    [2024-08-23 14:47:35,179] [INFO] [RANK 0] [<class 'sat.ops.layernorm.LayerNorm'>, <class 'torch.nn.modules.normalization.LayerNorm'>, <class 'sat.ops.layernorm.RMSNorm'>] is set to no_weight_decay
    [2024-08-23 14:47:35,191] [INFO] [RANK 0] Syncing initialized parameters...
    Traceback (most recent call last):
      File "finetune_visualglm.py", line 196, in <module>
        training_main(args, model_cls=model, forward_step_function=forward_step, create_dataset_function=create_dataset_function, collate_fn=data_collator)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/training/deepspeed_training.py", line 116, in training_main
        model, optimizer = setup_model_untrainable_params_and_optimizer(args, model)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/training/deepspeed_training.py", line 196, in setup_model_untrainable_params_and_optimizer
        dist.broadcast(
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
        return func(*args, **kwargs)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1906, in broadcast
        work = default_pg.broadcast([tensor], opts)
    RuntimeError: No backend type associated with device type cpu   <------ the same problem
    [2024-08-23 14:47:47,702] [INFO] [launch.py:319:sigkill_handler] Killing subprocess 144840
    [2024-08-23 14:47:47,975] [ERROR] [launch.py:325:sigkill_handler] ['/home/yaoyunze/anaconda3/envs/visualglm/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=0', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--layer_range', '0', '14', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '1', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '1', '--gradient-accumulation-steps', '4', '--skip-init', '--fp16', '--use_qlora'] exits with return code = 1

The code the error points at is here:

    if not check_if_zero3(args):
        print_rank0('Syncing initialized parameters...')
        for param_group in param_groups:
            for param in param_group['params']:
                if not param.model_parallel:
                    # We already keep the same random seed for different ranks.
                    # However, it is not reliable. Non-model-parallel parameters could be different when initialization.
                    dist.broadcast(param.data,   # <------ not sure why this raises
                        src=0,  # group is default group
                    )
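For context, NCCL only implements collectives for CUDA tensors, so broadcasting a CPU tensor through a NCCL-only process group raises exactly this error. A minimal standalone illustration (assumed single-process setup; not the VisualGLM fix):

    import os
    import torch
    import torch.distributed as dist

    os.environ.setdefault('MASTER_ADDR', '127.0.0.1')
    os.environ.setdefault('MASTER_PORT', '29500')
    dist.init_process_group('nccl', rank=0, world_size=1)

    t = torch.zeros(1)                 # CPU tensor
    try:
        dist.broadcast(t, src=0)       # RuntimeError: No backend type associated with device type cpu
    except RuntimeError as e:
        print(e)

    dist.broadcast(t.cuda(), src=0)    # works: NCCL handles CUDA tensors
    dist.destroy_process_group()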
1049451037 commented 2 weeks ago

Fixed; pull and try again:

https://github.com/THUDM/VisualGLM-6B/blob/e314fb9c4e778851414f39784317c72765acec47/finetune_visualglm.py#L181

corkiyao commented 2 weeks ago

Fixed; pull and try again:

https://github.com/THUDM/VisualGLM-6B/blob/e314fb9c4e778851414f39784317c72765acec47/finetune_visualglm.py#L181

OK.

corkiyao commented 2 weeks ago

Fixed; pull and try again:

https://github.com/THUDM/VisualGLM-6B/blob/e314fb9c4e778851414f39784317c72765acec47/finetune_visualglm.py#L181

It works now. Training runs and uses only 9 GB of VRAM. Thank you very much.

corkiyao commented 2 weeks ago

Fixed; pull and try again:

https://github.com/THUDM/VisualGLM-6B/blob/e314fb9c4e778851414f39784317c72765acec47/finetune_visualglm.py#L181

Well, after QLoRA fine-tuning finishes, the inference stage fails with a dimension mismatch. I looked through earlier issues, but none of them mention this situation...

    [2024-08-23 17:10:35,890] [INFO] [RANK 0] replacing layer 0 attention with lora
    [2024-08-23 17:10:36,839] [INFO] [RANK 0] replacing layer 14 attention with lora
    [2024-08-23 17:10:37,829] [INFO] [RANK 0] replacing chatglm linear layer with 4bit
    [2024-08-23 17:11:50,826] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7802848768
    [2024-08-23 17:12:44,922] [INFO] [RANK 0] global rank 0 is loading checkpoint /home/data/yaoyunze/visualglm4/VisualGLM-6B-main/checkpoints/finetune-visualglm-6b-08-23-16-41/300/mp_rank_00_model_states.pt
    Traceback (most recent call last):
      File "cli_demo.py", line 103, in <module>
        main()
      File "cli_demo.py", line 30, in main
        model, model_args = AutoModel.from_pretrained(
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/base_model.py", line 342, in from_pretrained
        return cls.from_pretrained_base(name, args=args, home_path=home_path, url=url, prefix=prefix, build_only=build_only, overwrite_args=overwrite_args, **kwargs)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/base_model.py", line 336, in from_pretrained_base
        load_checkpoint(model, args, load_path=model_path, prefix=prefix)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/training/model_io.py", line 304, in load_checkpoint
        missing_keys, unexpected_keys = module.load_state_dict(sd['module'], strict=False)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2138, in load_state_dict
        load(self, state_dict)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2126, in load
        load(child, child_state_dict, child_prefix)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2126, in load
        load(child, child_state_dict, child_prefix)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2126, in load
        load(child, child_state_dict, child_prefix)
      [Previous line repeated 3 more times]
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2120, in load
        module._load_from_state_dict(
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/finetune/lora2.py", line 49, in _load_from_state_dict
        self.weight.data.copy_(state_dict[prefix+'weight'])
    RuntimeError: The size of tensor a (12288) must match the size of tensor b (25165824) at non-singleton dimension 0

1049451037 commented 2 weeks ago

Try changing it to this:

    model, model_args = AutoModel.from_pretrained(
        args.from_pretrained,
        args=argparse.Namespace(
            fp16=True,
            skip_init=True,
            use_gpu_initialization=True if (torch.cuda.is_available() and args.quant is None) else False,
            device='cuda' if (torch.cuda.is_available() and args.quant is None) else 'cpu',
        ), build_only=True)
    from sat.training.model_io import load_checkpoint
    load_checkpoint(model, model_args, args.from_pretrained)
corkiyao commented 2 weeks ago

Try changing it to this:

    model, model_args = AutoModel.from_pretrained(
        args.from_pretrained,
        args=argparse.Namespace(
            fp16=True,
            skip_init=True,
            use_gpu_initialization=True if (torch.cuda.is_available() and args.quant is None) else False,
            device='cuda' if (torch.cuda.is_available() and args.quant is None) else 'cpu',
        ), build_only=True)
    from sat.training.model_io import load_checkpoint
    load_checkpoint(model, model_args, args.from_pretrained)

    def main():
        parser = argparse.ArgumentParser()
        parser.add_argument("--max_length", type=int, default=2048, help='max length of the total sequence')
        parser.add_argument("--top_p", type=float, default=0.4, help='top p for nucleus sampling')
        parser.add_argument("--top_k", type=int, default=100, help='top k for top k sampling')
        parser.add_argument("--temperature", type=float, default=.8, help='temperature for sampling')
        parser.add_argument("--english", action='store_true', help='only output English')
        parser.add_argument("--quant", choices=[8, 4], type=int, default=4, help='quantization bits')
        parser.add_argument("--from_pretrained", type=str, default="visualglm-6b", help='pretrained ckpt')
        parser.add_argument("--prompt_zh", type=str, default="描述这张图片。", help='Chinese prompt for the first round')
        parser.add_argument("--prompt_en", type=str, default="Describe the image.", help='English prompt for the first round')
        args = parser.parse_args()

        # load model
        # model, model_args = AutoModel.from_pretrained(
        #     args.from_pretrained,
        #     args=argparse.Namespace(
        #         fp16=True,
        #         skip_init=True,
        #         use_gpu_initialization=True if (torch.cuda.is_available() and args.quant is None) else False,
        #         device='cuda' if (torch.cuda.is_available() and args.quant is None) else 'cpu',
        #     ))

        model, model_args = AutoModel.from_pretrained(
            args.from_pretrained,
            args=argparse.Namespace(
                fp16=True,
                skip_init=True,
                use_gpu_initialization=True if (torch.cuda.is_available() and args.quant is None) else False,
                device='cuda' if (torch.cuda.is_available() and args.quant is None) else 'cpu',
            ), build_only=True)
        from sat.training.model_io import load_checkpoint
        load_checkpoint(model, model_args, args.from_pretrained)

        model = model.eval()

        if args.quant:
            quantize(model, args.quant)
            if torch.cuda.is_available():
                model = model.cuda()
                args.device = 'cuda'

        model.add_mixin('auto-regressive', CachedAutoregressiveMixin())

        tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)

Like this?

1049451037 commented 2 weeks ago

Yes.

corkiyao commented 2 weeks ago

Not quite; it's still the same problem.

    [2024-08-23 17:52:48,818] [INFO] [RANK 0] global rank 0 is loading checkpoint /home/data/yaoyunze/visualglm4/VisualGLM-6B-main/checkpoints/finetune-visualglm-6b-08-23-16-41/300/mp_rank_00_model_states.pt
    Traceback (most recent call last):
      File "cli_demo.py", line 116, in <module>
        main()
      File "cli_demo.py", line 48, in main
        load_checkpoint(model, model_args, args.from_pretrained)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/training/model_io.py", line 304, in load_checkpoint
        missing_keys, unexpected_keys = module.load_state_dict(sd['module'], strict=False)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2138, in load_state_dict
        load(self, state_dict)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2126, in load
        load(child, child_state_dict, child_prefix)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2126, in load
        load(child, child_state_dict, child_prefix)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2126, in load
        load(child, child_state_dict, child_prefix)
      [Previous line repeated 3 more times]
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2120, in load
        module._load_from_state_dict(
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/finetune/lora2.py", line 49, in _load_from_state_dict
        self.weight.data.copy_(state_dict[prefix+'weight'])
    RuntimeError: The size of tensor a (12288) must match the size of tensor b (25165824) at non-singleton dimension 0

corkiyao commented 2 weeks ago

Yes.

Someone asked about this in an earlier issue, so I'll give that library a try. My bitsandbytes is version 0.43.3.

1049451037 commented 2 weeks ago

Like this:

    model, model_args = AutoModel.from_pretrained(
        args.from_pretrained,
        args=argparse.Namespace(
            fp16=True,
            skip_init=True,
            use_gpu_initialization=False,
            device='cpu',
        ), build_only=True)
    model = model.cuda()
    from sat.training.model_io import load_checkpoint
    load_checkpoint(model, model_args, args.from_pretrained)
corkiyao commented 2 weeks ago

Like this:

    model, model_args = AutoModel.from_pretrained(
        args.from_pretrained,
        args=argparse.Namespace(
            fp16=True,
            skip_init=True,
            use_gpu_initialization=False,
            device='cpu',
        ), build_only=True)
    model = model.cuda()
    from sat.training.model_io import load_checkpoint
    load_checkpoint(model, model_args, args.from_pretrained)

Will you still be around before 8 pm tonight? I'm still trying this, but loading is slow, so it will probably take a long time, and I'm afraid I won't be able to solve it.

1049451037 commented 2 weeks ago

It's tricky here mainly because bitsandbytes does the quantization inside the .to('cuda') call. During training the weights being loaded are unquantized, so the order is build the model -> load the weights -> .to('cuda'); but the checkpoint saved during training holds the already-quantized weights, so inference has to build the model -> .to('cuda') -> load the weights.
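A self-contained toy model of this ordering constraint (pure PyTorch; ToyQuantLinear is a hypothetical stand-in, with to_quantized() mimicking how bitsandbytes repacks weights inside .to('cuda')):

    import torch
    import torch.nn as nn

    class ToyQuantLinear(nn.Module):
        # Hypothetical stand-in for a bitsandbytes 4-bit layer: "quantizing"
        # repacks the weight into a flat buffer with a different shape.
        def __init__(self, in_f, out_f):
            super().__init__()
            self.weight = nn.Parameter(torch.randn(out_f, in_f))

        def to_quantized(self):
            packed = self.weight.detach().flatten()          # shape becomes (out_f * in_f,)
            self.weight = nn.Parameter(packed, requires_grad=False)
            return self

    ckpt = ToyQuantLinear(3, 2).to_quantized().state_dict()  # like a checkpoint saved after QLoRA training

    fresh = ToyQuantLinear(3, 2)
    try:
        fresh.load_state_dict(ckpt)       # build -> load: shape mismatch, as in the error above
    except RuntimeError as e:
        print(e)

    ok = ToyQuantLinear(3, 2).to_quantized()
    ok.load_state_dict(ckpt)              # build -> "quantize" -> load: shapes now match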

corkiyao commented 2 weeks ago

It's tricky here mainly because bitsandbytes does the quantization inside the .to('cuda') call. During training the weights being loaded are unquantized, so the order is build the model -> load the weights -> .to('cuda'); but the checkpoint saved during training holds the already-quantized weights, so inference has to build the model -> .to('cuda') -> load the weights.

Got it, I'll try again.

corkiyao commented 2 weeks ago

It's tricky here mainly because bitsandbytes does the quantization inside the .to('cuda') call. During training the weights being loaded are unquantized, so the order is build the model -> load the weights -> .to('cuda'); but the checkpoint saved during training holds the already-quantized weights, so inference has to build the model -> .to('cuda') -> load the weights.

It errored again... Is it a bitsandbytes version problem?

    /home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/nn/init.py:412: UserWarning: Initializing zero-element tensors is a no-op
      warnings.warn("Initializing zero-element tensors is a no-op")
    [2024-08-23 18:06:12,797] [INFO] [RANK 0] replacing layer 0 attention with lora
    [2024-08-23 18:06:13,492] [INFO] [RANK 0] replacing layer 14 attention with lora
    [2024-08-23 18:06:14,199] [INFO] [RANK 0] replacing chatglm linear layer with 4bit
    [2024-08-23 18:07:15,413] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7802848768
    [2024-08-23 18:07:29,400] [INFO] [RANK 0] global rank 0 is loading checkpoint /home/data/yaoyunze/visualglm4/VisualGLM-6B-main/checkpoints/finetune-visualglm-6b-08-23-16-41/300/mp_rank_00_model_states.pt
    Traceback (most recent call last):
      File "cli_demo.py", line 130, in <module>
        if __name__ == "__main__":
      File "cli_demo.py", line 60, in main
        from sat.training.model_io import load_checkpoint
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/training/model_io.py", line 304, in load_checkpoint
        missing_keys, unexpected_keys = module.load_state_dict(sd['module'], strict=False)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2138, in load_state_dict
        load(self, state_dict)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2126, in load
        load(child, child_state_dict, child_prefix)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2126, in load
        load(child, child_state_dict, child_prefix)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2126, in load
        load(child, child_state_dict, child_prefix)
      [Previous line repeated 3 more times]
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 2120, in load
        module._load_from_state_dict(
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/finetune/lora2.py", line 51, in _load_from_state_dict
        copy_nested_list(state_dict[prefix+'quant_state'], self.weight.quant_state)
      File "/home/yaoyunze/anaconda3/envs/visualglm/lib/python3.8/site-packages/sat/model/finetune/lora2.py", line 39, in copy_nested_list
        for i in range(len(dst)):
    TypeError: object of type 'QuantState' has no len()

1049451037 commented 2 weeks ago

Fixed. Pull the latest sat again, then use this in cli_demo:

    # load model
    model, model_args = AutoModel.from_pretrained(
        args.from_pretrained,
        args=argparse.Namespace(
            fp16=True,
            skip_init=True,
            use_gpu_initialization=False,
            device='cpu',
        ), build_only=True)
    model = model.cuda()
    from sat.training.model_io import load_checkpoint
    load_checkpoint(model, model_args, args.from_pretrained)
    model = model.eval()

But you will need to re-train, because the model-saving logic changed as well.

corkiyao commented 2 weeks ago

Fixed. Pull the latest sat again, then use this in cli_demo:

    # load model
    model, model_args = AutoModel.from_pretrained(
        args.from_pretrained,
        args=argparse.Namespace(
            fp16=True,
            skip_init=True,
            use_gpu_initialization=False,
            device='cpu',
        ), build_only=True)
    model = model.cuda()
    from sat.training.model_io import load_checkpoint
    load_checkpoint(model, model_args, args.from_pretrained)
    model = model.eval()

But you will need to re-train, because the model-saving logic changed as well.

OK, I'll retrain.

corkiyao commented 2 weeks ago

Fixed. Pull the latest sat again, then use this in cli_demo:

    # load model
    model, model_args = AutoModel.from_pretrained(
        args.from_pretrained,
        args=argparse.Namespace(
            fp16=True,
            skip_init=True,
            use_gpu_initialization=False,
            device='cpu',
        ), build_only=True)
    model = model.cuda()
    from sat.training.model_io import load_checkpoint
    load_checkpoint(model, model_args, args.from_pretrained)
    model = model.eval()

But you will need to re-train, because the model-saving logic changed as well.

I'm not sure whether it's a training problem, but the predictions look off. Here is my cli_demo.py:

    model, model_args = AutoModel.from_pretrained(
        args.from_pretrained,
        args=argparse.Namespace(
            fp16=True,
            skip_init=True,
            use_gpu_initialization=False,
            device='cpu',
        ), build_only=True)
    model = model.cuda()
    from sat.training.model_io import load_checkpoint
    load_checkpoint(model, model_args, args.from_pretrained)
    model = model.eval()

    model.add_mixin('auto-regressive', CachedAutoregressiveMixin())

    tokenizer = AutoTokenizer.from_pretrained("/home/data/yaoyunze/visualglm2/VisualGLM-6B/chatckpt", trust_remote_code=True)  # local path
    if not args.english:
        print('欢迎使用 VisualGLM-6B 模型,输入图像URL或本地路径读图,继续输入内容对话,clear 重新开始,stop 终止程序')
    else:
        print('Welcome to VisualGLM-6B model. Enter an image URL or local file path to load an image. Continue inputting text to engage in a conversation. Type "clear" to start over, or "stop" to end the program.')
    with torch.no_grad():
        while True:
            history = None
            cache_image = None
            if not args.english:
                image_path = input("请输入图像路径或URL(回车进入纯文本对话): ")
            else:
                image_path = input("Please enter the image path or URL (press Enter for plain text conversation): ")

            if image_path == 'stop':
                break
            if len(image_path) > 0:
                query = args.prompt_en if args.english else args.prompt_zh
            else:
                if not args.english:
                    query = input("用户:")
                else:
                    query = input("User: ")
            while True:
                if query == "clear":
                    break
                if query == "stop":
                    sys.exit(0)
                try:
                    response, history, cache_image = chat(
                        image_path,
                        model,
                        tokenizer,
                        query,
                        history=history,
                        image=cache_image,
                        max_length=args.max_length,
                        top_p=args.top_p,
                        temperature=args.temperature,
                        top_k=args.top_k,
                        english=args.english,
                        invalid_slices=[slice(63823, 130000)] if args.english else []
                        )
                except Exception as e:
                    print(e)
                    break
                sep = 'A:' if args.english else '答:'
                print("VisualGLM-6B:"+response.split(sep)[-1].strip())
                image_path = None
                if not args.english:
                    query = input("用户:")
                else:
                    query = input("User: ")

Results:

    请输入图像路径或URL(回车进入纯文本对话): fewshot-data/2p.png
    VisualGLM-6B:男女走在一起,相互依靠. [a man and a woman walking together, leaning on each other]
    用户:clear
    请输入图像路径或URL(回车进入纯文本对话): fewshot-data/2p.png
    VisualGLM-6B:男女走在下雨的街道上, [a man and a woman walking on a rainy street] (the label is: 这张图片的背景是蒙蒙细雨。 "the background of this image is a light drizzle")
    用户:clear
    请输入图像路径或URL(回车进入纯文本对话): fewshot-data/ghost.jpg
    VisualGLM-6B:这张图片的背景是一张桌子,桌子上有棋盘。 [the background of this image is a table with a chessboard on it] (the label is: 这张图片的背景是一个房间 "the background of this image is a room")
    用户:
corkiyao commented 2 weeks ago

The prompt is: --prompt_zh 这张图片的背景里有什么内容? ("What is in the background of this image?")