How is the attention mask computed?

hust-nj commented 5 months ago

https://github.com/ParadoxZW/LLaVA-UHD-Better/blob/main/llava_uhd/adapt_llava.py#L136-L138

Since the first token here is for CLS, shouldn't

m[:w * h] = True

be changed to

m[:w * h+1] = True

?
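To make the off-by-one concrete, here is a minimal framework-free sketch. It assumes the token sequence is laid out as one CLS token followed by a fixed number of patch slots, of which only the first `w * h` hold real patches. `build_valid_mask`, `max_patches`, and `has_cls` are hypothetical names for illustration, not identifiers from the repository.

```python
def build_valid_mask(w: int, h: int, max_patches: int, has_cls: bool = True) -> list[bool]:
    """Return a per-token validity mask (True = real token, False = padding).

    Assumed layout (an illustration, not the repository's code):
    index 0 is the CLS token, followed by max_patches patch slots,
    of which only the first w * h hold real patches.
    """
    n_valid = w * h
    if n_valid > max_patches:
        raise ValueError("w * h exceeds the number of patch slots")
    offset = 1 if has_cls else 0
    mask = [False] * (max_patches + offset)
    # The CLS token shifts every patch index by one, so the slice must
    # end at w * h + 1 (not w * h) to cover the last real patch.
    mask[: n_valid + offset] = [True] * (n_valid + offset)
    return mask


# 2 x 3 = 6 real patches in 8 slots, plus CLS: 7 valid tokens out of 9.
mask = build_valid_mask(w=2, h=3, max_patches=8)
print(sum(mask), len(mask))  # prints: 7 9
```

With the slice ending at `w * h` instead, index 6 (the last real patch in this example) would stay `False` and be treated as padding.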

hust-nj commented 5 months ago

Also, I checked the CLIP code: in CLIP's attention mask, shouldn't masked positions be indicated by -float('inf') rather than by 0 or 1?

ParadoxZW commented 4 months ago

> Since the first token here is for CLS, shouldn't `m[:w * h] = True` be changed to `m[:w * h+1] = True`?

Indeed, you are right. I will fix this bug in an upcoming update.

ParadoxZW commented 4 months ago

> Also, I checked the CLIP code: in CLIP's attention mask, shouldn't masked positions be indicated by -float('inf') rather than by 0 or 1?

This is what I copied from the docstring of the forward function of the CLIPEncoder class in ~/miniconda3/envs/llv/lib/python3.10/site-packages/transformers/models/clip/modeling_clip.py:

attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
                Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:

                - 1 for tokens that are **not masked**,
                - 0 for tokens that are **masked**.

                [What are attention masks?](../glossary#attention-mask)

Could it be that we are using different versions of transformers?

hust-nj commented 4 months ago

I noticed that docstring too, but when I read the source code, it looks like this attention_mask is added directly to the attention weights?
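The difference matters because an additive mask only hides a position if it pushes that position's logit to negative infinity (or to a very large negative number). The following is a minimal sketch in plain Python, assuming the mask is simply added to the attention logits before the softmax, as described above. The `softmax` helper and the example logits are illustrative only.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5, 0.5]  # assume the last two positions are padding

# A 0/1 mask (1 = keep, 0 = masked) added directly to the logits:
# the padding positions still receive attention weight.
binary_mask = [1.0, 1.0, 0.0, 0.0]
weights_binary = softmax([x + b for x, b in zip(logits, binary_mask)])

# An additive mask (0 = keep, -inf = masked) added to the logits:
# the padding positions receive exactly zero weight.
additive_mask = [0.0, 0.0, -math.inf, -math.inf]
weights_additive = softmax([x + a for x, a in zip(logits, additive_mask)])

print([round(p, 3) for p in weights_binary])
print([round(p, 3) for p in weights_additive])
```

In the first case the 0/1 values merely shift the logits by at most 1, so nothing is actually masked; in the second case the padded positions drop out of the softmax entirely.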

ParadoxZW commented 4 months ago

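Thank you for your attention and feedback.

P.S. Support for training with the vision encoder unfrozen still has a minor issue, which I will fix as soon as possible (mainly because I have had no GPUs available for the past two days, so training has been delayed). If you plan to run experiments, I suggest leaving this option off for now.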

ParadoxZW commented 4 months ago

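You are right!!! This looks like a small bug on the official side. They probably did not expect anyone to want masking on the vision side, so they overlooked it. I read the code: the CLIPEncoder definition is used by both CLIPTextTransformer and CLIPVisionTransformer. The forward function of the former calls _expand_mask, which appears to perform the correct conversion, whereas CLIPVisionTransformer makes no such call. Either way, that docstring explaining attention_mask should not appear in CLIPEncoder's forward.

Thanks again for your insight!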
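For reference, the kind of conversion being discussed can be sketched as follows: a `(sequence_length,)` mask of 1s and 0s is inverted into an additive mask and repeated for every query position, giving a `(target_length, source_length)` grid. This is a plain-Python illustration under that assumption, not the transformers implementation. `to_additive_mask` and its parameters are hypothetical names, and a large finite negative value stands in for negative infinity so that a fully masked row does not turn into NaNs after the softmax.

```python
def to_additive_mask(keep_mask, tgt_len=None, neg_value=-1e9):
    """Convert a 1/0 keep-mask into an additive attention mask.

    keep_mask: sequence of 1 (attend) / 0 (masked) per key position.
    tgt_len:   number of query positions; defaults to len(keep_mask).
    Returns a tgt_len x src_len grid holding 0.0 where attention is
    allowed and neg_value where it is masked, ready to be added to
    the attention logits before the softmax.
    """
    src_len = len(keep_mask)
    tgt_len = src_len if tgt_len is None else tgt_len
    row = [0.0 if keep else neg_value for keep in keep_mask]
    # Every query position sees the same set of valid key positions.
    return [list(row) for _ in range(tgt_len)]


additive = to_additive_mask([1, 1, 0])
for r in additive:
    print(r)
```

A vision-side caller that passes a raw 1/0 mask into an encoder expecting this additive form would need to apply an equivalent conversion first.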

ParadoxZW commented 4 months ago

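If you like, you can submit a PR against the current version of the code to fix the issues above, and I will merge it. Of course, I can also make the update myself.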

hust-nj commented 4 months ago

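Yes, I have also filed this as an issue with the official transformers repository. The original LLaVA-UHD code really has too many bugs, and your codebase has been a great help to me in reproducing it :)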

ParadoxZW commented 4 months ago

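@hust-nj Hello, I have fixed both of these bugs in the code. Could you take another look and check whether anything is wrong (especially the attention part)?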

ParadoxZW commented 4 months ago

@hust-nj The bug in training with the vision encoder being fine-tuned has been fixed.

aosong01 commented 4 months ago

I have a question: I updated to the latest attention mask usage, but the loss is 0 during training. Could this be related to the attention mask?

ParadoxZW commented 4 months ago

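@aosong01 With the current code, the loss is normal on my machine. I ran into this situation before, and it was caused by a learning rate that was too large. Could you try lowering the learning rate?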

aosong01 commented 4 months ago

I tried reducing the learning rate to 5e-5, 1e-5, and 1e-6, but the loss is still 0.

deepspeed --master_port 49642 llava_uhd/train.py \
    --deepspeed ./scripts/zero2.json \
    --model_name_or_path /mnt/bn/cas/data/ckpt/vicuna-13b-v1.5 \
    --version plain \
    --data_path /mnt/bn/cas/data/LLaVA-pretrain/blip_laion_cc_sbu_558k.json \
    --image_folder /mnt/bn/cas/data/LLaVA-pretrain/images \
    --vision_tower /mnt/bn/cas/data/ckpt/clip-vit-large-patch14-336 \
    --mm_projector_type mlp2x_gelu \
    --tune_mm_mlp_adapter True \
    --mm_vision_select_layer -2 \
    --mm_use_im_start_end False \
    --mm_use_im_patch_token False \
    --bf16 True \
    --output_dir ./checkpoints/llava-v1.5-13b-pretrain \
    --num_train_epochs 1 \
    --per_device_train_batch_size 16 \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps 2 \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 24000 \
    --save_total_limit 1 \
    --learning_rate 1e-6 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --tf32 True \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to none

hust-nj commented 4 months ago

After fine-tuning, the tokens my model predicts are all 0. Are your evaluation scores currently normal? @ParadoxZW

ParadoxZW commented 4 months ago

My evaluation scores are basically normal (they do not seem to reach the paper's numbers yet, but they are certainly not all 0).

ParadoxZW / LLaVA-UHD-Better

How is the attention mask computed? #3