TIGER-AI-Lab / VLM2Vec

This repo contains the code and data for "VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks"
https://tiger-ai-lab.github.io/VLM2Vec/
Apache License 2.0

Question about the configuration #3

Open URRealHero opened 4 hours ago

URRealHero commented 4 hours ago

Q1: I'm quite new to this field. I see you set hidden_dim to 4096, which differs from Phi3.5V's original 3072, without training again. Won't this modification degrade performance?

Q2: Also, in your model's build method, padding_side is set to 'right', but in the modeling_Phi3V code I found that it uses left padding:

        if attention_mask is not None and self._attn_implementation == "flash_attention_2" and use_cache:
            is_padding_right = attention_mask[:, -1].sum().item() != batch_size
            if is_padding_right:
                raise ValueError(
                    "You are attempting to perform batched generation with padding_side='right'"
                    " this may lead to unexpected behaviour for Flash Attention version of Phi3. Make sure to "
                    " call `tokenizer.padding_side  = 'left'` before tokenizing the input. "
                )

What does right padding mean, and why does your setting still work despite this warning about generation?
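
For concreteness, a minimal sketch of what the two padding sides produce; gpt2's tokenizer is used here only because it is small, and any causal-LM tokenizer behaves the same way:

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    tok.pad_token = tok.eos_token  # gpt2 ships without a pad token

    for side in ("right", "left"):
        tok.padding_side = side
        batch = tok(["short", "a much longer input sentence"],
                    padding=True, return_tensors="pt")
        # With right padding the pad tokens trail the real tokens, so the
        # last column of attention_mask is 0 for the shorter row -- exactly
        # the condition the quoted Phi3 flash-attention check rejects.
        print(side, batch["attention_mask"])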

Looking forward to your reply.

XMHZZ2018 commented 2 hours ago

@URRealHero

Hi, thank you for your interest in our work!

For your first question, could you please remind me where hidden_dim = 4096 is set? Sorry, I don't recall it; I thought the dimension was 3072. (I just double-checked: the output dimension is 3072.)
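
For anyone who wants to verify, the hidden size can be read straight from the backbone's config; a minimal check, assuming network access and the public Phi-3.5-vision checkpoint:

    from transformers import AutoConfig

    # Model id assumed; trust_remote_code is required for Phi-3.5-vision.
    cfg = AutoConfig.from_pretrained(
        "microsoft/Phi-3.5-vision-instruct", trust_remote_code=True
    )
    print(cfg.hidden_size)  # 3072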

Regarding your second question: yes, the default padding side for phi-3.5-v is left. However, since ours is an embedding model that won't be generating new tokens, the padding side should not affect the output.
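
As a toy sketch of why (assuming mean pooling over non-pad tokens, and ignoring position-id effects inside the transformer): with a correct attention mask, masked pooling produces the same embedding wherever the pad positions sit.

    import torch

    torch.manual_seed(0)
    tokens = torch.randn(3, 8)                      # three real token states
    pad = torch.zeros(1, 8)                         # one pad position

    right = torch.cat([tokens, pad]).unsqueeze(0)   # [T1 T2 T3 PAD]
    left = torch.cat([pad, tokens]).unsqueeze(0)    # [PAD T1 T2 T3]
    mask_right = torch.tensor([[1.0, 1.0, 1.0, 0.0]])
    mask_left = torch.tensor([[0.0, 1.0, 1.0, 1.0]])

    def masked_mean(hidden, mask):
        # zero out pad positions, then average over real tokens only
        return (hidden * mask.unsqueeze(-1)).sum(1) / mask.sum(1, keepdim=True)

    print(torch.allclose(masked_mean(right, mask_right),
                         masked_mean(left, mask_left)))  # True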

URRealHero commented 7 minutes ago

Thanks a lot! I reviewed your code and found that demo.py and train.py call the classmethods load and build directly, so the output dimension is the same as Phi3V's default config (3072). Sorry about that; I asked because I had previously seen the following:

class MMEBModel(nn.Module):
    TRANSFORMER_CLS = AutoModelForCausalLM

    def __init__(self,
                 encoder: PreTrainedModel,
                 pooling: str = 'cls',
                 normalize: bool = False,
                 temperature: float = 1.0,
                 ):
        super().__init__()
        self.config = encoder.config
        self.config.hidden_size = 4096  # hard-coded; differs from Phi3V's default (3072)
        self.hidden_size = 4096         # likewise hard-coded
        self.encoder = encoder
        self.pooling = pooling
        self.normalize = normalize
        self.temperature = temperature
        self.cross_entropy = nn.CrossEntropyLoss(reduction='mean')
        self.is_ddp = dist.is_initialized()
        if self.is_ddp:
            self.process_rank = dist.get_rank()
            self.world_size = dist.get_world_size()

which is not used during inference or training.
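
To make that concrete, a toy sketch (not the repo's classes) showing that overwriting a stored hidden_size attribute does not change the dimension the encoder actually emits:

    import torch
    import torch.nn as nn

    encoder = nn.Embedding(num_embeddings=10, embedding_dim=3072)

    class Wrapper(nn.Module):
        def __init__(self, encoder):
            super().__init__()
            self.encoder = encoder
            self.hidden_size = 4096  # inert: nothing below ever reads it

        def forward(self, ids):
            # the output width is fixed by the encoder's weights
            return self.encoder(ids)

    out = Wrapper(encoder)(torch.tensor([1, 2, 3]))
    print(out.shape)  # torch.Size([3, 3072])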

For the second question, thanks a lot; I agree the result should be similar with either padding side.

Thanks for your reply.