X-PLUG / mPLUG-Owl

mPLUG-Owl & mPLUG-Owl2: Modularized Multimodal Large Language Model
https://www.modelscope.cn/studios/damo/mPLUG-Owl
MIT License

Computing output likelihoods with the model #16

Open vishaal27 opened 1 year ago

vishaal27 commented 1 year ago

Hi, is it possible to get the tokenwise log-likelihood scores of different outputs from the model?

The use case would be something like: given an interleaved image/text input and a list of output text candidates, we should be able to get a score for each candidate and return their ranked list, rather than generating the outputs directly. This is close to how LLMs are evaluated on multiple-choice tasks. An example from page 6 of the T0 paper (https://arxiv.org/pdf/2110.08207.pdf):

For tasks that involve choosing the correct completion from several options (e.g. multiple choice question answering), we follow Brown et al. (2020) and use rank classification to evaluate our model: we compute the log-likelihood of each of the target options under the fine-tuned model and select the option with the highest log-likelihood as the prediction. For simplicity, we do not apply length normalization to the log-likelihoods of the target options.

Is it straightforward to do this with mPLUG-Owl? Since the LM is built on transformers, I assume it should be possible to use the output score functions that are already implemented (I haven't dug into this yet)?
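
For reference, rank classification with a plain transformers causal LM looks roughly like the following sketch (text only, no image input; the gpt2 checkpoint here is only a placeholder and is not part of mPLUG-Owl):

import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

# placeholder checkpoint, only for illustrating rank classification
tokenizer = AutoTokenizer.from_pretrained('gpt2')
model = AutoModelForCausalLM.from_pretrained('gpt2').eval()

def option_log_likelihood(context, option):
    # sum of token log-probabilities of `option` given `context`
    context_ids = tokenizer.encode(context, return_tensors='pt')
    full_ids = tokenizer.encode(context + option, return_tensors='pt')
    with torch.no_grad():
        logits = model(input_ids=full_ids).logits  # (1, seq_len, vocab)
    # the log-prob of token t is read from the logits at position t-1
    log_probs = F.log_softmax(logits[:, :-1], dim=-1)
    token_ll = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    # keep only the positions that belong to the option (the continuation)
    option_len = full_ids.shape[1] - context_ids.shape[1]
    return token_ll[0, -option_len:].sum().item()

# leading spaces keep BPE token boundaries clean for GPT-2-style tokenizers
options = [' dog', ' horse', ' cat']
scores = {o.strip(): option_log_likelihood('A photo of a', o) for o in options}
prediction = max(scores, key=scores.get)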

vishaal27 commented 1 year ago

Hi, I tried a quick implementation to compute the output likelihoods of a given interleaved image-text token sequence:

import torch
from PIL import Image

def get_class_log_likelihoods(image_path, classes, model, tokenizer, img_processor, device='cuda', dtype=torch.bfloat16):

    img = Image.open(image_path).convert('RGB')

    image_tensor = img_processor([img]).to(dtype)
    image_tensor = image_tensor.to(device)

    class_log_likelihoods = []
    for class_name in classes:
        prompt = f'<image> A photo of a {class_name}'
        input_ids = tokenizer.encode(prompt, return_tensors='pt').to(device)
        num_images_tensor = torch.tensor([1]).to(device)

        # Create non_padding_mask, non_media_mask, and prompt_mask
        non_padding_mask = (input_ids != tokenizer.pad_token_id).to(dtype)[:,:-1]
        non_media_mask = torch.ones_like(non_padding_mask).to(dtype)
        prompt_mask = torch.zeros_like(non_padding_mask).to(dtype)

        with torch.no_grad():
            outputs = model(input_ids=input_ids, pixel_values=image_tensor, labels=input_ids, num_images=num_images_tensor,
                            non_padding_mask=non_padding_mask, non_media_mask=non_media_mask, prompt_mask=prompt_mask)
        print(outputs.loss)
        log_likelihood = -outputs.loss.item()
        class_log_likelihoods.append((class_name, log_likelihood))

    return class_log_likelihoods

However, this code gives me NaN for all the loss values, and I am not sure if I am processing the tokens or passing them into the forward correctly. Could you please check what the issue with this implementation is? @MAGAer13 @LukeForeverYoung @butyuhao

LukeForeverYoung commented 1 year ago

Elements in non_media_mask are all set to 1, resulting in a NaN loss value. You should mask out the positions where the loss would be generated from image tokens, as we do during fine-tuning.

See the code here:

https://github.com/X-PLUG/mPLUG-Owl/blob/main/mplug_owl/modeling_mplug_owl.py#L1131
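
The idea is simply that image-token positions should not contribute to the language-modeling loss. A minimal sketch of this, using the common convention that label positions set to -100 are ignored by the cross-entropy loss (only an illustration, not the exact masking code in modeling_mplug_owl.py):

import torch

def mask_media_positions(labels, media_mask):
    # labels:     (batch, seq_len) token ids used as loss targets
    # media_mask: (batch, seq_len), 1 at image-token positions, 0 elsewhere
    masked = labels.clone()
    masked[media_mask.bool()] = -100  # ignored by cross_entropy with ignore_index=-100
    return masked

# toy example: the first three positions are image tokens and contribute no loss
labels = torch.tensor([[11, 12, 13, 42, 43, 44]])
media_mask = torch.tensor([[1, 1, 1, 0, 0, 0]])
print(mask_media_positions(labels, media_mask))  # tensor([[-100, -100, -100, 42, 43, 44]])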

Besides, it may be better to collect the logits of each class from the model and apply a softmax across them, instead of using the loss value directly.
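
One way to read this: keep a per-class score (e.g. the summed log-likelihood of the class tokens) and normalize it across the candidate set, roughly like this sketch (assuming a list of (class_name, score) pairs as returned by the function above):

import torch

def rank_classes(class_log_likelihoods):
    # class_log_likelihoods: list of (class_name, log_likelihood) pairs
    names = [name for name, _ in class_log_likelihoods]
    scores = torch.tensor([score for _, score in class_log_likelihoods])
    probs = torch.softmax(scores, dim=0)  # normalized over the candidate classes only
    # best candidate first
    return sorted(zip(names, probs.tolist()), key=lambda x: x[1], reverse=True)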

vishaal27 commented 1 year ago

Thanks for the response. I have updated the function so that the loss is not computed over the image tokens and is only computed over the tokens corresponding to each class name. This is my function now:

import torch
from PIL import Image

from mplug_owl.tokenize_utils import tokenize_prompts

def get_class_log_likelihoods(image_path, classes, model, tokenizer, img_processor, device='cuda', dtype=torch.bfloat16):

    img = Image.open(image_path).convert('RGB')

    image_tensor = img_processor([img]).to(dtype)
    image_tensor = image_tensor.to(device)

    class_log_likelihoods = []
    for class_name in classes:
        context_prompt = '<image> A photo of a'
        context_prompt_with_class = context_prompt + ' {}'.format(class_name)

        context_tokens_tensor, _, _ = tokenize_prompts(prompts=[context_prompt], tokens_to_generate=0, add_BOS=False, tokenizer=tokenizer, ignore_dist=True)
        context_tokens_with_class_tensor, _, _ = tokenize_prompts(prompts=[context_prompt_with_class], tokens_to_generate=0, add_BOS=False, tokenizer=tokenizer, ignore_dist=True)

        # get number of tokens in the class only (without the context)
        # this is used to construct the prompt_mask -- masking out the parts where
        # we don't want the model to compute the loss
        num_class_tokens = context_tokens_with_class_tensor.shape[1] - context_tokens_tensor.shape[1]

        # construct prompt_mask for computing the loss similar to as done here:
        # https://github.com/X-PLUG/mPLUG-Owl/blob/d0c9aded55a3622970166fa8a431590651fecee4/data_utils/xgpt3_dataset.py#L257
        # unsqueeze to match other mask dimensions and take from 1st token to match loss_mask in forward
        # slicing essentially removes one image token which is not computed in loss anyway
        prompt_mask = [0] * context_tokens_tensor.shape[1] + [1] * num_class_tokens
        prompt_mask = torch.tensor(prompt_mask[1:]).to(device).unsqueeze(0)

        context_tokens_with_class_tensor = context_tokens_with_class_tensor.to(device)
        context_tokens_tensor = context_tokens_tensor.to(device)

        num_images_tensor = torch.tensor([1]).to(device)

        # construct mask for all non-padding tokens
        # usually will be all 1s
        # again slice from the first token to match the shifted labels
        non_padding_mask = (context_tokens_with_class_tensor != tokenizer.pad_token_id).long().to(device)[:, 1:]

        # compute non media mask i.e. mask for text only tokens as done here:
        # https://github.com/X-PLUG/mPLUG-Owl/blob/d0c9aded55a3622970166fa8a431590651fecee4/data_utils/xgpt3_dataset.py#LL264C1-L268C55
        tmp_enc_chunk = context_tokens_with_class_tensor[:, 1:].clone()
        tmp_enc_chunk[tmp_enc_chunk >= 0] = 1
        tmp_enc_chunk[tmp_enc_chunk < 0] = 0
        non_media_mask = tmp_enc_chunk.long()

        with torch.no_grad():
            outputs = model(input_ids=context_tokens_with_class_tensor, pixel_values=image_tensor, labels=context_tokens_with_class_tensor, num_images=num_images_tensor,
                            non_padding_mask=non_padding_mask, non_media_mask=non_media_mask, prompt_mask=prompt_mask)
        log_likelihood = -outputs.loss.item()
        class_log_likelihoods.append((class_name, log_likelihood))

    return class_log_likelihoods
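
For completeness, calling it and ranking the candidates would look something like this (the image path and the already-loaded model, tokenizer and img_processor are placeholders):

# placeholders: 'example.jpg' and an already-loaded model/tokenizer/img_processor
classes = ['dog', 'horse', 'cat']
scores = get_class_log_likelihoods('example.jpg', classes, model, tokenizer, img_processor)
# rank candidates by log-likelihood, best first
ranked = sorted(scores, key=lambda x: x[1], reverse=True)
print(ranked[0][0])  # predicted class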

This seems to be working fine for classifying simple animal images, e.g. dogs vs. horses vs. cats. I assume this should work fine for ImageNet classification as well. Do you see anything you would flag or modify in this function, or does it look good to you? @MAGAer13 @LukeForeverYoung @butyuhao