Closed · millenniumbismay closed this issue 5 months ago
I am using `meta-llama/Llama-2-7b-chat-hf` as the base model, in case that helps shed some light.
I got it... The -100 is automatically added by `transformers.DataCollatorForSeq2Seq()`; it is ignored by PyTorch loss functions but creates a problem when trying to decode back. Can you shed some light on whether we can convert the logits in `preprocess_logits_for_metrics` to labels, and then to text, using the following code?
```python
import torch

# Softmax is optional here: argmax over the raw logits yields the same token ids.
logits = logits.softmax(dim=-1)
predicted_labels = torch.argmax(logits, dim=-1)
print("Predicted:", tokenizer.batch_decode(predicted_labels, skip_special_tokens=False, clean_up_tokenization_spaces=True))
```
-100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, so no loss is computed at positions labeled -100. In other words, if you fill a position with -100, that position is skipped during the loss computation.
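A minimal sketch of that behavior, using toy logits and labels (the names here are illustrative, not from this thread):

```python
import torch
import torch.nn as nn

# CrossEntropyLoss skips targets equal to ignore_index, which defaults to -100.
loss_fn = nn.CrossEntropyLoss()            # ignore_index=-100 by default
logits = torch.randn(4, 10)                # 4 positions, vocabulary of size 10
labels = torch.tensor([3, 7, -100, -100])  # last two positions are masked out

masked = loss_fn(logits, labels)
# With the default "mean" reduction, ignored positions are excluded from both
# the sum and the denominator, so this matches the loss over the first two
# positions alone:
unmasked = loss_fn(logits[:2], labels[:2])
print(masked.item(), unmasked.item())      # identical values
```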
Somehow, I am seeing -100 appended to the ground-truth labels inside `preprocess_logits_for_metrics`, and those values cannot be decoded back to a string by `tokenizer.batch_decode()`. Just to make sure: `train_on_inputs = True`, so the following block of code doesn't run -
I tested this by commenting out that part, and the labels still contain -100. Could anyone explain, please? I can remove them manually, but I don't understand why they appear in the first place.
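For anyone hitting the same thing: `DataCollatorForSeq2Seq` pads the labels in each batch to a common length with its `label_pad_token_id`, which defaults to -100, so the padding appears regardless of `train_on_inputs`. A minimal sketch of the usual workaround before decoding, mirroring the Hugging Face example scripts (`decode_labels` is a hypothetical helper; `labels` and `tokenizer` are assumed to come from your metrics context, and the tokenizer is assumed to have a pad token set, e.g. `tokenizer.pad_token = tokenizer.eos_token` for Llama-2):

```python
import numpy as np

def decode_labels(labels, tokenizer):
    # -100 is only a loss mask, not a real vocabulary id, so swap it for the
    # pad token id before handing the ids to the tokenizer.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    return tokenizer.batch_decode(labels, skip_special_tokens=True)
```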