Heidelberg-NLP / VALSE

Data repository for the VALSE benchmark.
https://aclanthology.org/2022.acl-long.567/
MIT License

Release the scoring scripts? #1

Closed: Wangt-CN closed this issue 2 years ago

Wangt-CN commented 2 years ago

Hi, thanks a lot for this great work, which I find quite inspiring! I wonder if it would be possible to release the scoring scripts needed to reproduce the results (e.g., Table 2) in your paper.

Looking forward to your reply!

LetiP commented 2 years ago

Thanks for your interest! :blush: I am looking into this and will get back to you soon. Do I understand correctly that you are interested in evaluation scripts where, let's say, you have the output of model A and want to compare it with the ground-truth / gold labels?

yu-wyatt-wu commented 2 years ago

I am also interested in the evaluation scripts and would be grateful if you could share them! It would be even better if the script also included code for some baselines, such as 12-in-1 (which seems to be the strongest).

LetiP commented 2 years ago

Please have a look at the script that runs the LXMERT evaluation on VALSE. Hope it helps.

skshvl commented 9 months ago

Hi, I had a question about the following portion of the LXMERT script, where test_sentences contains [caption, foil]:

inputs = lxmert_tokenizer(
    test_sentences,
    padding="max_length",
    max_length=30,  # 20
    truncation=True,
    return_token_type_ids=True,
    return_attention_mask=True,
    add_special_tokens=True,
    return_tensors="pt"
)

output_lxmert = lxmert_base(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    visual_feats=features,
    visual_pos=normalized_boxes,
    token_type_ids=inputs.token_type_ids,
    return_dict=True,
    output_attentions=False,
)

m = torch.nn.Softmax(dim=1)
output = m(output_lxmert['cross_relationship_score'])
cross_score = output_lxmert['cross_relationship_score']

out_dict = {}
out_dict['lxmert'] = {'caption': 0, 'foil': 0} # 0 is not detected, 1 is detected
out_dict['lxmert']['caption'] = output[0, 1].item() # probability of fitting should be close to 1 for captions
out_dict['lxmert']['foil'] = output[1, 0].item() # probability of fitting, should be close to 0 for foils

Specifically, my question concerns the last few lines. My understanding is that each row of the output variable contains the probabilities of that sentence being [incorrect, correct]. So shouldn't we be accessing output[0, 1] and output[1, 1], since we are interested in the fit probability of the caption and the foil, respectively? My concern is that by accessing output[1, 0] for the foil, we are effectively treating the foil's "incorrectness probability" (the first column) as its likelihood score. Since LXMERT often puts this first number quite low and the correctness probability higher, this approach risks giving the foil a low score regardless. Does that make sense?

And if so, I was also wondering whether the softmax should be taken along dimension 0 instead, so that the two positive ("fit") scores sum to 1 and can be used directly as a classification score. But that is a less important issue.
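
For concreteness, here is a minimal sketch of the indexing I have in mind, continuing from the snippet above. It assumes output_lxmert from that code and my (unconfirmed) reading that column 1 of cross_relationship_score is the "fits the image" logit:

import torch

# Sketch only: assumes output_lxmert from the snippet above and that column 1
# of cross_relationship_score is the "matches the image" logit.
cross_score = output_lxmert['cross_relationship_score']  # shape (2, 2): row 0 = caption, row 1 = foil

# Per-sentence softmax (dim=1): each row becomes [p(mismatch), p(match)].
probs = torch.softmax(cross_score, dim=1)
caption_fit = probs[0, 1].item()  # fit probability for the caption
foil_fit = probs[1, 1].item()     # fit probability for the foil, i.e. column 1 in both cases

# My second point: softmax across the pair (dim=0) over the match logits,
# so the two fit scores sum to 1 and can be read as a single classification score.
pair_probs = torch.softmax(cross_score[:, 1], dim=0)
caption_preferred = pair_probs[0].item()  # > 0.5 means the caption scores higher than the foil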

Thanks!

skshvl commented 9 months ago

Related to this, the MM-SHAP implementation of a very similar task accesses the second column of each LXMERT output score, which seems to match what I am saying: https://github.com/Heidelberg-NLP/MM-SHAP/blob/main/mm-shap_lxmert_dataset.py#L262 (line 262)

LetiP commented 9 months ago

Hi @skshvl ,

Thanks for pointing this out! Indeed, either that line of code or the description in its comment is wrong (for the code in the version you are referring to, the comment should read "should be close to 1 for foils", not 0). I have just edited the line.

However, for computing the tables in the paper, I used the script lines starting here, not the output dictionaries we have been discussing above, so the accessed values and the reported results are correct.
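
To illustrate the kind of comparison those lines perform, here is a rough sketch (not the actual script lines; it assumes the (2, 2) cross_relationship_score layout from the snippet above, with rows [caption, foil] and the match logit in column 1):

import torch

def pairwise_correct(cross_score: torch.Tensor) -> bool:
    # Rough sketch: an instance counts as correct when the caption's match
    # probability exceeds the foil's (row 0 = caption, row 1 = foil).
    probs = torch.softmax(cross_score, dim=1)  # per-sentence [p(mismatch), p(match)]
    return probs[0, 1].item() > probs[1, 1].item()

Averaging this over all instances would give a pairwise ranking accuracy of the kind reported in the paper.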