facebookresearch / simmc

With the aim of building next generation virtual assistants that can handle multimodal inputs and perform multimodal actions, we introduce two new datasets (both in the virtual shopping domain), the annotation schema, the core technical tasks, and the baseline models. The code for the baselines and the datasets will be opensourced.

Question about retrieval evaluation #17

Closed billkunghappy closed 4 years ago

billkunghappy commented 4 years ago

Hi, I have some questions about the retrieval evaluation for response generation. I'm trying to write the retrieval evaluation script for mm_dst to evaluate the response text, and I hope to clarify some details. I can understand this part of the code:

import numpy as np


def evaluate_response_retrieval(gt_responses, model_scores):
    """Evaluates response retrieval using the raw data and model predictions."""
    # NOTE: Update this later to include gt_index for candidates.
    gt_ranks = []
    for model_datum in model_scores:
        for _round_id, round_datum in enumerate(model_datum["candidate_scores"]):
            gt_score = round_datum[0]
            gt_ranks.append(np.sum(np.array(round_datum) > gt_score) + 1)
            # Best: all other < gt, like -40 < -20
    gt_ranks = np.array(gt_ranks)
    return {
        "r1": np.mean(gt_ranks <= 1),
        "r5": np.mean(gt_ranks <= 5),
        "r10": np.mean(gt_ranks <= 10),
        "mean": np.mean(gt_ranks),
        "mrr": np.mean(1 / gt_ranks)
    }
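For example, with made-up scores for a single round of four candidates (ground truth at index 0), the rank works out as follows:

toy_scores = [{
    "candidate_scores": [
        # One round, four candidates; index 0 is the ground truth with score -20.
        # Only -10 is strictly greater, so the ground-truth rank is 1 + 1 = 2.
        [-20, -40, -10, -35],
    ]
}]
# gt_responses is unused by the function above, so None suffices here.
print(evaluate_response_retrieval(None, toy_scores))
# -> r1 = 0.0, r5 = 1.0, r10 = 1.0, mean rank = 2.0, MRR = 0.5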

The question is about how each score (round_datum[i], for i in [0, len(round_datum))) is calculated. As I understand it, there are 100 candidates for each turn in a dialogue.

The first problem:

The second problem:

satwikkottur commented 4 years ago

Hello @billkunghappy

Thanks for your interest.

First Problem: Each turn in a dialog contains 100 candidates that need to be scored and ranked. At test time, you would not know the "ground truth" candidate and thus need to score each candidate independently. Further, the scoring function to use is completely up to you. Using cross_entropy_loss is one way when training the model as a conditional language model. For this choice of scoring function, you have to use the candidate as both the input and target as you have no knowledge of the "ground truth".
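For concreteness, here is a minimal sketch of how the per-turn scores could be assembled into the model_scores structure that evaluate_response_retrieval consumes. The field names rounds, context, and candidates are illustrative placeholders (not the actual SIMMC JSON keys), and the sketch assumes the ground-truth response is listed first in each round, so that round_datum[0] is the ground-truth score, matching the evaluation code above.

def build_candidate_scores(dialogs, score_fn):
    """Assembles per-turn candidate scores into the model_scores format.

    `score_fn(context, candidate)` is whatever scoring function you pick,
    e.g. the (negative) cross-entropy of the candidate under a conditional
    language model. It never needs to know which candidate is the ground truth.
    """
    model_scores = []
    for dialog in dialogs:
        candidate_scores = []
        for turn in dialog["rounds"]:
            # Score all 100 candidates independently; the ground-truth response
            # is assumed to sit at index 0 of turn["candidates"].
            scores = [score_fn(turn["context"], cand) for cand in turn["candidates"]]
            candidate_scores.append(scores)
        model_scores.append({"candidate_scores": candidate_scores})
    return model_scores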

Second Problem: During retrieval, one does not feed the candidate sentence to "generate" it, but to score its likelihood under the model. If this is what you're talking about, then you feed the actual candidate tokens (not those predicted by the model) to obtain the probability of the next token in the candidate given the previous ones (teacher forcing).
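A minimal sketch of this scoring step, assuming a HuggingFace-style causal language model whose forward pass returns .logits; the model/tokenizer objects and the simple encoding here are illustrative, not the SIMMC baseline's exact implementation:

import torch
import torch.nn.functional as F


def score_candidate(model, tokenizer, context, candidate):
    """Scores one candidate by its log-likelihood under the model (teacher forcing)."""
    # The actual candidate tokens are fed as input; nothing is generated.
    context_ids = tokenizer.encode(context)
    candidate_ids = tokenizer.encode(candidate)
    input_ids = torch.tensor([context_ids + candidate_ids])

    with torch.no_grad():
        logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)

    # The distribution over the t-th candidate token comes from position t-1,
    # i.e. it is conditioned on the context plus the previous candidate tokens.
    start = len(context_ids)
    pred_logits = logits[0, start - 1 : start - 1 + len(candidate_ids)]
    log_probs = F.log_softmax(pred_logits, dim=-1)
    token_log_probs = log_probs[torch.arange(len(candidate_ids)),
                                torch.tensor(candidate_ids)]

    # Summed log-likelihood of the candidate (equivalently, minus its total
    # cross-entropy); higher is better, e.g. -20 beats -40.
    return token_log_probs.sum().item()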

Hope this answers your questions.

P.S.: Edited to avoid overload of the word "ground truth".

billkunghappy commented 4 years ago

Thanks @satwikkottur. For the second problem, you said to feed the ground truth tokens to obtain the probability of the next token, but for the first problem you said that at test time we would not know the "ground truth" candidate. The question is: since we don't have the ground truth during testing, how are we able to feed the ground truth into the model and obtain the candidate's probability?

satwikkottur commented 4 years ago

Hello @billkunghappy ,

I edited the above comment to add more clarity. Hope this addresses your question.