OpenNMT / CTranslate2

Fast inference engine for Transformer models
https://opennmt.net/CTranslate2
MIT License

Feature request: Word level scores #1077

Open ItakeLs opened 1 year ago

ItakeLs commented 1 year ago

Hi, currently the decoder produces sentence-level scores. Instead of just outputting the average, another option would be to produce the score of each word/token.

Beam search might make it harder to calculate a meaningful individual score, but for greedy search I am assuming it's just a matter of not averaging the scores together; please correct me if I'm wrong.

ItakeLs commented 1 year ago

Hello, I have managed to (I think) add scores for each token in greedy search; it still needs some testing. Now I am trying to do the same for beam search.

In whisper.h I changed the result struct to have an extra member for the token scores:

struct WhisperGenerationResult {
  std::vector<std::vector<std::string>> sequences;
  std::vector<std::vector<size_t>> sequences_ids;
  std::vector<float> scores;
  // New: one score per token, per hypothesis.
  std::vector<std::vector<float>> token_scores;
};

In decoding.cc, around line 570:

// Register this hypothesis.
const StorageView& scores = ignore_last_score ? topk_scores_prev : topk_scores;
result.scores.emplace_back(scores.scalar_at<float>({i, k}));

// my addition (not yet right: `scores` here is a whole StorageView,
// not a per-token value)
result.token_scores.emplace_back(scores);

result.hypotheses.emplace_back(build_hypothesis(alive_seq, i, k, ignore_last_token));
if (alive_attention)
   result.attention.emplace_back(build_attention(alive_attention, i, k, ignore_last_token));

I am trying to understand how the scores are stored and how I can get the score for each token instead of the combined score.

guillaumekln commented 1 year ago

You need to store the topk_scores of each step somewhere, similar to how topk_ids is stored in alive_seq.

Note that the token scores are cumulative: the score at each step includes the scores of all previous steps. So you have to compute the adjacent different to get the actual score of a token.

ItakeLs commented 1 year ago

Thanks, that makes sense; I will try that. Also, do you have any resources on computing the "adjacent different"? I am not sure what that means.

For the greedy search I made a couple of changes, but the main one was this:

if (word_id != end_id || include_eos_in_scores) {
  if (return_scores) {
    results[batch_id].scores[0] += best_probs.scalar_at<float>({i, 0});
    results[batch_id].token_scores[0].push_back(best_probs.scalar_at<float>({i, 0}));
  }
}

Is the best_probs scalar the normalized log prob for that specific token?

guillaumekln commented 1 year ago

Sorry I meant "adjacent difference". It just means that to get the actual score of the token at step i, you need to subtract the score of the previous step i-1.

Is the best_probs scalar the normalized log prob for that specific token?

Yes.

ItakeLs commented 1 year ago

Apologies for the constant questions; C++ is definitely not my language. I've got something working and I am getting log probs; however, some of them are greater than zero.

I added two new variables to keep track of the topk_scores:

    StorageView logits(dtype, device);
    StorageView alive_seq(topk_ids.dtype());
    // Store the actual token score
    StorageView alive_seq_scores(topk_scores.dtype());
    // Keep track of the previous token score to prevent accumulated token scores
    StorageView alive_seq_scores_prev;
    StorageView alive_attention;

I then append the non-accumulated topk_scores:

// Append last prediction.
append_step_output(alive_seq, topk_ids, &gather_indices);

if (step == 0) {
  // No accumulated scores yet since this is the first step:
  // append the last topk_scores directly.
  append_step_output(alive_seq_scores, topk_scores, &gather_indices);
} else {
  // Convert the storage views into vectors.
  std::vector<float> topk_scores_vec = topk_scores.to_vector<float>();
  std::vector<float> alive_seq_scores_prev_vec = alive_seq_scores_prev.to_vector<float>();
  std::vector<float> actual_topk_scores_vec(topk_scores_vec.size());

  // Compute the adjacent difference to get the actual topk_scores
  // instead of the accumulated scores.
  std::transform(topk_scores_vec.begin(), topk_scores_vec.end(),
                 alive_seq_scores_prev_vec.begin(),
                 actual_topk_scores_vec.begin(),
                 std::minus<float>());

  // Append the last non-accumulated topk_scores.
  StorageView actual_topk_scores(topk_scores.shape(), actual_topk_scores_vec);
  append_step_output(alive_seq_scores, actual_topk_scores, &gather_indices);
}

// Keep track of the previous score so we can compute the adjacent difference.
alive_seq_scores_prev = topk_scores;

I then make sure to add the token scores to result.token_scores:

// Register this hypothesis.
const StorageView& scores = ignore_last_score ? topk_scores_prev : topk_scores;
result.scores.emplace_back(scores.scalar_at<float>({i, k}));
// add the token scores
result.token_scores.emplace_back(build_topk_scores(alive_seq_scores, i, k, ignore_last_token));
result.hypotheses.emplace_back(build_hypothesis(alive_seq, i, k, ignore_last_token));
if (alive_attention)
  result.attention.emplace_back(build_attention(alive_attention, i, k, ignore_last_token));

and I build the topk_scores similarly to the way alive_seq is built:

static std::vector<float> build_topk_scores(const StorageView& history,
                                            const dim_t batch,
                                            const dim_t beam,
                                            const bool ignore_last) {
  const auto length = history.dim(-1) - dim_t(ignore_last);
  const auto* scores = history.index<float>({batch, beam, 0});
  return std::vector<float>(scores, scores + length);
}

Just looking for some clarification on whether I am going in the right direction, since I'm getting log probs but some are greater than zero.

guillaumekln commented 1 year ago

Your approach looks fine overall, but I'm not sure the adjacent difference is correct. I suggest not bothering with the adjacent difference for now and only trying to gather the topk_scores at each step.

ItakeLs commented 1 year ago

Thank you a lot for the help. I think I can figure out the adjacent difference by myself eventually; once I get it working, I would love to contribute this feature.

guillaumekln commented 1 year ago

Let's keep this issue open as this is something we would like to add.

@ItakeLs Are you still planning to contribute this feature? If not, I may work on this for the next version.

ItakeLs commented 1 year ago

I am still planning to contribute this feature, but I haven't yet because I don't have the adjacent difference working properly, so the beam search word-level scores are incorrect (log probs greater than zero), and I did not think the cumulative scores were indicative of the per-token scores.

MrigankRaman commented 1 year ago

Any news on this? I would like to get the highest log prob token during generation.

guillaumekln commented 1 year ago

Greedy search (beam_size=1) already returns the tokens with the highest log prob. Is this what you are looking for?