ItakeLs opened 1 year ago
Hello, so I have managed to (I think) add scores for each token in greedy search; I still need to do some testing. Now I am trying to do the same for beam search.
In `whisper.h` I changed the result struct to have an extra variable for the token scores:

```cpp
struct WhisperGenerationResult {
  std::vector<std::vector<std::string>> sequences;
  std::vector<std::vector<size_t>> sequences_ids;
  std::vector<float> scores;
  std::vector<float> token_scores;
};
```
In `decoding.cc`, around line 570:

```cpp
// Register this hypothesis.
const StorageView& scores = ignore_last_score ? topk_scores_prev : topk_scores;
result.scores.emplace_back(scores.scalar_at<float>({i, k}));
// my addition
result.token_scores.emplace_back(scores);
result.hypotheses.emplace_back(build_hypothesis(alive_seq, i, k, ignore_last_token));
if (alive_attention)
  result.attention.emplace_back(build_attention(alive_attention, i, k, ignore_last_token));
```
I am trying to understand how the scores are stored and how I can get the score for each token instead of the combined score.
You need to store the `topk_scores` of each step somewhere, similar to how `topk_ids` is stored in `alive_seq`.

Note that the token scores are always increasing because they are accumulated. So you have to compute the adjacent different to get the actual score of a token.
Thanks, that makes sense, I will try that. Also, do you have any resources on where I can learn about computing the "adjacent different"? I am not sure what that means.
For the greedy search I made a couple of changes, but essentially the main change was this:

```cpp
if (word_id != end_id || include_eos_in_scores) {
  if (return_scores) {
    results[batch_id].scores[0] += best_probs.scalar_at<float>({i, 0});
    results[batch_id].token_scores[0].push_back(best_probs.scalar_at<float>({i, 0}));
  }
}
```
Is the `best_probs` scalar the normalized log prob for that specific token?
Sorry, I meant "adjacent difference". It just means that to get the actual score of the token at step `i`, you need to subtract the score of the previous step `i-1`.
> Is the `best_probs` scalar the normalized log prob for that specific token?
Yes.
Apologies for the constant questions, C++ is definitely not my language. I've got something working and I am getting log probs; however, some of the log probs are greater than zero.
I added two new variables to keep track of the `topk_scores`:

```cpp
StorageView logits(dtype, device);
StorageView alive_seq(topk_ids.dtype());
// Store the actual token score
StorageView alive_seq_scores(topk_scores.dtype());
// Keep track of the previous token score to prevent accumulated token scores
StorageView alive_seq_scores_prev;
StorageView alive_attention;
```
I then append the non-accumulated `topk_scores`:

```cpp
// Append last prediction.
append_step_output(alive_seq, topk_ids, &gather_indices);
if (step == 0) {
  // No accumulated scores since it's the first step.
  // Append last topk_scores.
  append_step_output(alive_seq_scores, topk_scores, &gather_indices);
} else {
  // Convert the storage views into vectors.
  std::vector<float> topk_scores_vec = topk_scores.to_vector<float>();
  std::vector<float> alive_seq_scores_prev_vec = alive_seq_scores_prev.to_vector<float>();
  std::vector<float> actual_topk_scores_vec(topk_scores_vec.size());
  // Calculate the adjacent difference to get the actual topk_scores instead of accumulated scores.
  std::transform(topk_scores_vec.begin(), topk_scores_vec.end(),
                 alive_seq_scores_prev_vec.begin(),
                 actual_topk_scores_vec.begin(), std::minus<float>());
  // Append last non-accumulated topk_scores.
  StorageView actual_topk_scores(topk_scores.shape(), actual_topk_scores_vec);
  append_step_output(alive_seq_scores, actual_topk_scores, &gather_indices);
}
// Keep track of the previous score so we can calculate the adjacent difference.
alive_seq_scores_prev = topk_scores;
```
I then make sure to add the token scores into `result.token_scores`:

```cpp
// Register this hypothesis.
const StorageView& scores = ignore_last_score ? topk_scores_prev : topk_scores;
result.scores.emplace_back(scores.scalar_at<float>({i, k}));
// Add the token scores.
result.token_scores.emplace_back(build_topk_scores(alive_seq_scores, i, k, ignore_last_token));
result.hypotheses.emplace_back(build_hypothesis(alive_seq, i, k, ignore_last_token));
if (alive_attention)
  result.attention.emplace_back(build_attention(alive_attention, i, k, ignore_last_token));
```
and I build the `topk_scores` similar to the way the `alive_seq` was built:

```cpp
static std::vector<float> build_topk_scores(const StorageView& history,
                                            const dim_t batch,
                                            const dim_t beam,
                                            const bool ignore_last) {
  const auto length = history.dim(-1) - dim_t(ignore_last);
  const auto* ids = history.index<float>({batch, beam, 0});
  return std::vector<float>(ids, ids + length);
}
```
Just want some clarification if you think I am going in the right direction, since I'm getting log probs but some are also greater than zero.
Your approach looks fine overall, but I'm not sure the adjacent difference is correct. I suggest not bothering with the adjacent difference for now and only trying to gather the `topk_scores` at each step.
Thank you a lot for the help. I think I can figure out the adjacent difference by myself eventually; once I get it working, I would love to contribute by adding this feature.
Let's keep this issue open as this is something we would like to add.
@ItakeLs Are you still planning to contribute this feature? If not, I may work on this for the next version.
I am still planning to contribute this feature, but I haven't yet because I don't have the proper adjacent difference, so the beam search word-level scores are incorrect (log probs greater than zero), and I did not think the cumulative token scores were indicative of the actual token scores.
Any news on this? I would like to get the highest logprob token during the generation.
Greedy search (`beam_size=1`) already returns the tokens with the highest log prob. Is this what you are looking for?
Hi, currently the decoder produces sentence-level scores. Instead of just outputting the average, another option would be to produce the score of each word/token.

Beam search might make it harder to calculate a meaningful individual score, but for greedy search I am assuming it's just a matter of not averaging the scores together; please correct me if I'm wrong.