Open bernardohenz opened 3 years ago
Thanks for opening! Did you also make parallel changes to the PathTrie to go with this scoring change here? Could you share them as well so we can have the same starting point?
I experimented with some changes, but as soon as I changed the scorer I undid the changes to PathTrie.
If I am not mistaken, I just changed it to return a path even when the character is not found in the dictionary. This code is inside get_path_trie:
```cpp
if (has_dictionary_) {
  matcher_->SetState(dictionary_state_);
  bool found = matcher_->Find(new_char + 1);

  PathTrie* new_path = new PathTrie;
  new_path->character = new_char;
  new_path->timestep = new_timestep;
  new_path->parent = this;
  new_path->dictionary_ = dictionary_;
  new_path->has_dictionary_ = true;
  new_path->matcher_ = matcher_;
  new_path->log_prob_c = cur_log_prob_c;

  // set spell checker state
  // check to see if next state is final
  auto FSTZERO = fst::TropicalWeight::Zero();
  auto final_weight = dictionary_->Final(dictionary_state_);
  if (found)
    final_weight = dictionary_->Final(matcher_->Value().nextstate);
  bool is_final = (final_weight != FSTZERO);

  if ((is_final && reset) || (!found)) {
    // restart spell checker at the start state
    new_path->dictionary_state_ = dictionary_->Start();
  } else {
    // go to next state
    new_path->dictionary_state_ = matcher_->Value().nextstate;
  }

  children_.push_back(std::make_pair(new_char, new_path));
  return new_path;
} else { .....
```
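To make the behavioral change concrete: in the stock PathTrie, a character that is not found in the dictionary FST prunes the extension, whereas the snippet above keeps the path and restarts the dictionary matcher at its start state. Here is a minimal, self-contained sketch of that logic, with a plain word set standing in for the FST (ToyDict, extend_state, and the sample words are all made up for illustration):

```cpp
#include <set>
#include <string>

// Toy stand-in for the dictionary FST: a set of whole words.
// The matcher "state" is simply the prefix matched so far.
struct ToyDict {
  std::set<std::string> words{"ola", "mundo"};
  bool has_prefix(const std::string& p) const {
    for (const auto& w : words)
      if (w.compare(0, p.size(), p) == 0) return true;
    return false;
  }
  bool is_word(const std::string& p) const { return words.count(p) > 0; }
};

// Mimics the modified get_path_trie: when the extended prefix is not
// found, the stock code would reject the extension (prune the beam);
// here we always return a valid state, restarting at the start state
// (the empty prefix) when the word ended or the character was not found.
std::string extend_state(const ToyDict& d, const std::string& state,
                         char c, bool reset = true) {
  std::string next = state + c;
  bool found = d.has_prefix(next);
  bool is_final = d.is_word(next);
  if ((is_final && reset) || !found)
    return "";  // restart spell checker at the start state
  return next;  // go to next state
}
```

Because extend_state never signals rejection, an OOV character simply resets the matcher instead of killing the hypothesis, which is the behavior the modified PathTrie is after.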
Hi,
My team and I use STT for Brazilian Portuguese, and we were having problems with consecutive OOV (out-of-vocabulary) words: after receiving two or more OOV words, the decoder enters a state in which it stops accepting any further word.
After some experimentation, I took out the return of OOV_SCORE (in https://github.com/coqui-ai/STT/blob/main/native_client/ctcdecode/scorer.cpp#L247), adding a penalization together with the BaseScore as follows:

I believe there could be a better solution, so I am opening this issue to discuss one.
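The actual patch was not pasted here, but the idea — returning a penalized base score instead of the large negative OOV_SCORE constant — can be sketched with a toy unigram table (kLogProb, kOovPenalty, and the values below are hypothetical placeholders, not the real scorer API):

```cpp
#include <algorithm>
#include <map>
#include <string>

// Toy unigram LM: log10 probabilities for a few in-vocabulary words.
const std::map<std::string, double> kLogProb{{"ola", -1.0}, {"mundo", -1.5}};

// Hypothetical fixed penalty; a real value would need tuning.
const double kOovPenalty = -2.0;

// The stock scorer returns a large negative OOV_SCORE constant for any
// OOV word, which effectively kills beams containing consecutive OOV
// words. This sketch instead returns a base score (the worst known
// in-vocabulary probability) plus a fixed penalty, so OOV words are
// discouraged but the hypothesis survives.
double cond_log_prob(const std::string& word) {
  auto it = kLogProb.find(word);
  if (it != kLogProb.end()) return it->second;
  double base = 0.0;
  for (const auto& kv : kLogProb) base = std::min(base, kv.second);
  return base + kOovPenalty;
}
```

With this shape, two OOV words in a row cost a bounded penalty each rather than an effectively infinite one, so the beam keeps accepting words afterwards.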
As your LM is built over a huge corpus, I suppose your models do not suffer from OOV words, but I believe many people may run into this problem with LMs built over smaller corpora.