espnet / espnet_model_zoo

ESPnet Model Zoo
Apache License 2.0

How to get the decoding result scores from #42

Open pengcheng-tech opened 3 years ago

pengcheng-tech commented 3 years ago

Hi,

Thanks for the work. I am trying to use the pre-trained model, but I don't know how to get the decoding score for the corresponding decoding results.

nbests = speech2text(speech)
text, *_ = nbests[0]
print(text)

The code above only prints the recognized text. I would like to get the decoding confidence as well.

I checked the Speech2Text class:

for hyp in nbest_hyps:
    assert isinstance(hyp, Hypothesis), type(hyp)

    # remove sos/eos and get results
    token_int = hyp.yseq[1:-1].tolist()

    # remove blank symbol id, which is assumed to be 0
    token_int = list(filter(lambda x: x != 0, token_int))

    # Change integer-ids to tokens
    token = self.converter.ids2tokens(token_int)

    if self.tokenizer is not None:
        text = self.tokenizer.tokens2text(token)
    else:
        text = None
    results.append((text, token, token_int, hyp))

assert check_return_type(results)
return results

From the code above, I conjecture that the confidence should be obtained from "hyp", but it is not clear to me how to parse "hyp" to get the score.

kamo-naoyuki commented 3 years ago

Hypothesis is a NamedTuple object. You can refer to its attributes.

https://github.com/espnet/espnet/blob/master/espnet/nets/beam_search.py#L19-L33
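
For reference, a minimal sketch of reading those attributes from the returned hypotheses (speech2text and speech are assumed to be set up as in your snippet above):

nbests = speech2text(speech)
# each n-best entry is (text, token, token_int, hyp), where hyp is a Hypothesis
text, token, token_int, hyp = nbests[0]

print(hyp.score)   # total (weighted) log-score of this hypothesis, a torch scalar
print(hyp.scores)  # per-scorer breakdown, e.g. {'decoder': ..., 'ctc': ..., 'lm': ...}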

pengcheng-tech commented 3 years ago

Hi, thanks for your response.

By referring to the link, I modified the code as follows:

nbests = speech2text(speech)
# the last element of each n-best entry is the Hypothesis namedtuple
text, *_, score_bundle = nbests[0]

By executing the following:

print(score_bundle.score)
print(score_bundle.scores)

I got:

tensor(-57.1623, device='cuda:0')
{'decoder': tensor(-2.6879, device='cuda:0'), 'lm': tensor(-55.0374, device='cuda:0'), 'ctc': tensor(-0.8112, device='cuda:0')}

I think the number "-57.1623" is the result of log P_encdec(y|x) + log P_ctc(y|x) + log P_lm(y), where log P_encdec(y|x) is -2.6879, log P_ctc(y|x) is -0.8112, and log P_lm(y) is -55.0374, although the numbers don't quite add up...

If I denote -57.1623 as nbests[0].score, can I just grab nbests[0] through nbests[100] and use nbests[0].score / (nbests[0].score + nbests[1].score + ... + nbests[100].score) to roughly obtain a decoding confidence score?

Thanks a lot

kamo-naoyuki commented 3 years ago

score is the weighted sum of the individual scores. You need to decide the weights when instantiating the Speech2Text class.

You can get an arbitrary number of n-best scores by giving the nbest argument to Speech2Text, but I think it's not trivial to regard them as confidence scores.
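
As an illustration of how the weighted sum comes together (the weights below are hypothetical; they depend on the ctc_weight and lm_weight passed to Speech2Text, with the decoder weight typically being 1 - ctc_weight):

# hypothetical weights: ctc_weight=0.3, lm_weight=1.0
decoder_w, ctc_w, lm_w = 0.7, 0.3, 1.0

total = decoder_w * (-2.6879) + ctc_w * (-0.8112) + lm_w * (-55.0374)
print(total)  # approx. -57.162, which matches the hyp.score reported above

With those particular weights the posted numbers line up almost exactly, which would explain why the plain unweighted sum looked slightly off.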

pengcheng-tech commented 3 years ago

Thanks for the comment.

I currently treat the "score" (i.e., -57.1623) as a rough confidence score indicating how confident the model is that its prediction of the semantic meaning of the audio is correct. From my observation, the score of nbests[0] is higher than that of nbests[1]. I guess it is adequate for my purpose.