TideDancer / interspeech21_emotion


How to use the model to do ASR? #14

Closed padmalcom closed 1 year ago

padmalcom commented 1 year ago

Hi, if I understand correctly, the model architecture does speech recognition and emotion detection jointly as a multi-task setup, right? If so, are the logits_ctc I get when predicting on a new audio file the output of the ASR head? And if so, how can I decode them to get an actual transcription? Thank you for your help.

padmalcom commented 1 year ago

I solved this one.

import numpy as np

# The model returns two heads: CTC logits for ASR, classification logits for emotion.
logits_ctc, logits_cls = predictions
# Greedy CTC decoding: take the most likely token per frame, then let the processor collapse repeats and blanks.
pred_ids_ctc = np.argmax(logits_ctc, axis=-1)
pred_str = processor.batch_decode(pred_ids_ctc, output_word_offsets=True)
print("pred text:", pred_str)
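In case it helps others, here is a minimal end-to-end sketch going from a wav file to both a transcript and an emotion label. It assumes, as in the snippet above, that the model's forward pass returns a pair (logits_ctc, logits_cls) and that a standard Wav2Vec2Processor is used; the function name transcribe_and_classify and the use of torchaudio for loading and resampling are just illustrative, not part of this repo.

```python
import torch
import torchaudio
from transformers import Wav2Vec2Processor


def transcribe_and_classify(model, processor: Wav2Vec2Processor, wav_path: str):
    """Run the multi-task model on a single (mono) wav file.

    Assumes `model` is the wav2vec2-based model trained with this repo and
    that its forward pass returns (logits_ctc, logits_cls).
    """
    # Load the audio and resample to 16 kHz, which wav2vec 2.0 expects.
    speech, sr = torchaudio.load(wav_path)
    speech = torchaudio.functional.resample(speech, sr, 16000).squeeze(0)

    inputs = processor(speech.numpy(), sampling_rate=16000, return_tensors="pt")

    with torch.no_grad():
        logits_ctc, logits_cls = model(inputs.input_values)

    # ASR head: greedy CTC decode, then collapse repeats/blanks via the processor.
    pred_ids_ctc = torch.argmax(logits_ctc, dim=-1)
    transcript = processor.batch_decode(pred_ids_ctc)[0]

    # Emotion head: most likely class index.
    emotion_id = int(torch.argmax(logits_cls, dim=-1))

    return transcript, emotion_id
```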