Hi, if I understand correctly, the model architecture is able to do multi-modal speech recognition in combination with emotion detection, right? If so, are the logits_ctc I get when predicting on a new audio file responsible for the ASR? And if so, how can I decode them to get actual text back?
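For context, here is a rough sketch of what I imagine the decoding step looks like, assuming logits_ctc has shape (batch, time, vocab_size) and the vocabulary matches a wav2vec2-style CTC tokenizer; the checkpoint name and the random tensor below are just placeholders, not something from this repo:

```python
import torch
from transformers import Wav2Vec2Processor

# Placeholder processor/tokenizer; in practice this would have to match
# the vocabulary the model's CTC head was trained with.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

# Stand-in for the model's output; in practice this would be the
# logits_ctc tensor returned when predicting on a new audio file.
logits_ctc = torch.randn(1, 100, processor.tokenizer.vocab_size)

# Greedy CTC decoding: take the most likely token per frame, then let the
# tokenizer collapse repeats and drop the blank token.
predicted_ids = torch.argmax(logits_ctc, dim=-1)
transcriptions = processor.batch_decode(predicted_ids)
print(transcriptions)
```

Is that roughly the right approach, or does the model expect a different vocabulary or decoding scheme?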
Thank you for your help.