cmusphinx / sphinx4

Pure Java speech recognition library
cmusphinx.sourceforge.net

SpeakerIdentificationDemo.java does not print out any hypotheses #31

Open yuraneb opened 9 years ago

yuraneb commented 9 years ago

Hello Nickolay, thank you for fixing the issue pertaining to the "singular matrix" exception. I can verify that it's no longer a problem: the speaker segments are now calculated properly and dumped to the console with printSpeakerIntervals().

However, my concern now is that speakerAdaptiveDecoding() never seems to print any hypotheses. A quick look at the demo code tells me it should iterate over the speaker segments and run the (adapted) recognition on each one. I can definitely see it iterate many times (loading the language, acoustic, and other models) and run the speedTracker, but I never see any hypothesis output. In addition, the speedTracker reports the transcription time as "0.00 X realtime", which leads me to believe the recognition never actually runs after the models are loaded.

I'm running the demo "as is", on the in-package /edu/cmu/sphinx/demo/speakerid/test.wav file, but I'd be happy to try it on some of my own media if you think that would help. I've been using Sphinx4 for over half a year now, and under the right conditions it works pretty well; I was hoping to get a bit of an accuracy boost from the speaker adaptation.
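For reference, here is roughly how I read the first part of the demo's flow, up to the point where I'd expect decoding to start. This is a sketch from memory, not the demo source verbatim; the Segment accessor names and the interval units are my assumptions, so please correct me if they don't match:

```java
import java.io.InputStream;
import java.util.ArrayList;

import edu.cmu.sphinx.speakerid.Segment;
import edu.cmu.sphinx.speakerid.SpeakerCluster;
import edu.cmu.sphinx.speakerid.SpeakerIdentification;

public class SpeakerIntervalsSketch {
    public static void main(String[] args) throws Exception {
        // Cluster the packaged recording into per-speaker intervals; the demo
        // does this before running the adapted recognition over each cluster.
        SpeakerIdentification sd = new SpeakerIdentification();
        InputStream stream = SpeakerIntervalsSketch.class
                .getResourceAsStream("/edu/cmu/sphinx/demo/speakerid/test.wav");
        ArrayList<SpeakerCluster> clusters = sd.cluster(stream);

        // Dump the intervals, which is roughly what printSpeakerIntervals() shows.
        // I'm assuming getSpeakerIntervals()/getStartTime()/getLength() here.
        int speaker = 0;
        for (SpeakerCluster cluster : clusters) {
            speaker++;
            for (Segment segment : cluster.getSpeakerIntervals()) {
                System.out.println("speaker" + speaker
                        + " start=" + segment.getStartTime()
                        + " length=" + segment.getLength());
            }
        }
    }
}
```

This part works for me; it's the subsequent per-speaker decoding that never seems to produce output.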

nshmyrev commented 9 years ago

It does print a hypothesis for the speaker; to suppress the log output, redirect stderr to /dev/null:

java -cp ... 2> /dev/null

Then you will see the hypothesis output.
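If redirecting stderr is inconvenient, you can get the same effect from code, since Sphinx4 logs through java.util.logging. A minimal sketch (the wrapper class is hypothetical; only standard JDK logging calls are used, and I'm assuming the demo's main runs on the packaged test.wav when invoked without arguments):

```java
import java.util.logging.Level;
import java.util.logging.LogManager;
import java.util.logging.Logger;

public class QuietSpeakerIdDemo {
    public static void main(String[] args) throws Exception {
        // Drop the default console handler and silence the root logger so the
        // recognizer's log records are discarded; hypotheses printed on
        // System.out are unaffected.
        LogManager.getLogManager().reset();
        Logger.getLogger("").setLevel(Level.OFF);

        // Then run the demo as usual.
        edu.cmu.sphinx.demo.speakerid.SpeakerIdentificationDemo.main(args);
    }
}
```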

The second speaker is silence, so the result for it is empty. I haven't decided what to do about that yet; maybe we will filter silence out of the speaker intervals first.
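In the meantime, a simple workaround on the application side is to skip any interval whose hypothesis comes back empty. A sketch using the plain high-level recognizer rather than the adapted decoding from the demo (the model resource paths shown are the standard en-us ones and may differ in your setup):

```java
import java.io.FileInputStream;

import edu.cmu.sphinx.api.Configuration;
import edu.cmu.sphinx.api.SpeechResult;
import edu.cmu.sphinx.api.StreamSpeechRecognizer;

public class SkipEmptyHypotheses {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();
        configuration.setAcousticModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us");
        configuration.setDictionaryPath("resource:/edu/cmu/sphinx/models/en-us/cmudict-en-us.dict");
        configuration.setLanguageModelPath("resource:/edu/cmu/sphinx/models/en-us/en-us.lm.bin");

        StreamSpeechRecognizer recognizer = new StreamSpeechRecognizer(configuration);
        recognizer.startRecognition(new FileInputStream(args[0]));

        SpeechResult result;
        while ((result = recognizer.getResult()) != null) {
            String hypothesis = result.getHypothesis();
            // A silence-only "speaker" decodes to an empty hypothesis; skip it.
            if (hypothesis == null || hypothesis.trim().isEmpty()) {
                continue;
            }
            System.out.println(hypothesis);
        }
        recognizer.stopRecognition();
    }
}
```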

yuraneb commented 9 years ago

Thanks for the response, Nickolay, you're absolutely right. That printout got lost in the logging output for me. A quick question about the silence issue, given the current state of things: is there a way to predict which speaker will correspond to the silence intervals (aside from those intervals producing no transcription results)? To be more specific, let's say I have 4 participants in a conversation that I transcribe and Sphinx identifies 5 speakers (each with several speaking segments). I would assume the speakers are created in "order of appearance", is that correct? So if the conversation begins with Speaker1 talking, followed by a short interval of silence, the silence would effectively become Speaker2.

Also, based on your experience, how likely is it for a single actual speaker to be labeled as multiple speakers (due to varying background noise and channel conditions)? That is, if I'm the only one talking, is Sphinx likely to erroneously identify several speakers when there is significant variation in the background noise? I apologize if this doesn't belong in the issues; if so, please tell me where to move this question/discussion.

Edit: one more quick question while I'm at it.

I realize that the speaker identification/diarization in Sphinx is done mainly so that one can run speaker adaptation and thus hopefully get somewhat better accuracy from the acoustic model. Having said that, how do you feel Sphinx's speaker segmentation compares with a dedicated speaker diarization package like LIUM? If I wanted to integrate this feature, would I be able to use Sphinx's SpeakerIdentification exclusively, or do you think there would be some benefit to using LIUM's tools (LIUM_SpkDiarization)?

nshmyrev commented 9 years ago

I would assume the speakers are created in "order of appearance", is that correct? So if the conversation begins with Speaker1 talking, followed by a short interval of silence, the silence would effectively become Speaker2.

Yes

Is there a way to predict which speaker will correspond to the silence intervals

Currently we don't have that; more advanced toolkits like LIUM diarization have specific tools and steps to detect silence.

Also, based on your experience, how likely is it for a single actual speaker to be labeled as multiple speakers (due to varying background noise and channel conditions)?

Yes, this happens pretty frequently; our approach is not the best one around. For adaptation it's enough to have about 20 seconds of speech to improve accuracy, so such misclassification is not really harmful on long recordings.

If I wanted to integrate this feature, would I be able to use Sphinx's SpeakerIdentification exclusively, or do you think there would be some benefit to using LIUM's tools (LIUM_SpkDiarization)?

The LIUM tools are certainly more advanced, but they are not easy to use and have algorithmic flaws of their own. That's why we started our own speakerid component, but it is very far from complete.

yuraneb commented 9 years ago

Thank you for the detailed follow-up, Nickolay. In the future, if I have more "open-ended" questions about Sphinx, where is the most appropriate place to start such threads?

nshmyrev commented 9 years ago

You are welcome to post a message to our forums on SourceForge, create an issue here, join the IRC channel #cmusphinx on freenode, or contact me directly at nshmyrev@gmail.com.