alphacep / vosk-api

Offline speech recognition API for Android, iOS, Raspberry Pi and servers with Python, Java, C# and Node
Apache License 2.0
7.73k stars 1.08k forks

Recognizing emphasis, prosody, like reverse SSML #957

Closed PeturDarriPeturs closed 2 years ago

PeturDarriPeturs commented 2 years ago

This is more of a question than a request: how difficult would it be to be able to recognize where the speaker is putting emphasis or the rate, pitch and volume of each word or phoneme, a bit like reverse SSML?

This would be very useful to help discern the intent behind the speaker's sentence. Examples, where the emphasized word is highlighted:

Another example is being able to identify if a sentence sounds like a question or a statement.


Is this something that is being researched in academia? Does it have a term? I can't find anything about this.

nshmyrev commented 2 years ago

Yes, there is some research about it from the Alexa team and from Apple too. Like https://www.amazon.science/publications/streaming-reslstm-with-causal-mean-aggregation-for-device-directed-utterance-detection

solyarisoftware commented 2 years ago

Hi
Premising that I think what you are looking for (which is very interesting to me too) is out of scope for Vosk's ASR functionality.

Vosk is a speech-to-text engine in the traditional literal ASR sense: a spoken sentence is translated into the equivalent text, purging all the "non-verbal" parts of the audio speech (what I call "metadata").

I don't know the exact academic term for what you are looking for, but it falls within the linguistics realm of prosody.

Another example is being able to identify if a sentence sounds like a question or a statement.

This seems to me a prosody/intonation classifier. BTW, this is an underrated field of ASR research, IMHO.

A possible solution for simplified tone detection is to couple a standard ASR (like the great Vosk) with an independent tone classifier, built with some usual ML approach.

See:
https://en.wikipedia.org/wiki/Prosody_(linguistics)
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2600436/pdf/nihms49798.pdf
https://journals.uic.edu/ojs/index.php/dad/article/view/11392/10640
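As a minimal sketch of what such an independent tone classifier could look like (a toy heuristic, not a real ML model: it estimates a per-frame pitch track via autocorrelation and classifies by the slope of the terminal pitch contour, where a rising tail suggests a question and a falling one a statement):

```python
# Toy prosody/intonation classifier to pair with an ASR like Vosk.
# Pure NumPy: estimate f0 per frame by autocorrelation, then fit a
# line to the second half of the pitch track and look at its slope.
import numpy as np

SR = 16000  # sample rate, Hz

def frame_f0(frame, sr=SR, fmin=75, fmax=400):
    """Estimate f0 of one frame by picking the autocorrelation peak."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

def pitch_track(audio, sr=SR, frame_len=1024, hop=512):
    return np.array([frame_f0(audio[i:i + frame_len], sr)
                     for i in range(0, len(audio) - frame_len, hop)])

def classify_tone(audio, sr=SR):
    """'question' if the terminal pitch rises, 'statement' if it falls."""
    f0 = pitch_track(audio, sr)
    tail = f0[len(f0) // 2:]  # only the second half of the utterance
    slope = np.polyfit(np.arange(len(tail)), tail, 1)[0]
    return "question" if slope > 0 else "statement"

def synth(f0_start, f0_end, dur=1.0, sr=SR):
    """Synthesize a tone whose pitch glides from f0_start to f0_end Hz."""
    t = np.arange(int(dur * sr)) / sr
    f0 = np.linspace(f0_start, f0_end, len(t))
    return np.sin(2 * np.pi * np.cumsum(f0) / sr)

print(classify_tone(synth(120, 220)))  # rising contour -> question
print(classify_tone(synth(220, 120)))  # falling contour -> statement
```

Real speech would of course need voiced/unvoiced detection and a trained classifier rather than a single slope threshold, but the pipeline shape (audio in parallel to the ASR, prosodic label out) stays the same.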

PeturDarriPeturs commented 2 years ago

@nshmyrev

Yes, there is some research about it from the Alexa team and from Apple too. Like https://www.amazon.science/publications/streaming-reslstm-with-causal-mean-aggregation-for-device-directed-utterance-detection

This does address my "Hey Siri" example, but this is a binary classifier for that specific use-case, whereas I'm thinking of a more generalized solution that can detect some of the original prosody/tone of the speech to help with understanding.

@solyarisoftware

Premising that I think what you are looking for (which is very interesting to me too) is out of scope for Vosk's ASR functionality.

I definitely agree, but this seemed like a good place to ask.

I don't know the exact academic term for what you are looking for, but it falls within the linguistics realm of prosody.

I didn't think to use that term in my search. I'm able to find some papers about this topic now, thank you!

For context, I am trying to improve the natural language understanding in a virtual character that the user can speak to. It seems most voice assistants simply convert the speech to text, losing a lot of this "metadata" like you mentioned, and are therefore at a disadvantage when trying to understand the intent.

My idea was that a traditional text-based NLU, like Rasa NLU, could be trained with text that has SSML-like tags injected into it to represent the prosody, but that would require the speech recognition to generate those tags.
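A small sketch of that idea: take ASR word results together with hypothetical per-word prosody measurements and emit SSML-like tags that a text NLU could be trained on. The field names (`energy`, `pitch`), the tag names, and the relative-energy threshold are all illustrative assumptions, not any real ASR API.

```python
# Sketch: inject SSML-like tags into a transcript using per-word
# prosody metadata. All field names and thresholds are hypothetical.

def tag_transcript(words, emphasis_ratio=1.5):
    """words: list of dicts like {"word": str, "energy": float, "pitch": float}.
    A word whose energy exceeds `emphasis_ratio` times the utterance mean
    gets wrapped in an <emphasis> tag; a terminal pitch rise wraps the
    whole utterance in a <rising> tag (question-like intonation)."""
    mean_energy = sum(w["energy"] for w in words) / len(words)
    out = []
    for w in words:
        if w["energy"] > emphasis_ratio * mean_energy:
            out.append(f'<emphasis>{w["word"]}</emphasis>')
        else:
            out.append(w["word"])
    text = " ".join(out)
    if words[-1]["pitch"] > words[0]["pitch"]:
        return f"<rising>{text}</rising>"
    return text

words = [
    {"word": "I",     "energy": 0.2, "pitch": 110},
    {"word": "never", "energy": 0.9, "pitch": 115},
    {"word": "said",  "energy": 0.2, "pitch": 118},
    {"word": "that",  "energy": 0.2, "pitch": 150},
]
print(tag_transcript(words))
# -> <rising>I <emphasis>never</emphasis> said that</rising>
```

The tagged text could then be fed to the NLU as training examples, so that "I <emphasis>never</emphasis> said that" and "I never said <emphasis>that</emphasis>" can map to different intents.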

solyarisoftware commented 2 years ago

My idea was that a traditional text-based NLU, like Rasa NLU, could be trained with text that has SSML-like tags injected into it to represent the prosody, but that would require the speech recognition to generate those tags.

For example, in my language (Italian), a common ASR problem (which Vosk doesn't solve) is distinguishing a statement utterance from a question utterance.

statement utterance (you are asking me...):

Me lo stai chiedendo! ...

from "question" utterance (are you asking me ...?):

Me lo stai chiedendo?

Yes. As far as I remember, the Rasa intent/entity classifier ("NLU") can accept optional metadata (~= tags, as you said) associated with utterance examples. And yes, this metadata must be generated in advance, at the ASR level.
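As a sketch of how that could look in Rasa's YAML training data (Rasa does support a `metadata` key on intents, but the `prosody` key and its values here are an illustrative assumption, and the exact schema depends on the Rasa version):

```yaml
nlu:
- intent: confirm_question
  metadata:
    prosody: rising        # illustrative custom key, produced by the ASR side
  examples: |
    - Me lo stai chiedendo?
- intent: confirm_statement
  metadata:
    prosody: falling
  examples: |
    - Me lo stai chiedendo!
```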

nshmyrev commented 2 years ago

My idea was that a traditional text-based NLU, like Rasa NLU, could be trained with text that has SSML-like tags injected into it to represent the prosody, but that would require the speech recognition to generate those tags.

While engineering such tags is hard, there are so-called end-to-end spoken language understanding systems, which have also gained some interest recently. With end-to-end training they are able to pick up intonation aspects:

https://arxiv.org/abs/2102.06283

abusaadp commented 2 weeks ago

Any update on this?