alumae / kaldi-gstreamer-server

Real-time full-duplex speech recognition server, based on the Kaldi toolkit and the GStreamer framework.
BSD 2-Clause "Simplified" License

How can this be modified to get utterance time, word order, and word start/stop times? #95

Closed mike-a-ellis closed 7 years ago

mike-a-ellis commented 7 years ago

How can this be modified to get utterance time, word order, and word start/stop times?

Something similar in scope to what Dragon might give you:

<?xml version="1.0" encoding="windows-1252"?>
<!DOCTYPE BODY SYSTEM "http://www.nuance.com/naturallyspeaking/dss/dtd/dss-idxv2.dtd">

<BODY>
  <RENDERING>
    <TRN Type="IC">
      <INP>
        <REF WRD_Id="1">at</REF>
      </INP>
      <OUT>At</OUT>
    </TRN>
    <SPC> </SPC><REF WRD_Id="2">the</REF><SPC> </SPC><REF WRD_Id="3">register</REF>
    <TRN Type="UNK">
      <INP>
        <REF WRD_Id="4">,\guessed</REF>
      </INP>
      <OUT>,</OUT>
    </TRN>
    <SPC> </SPC><REF WRD_Id="5">but</REF><SPC> </SPC><REF WRD_Id="6">there</REF><SPC> </SPC><REF WRD_Id="7">were</REF>
    <SPC> </SPC><REF WRD_Id="8">no</REF><SPC> </SPC><REF WRD_Id="9">criminal</REF><SPC> </SPC>
    <REF WRD_Id="10">penalties</REF><SPC> </SPC><REF WRD_Id="11">at</REF><SPC> </SPC><REF WRD_Id="12">that</REF>
    <SPC> </SPC><REF WRD_Id="13">point</REF><SPC> </SPC><REF WRD_Id="14">was</REF><SPC> </SPC>
.......
    <UTT Id="16" Start="53.570" End="58.880">
      <WRD Id="127" Start="53.570" End="54.130" C="893" SRT="had" />
      <WRD Id="128" Start="54.130" End="54.410" C="885" SRT="been" />
      <WRD Id="129" Start="54.410" End="55.330" C="918" SRT="convicted" />
      <WRD Id="130" Start="55.330" End="55.450" C="848" SRT="of" />
      <WRD Id="131" Start="55.450" End="56.230" C="784" SRT="expectancies" />
      <WRD Id="132" Start="56.230" End="56.710" C="823" SRT="at" />
      <WRD Id="133" Start="56.710" End="57.070" C="908" SRT="four" />
      <WRD Id="134" Start="57.070" End="57.970" C="955" SRT="nineteen" />
      <WRD Id="135" Start="57.970" End="58.290" C="951" SRT="ninety" />
      <WRD Id="136" Start="58.290" End="58.880" C="951" SRT="six" />
    </UTT>

I understand this has not been implemented and would be interested in doing so, but I need a bit of guidance, such as which classes to research.

I have been able to debug it, so I feel like I have a shot. The challenge with Kaldi is that the learning curve is a brick wall, so I am not sure where to focus.

alumae commented 7 years ago

This is already supported. You have to set the property word-boundary-file (it is commented out in https://github.com/alumae/kaldi-gstreamer-server/blob/master/sample_english_nnet2.yaml). Then the JSON encoding of the final result will include the start time and length of each word, something like:

{
   "status":0,
   "segment-start":0.0,
   "segment-length":6.12,
   "total-length":6.12,
   "result":{
      "hypotheses":[
         {
            "transcript":"one two three four five six seven eight.",
            "confidence":10000000000.0,
            "likelihood":153.665,
            "word-alignment":[
               {
                  "start":1.18,
                  "length":0.43,
                  "word":"one",
                  "confidence":1.0
               },
               {
                  "start":1.65,
                  "length":0.29,
                  "word":"two",
                  "confidence":0.989745
               },
               {
                  "start":1.97,
                  "length":0.4,
                  "word":"three",
                  "confidence":1.0
               },
               {
                  "start":2.37,
                  "length":0.53,
                  "word":"four",
                  "confidence":1.0
               },
               {
                  "start":2.9,
                  "length":0.39,
                  "word":"five",
                  "confidence":1.0
               },
               {
                  "start":3.29,
                  "length":0.4,
                  "word":"six",
                  "confidence":1.0
               },
               {
                  "start":3.69,
                  "length":0.43,
                  "word":"seven",
                  "confidence":1.0
               },
               {
                  "start":4.28,
                  "length":0.24,
                  "word":"eight",
                  "confidence":0.991182
               }
            ]
         }
      ],
      "final":true
   },
   "segment":0,
   "id":"e887d790-b321-47ae-ae7b-13276b1b3fcd"
}
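
The Dragon-style fields asked about above (word order, start and stop time) can be derived from this JSON. A minimal Python sketch follows; the function name `word_timings` is hypothetical, and it assumes each word's "start" is measured from the beginning of the segment, so "segment-start" is added to get a time within the whole recording. In the sample above "segment-start" is 0.0, so that assumption makes no difference there.

```python
import json

def word_timings(message, first_id=1):
    """Turn one final result message into rows with word order and start/end times.

    Assumption (not confirmed in this thread): each word's "start" is relative
    to the segment, so "segment-start" is added as an offset.
    """
    offset = message.get("segment-start", 0.0)
    best = message["result"]["hypotheses"][0]  # first hypothesis in the list
    rows = []
    for word_id, w in enumerate(best.get("word-alignment", []), start=first_id):
        start = offset + w["start"]
        rows.append({
            "id": word_id,                          # word order
            "word": w["word"],
            "start": round(start, 3),
            "end": round(start + w["length"], 3),   # stop time = start + length
            "confidence": w["confidence"],
        })
    return rows

# Trimmed copy of the sample message above (first two words only).
sample = json.loads("""
{"status": 0, "segment-start": 0.0, "segment-length": 6.12,
 "result": {"final": true, "hypotheses": [{
   "transcript": "one two",
   "word-alignment": [
     {"start": 1.18, "length": 0.43, "word": "one", "confidence": 1.0},
     {"start": 1.65, "length": 0.29, "word": "two", "confidence": 0.989745}
   ]}]}}
""")

for row in word_timings(sample):
    print(row["id"], row["word"], row["start"], row["end"], row["confidence"])
```

The utterance start and end correspond to "segment-start" and "segment-start" plus "segment-length" in the same message.
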
mike-a-ellis commented 7 years ago

I am unable to get this to work. I am using /client/dynamic/recognize

Is there anything I need to do besides uncommenting "word-boundary-file"?

alumae commented 7 years ago

The extended results are only available through the websocket-based interface.
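
A rough Python sketch of a websocket client is given below. It is not the project's own client. It assumes the third-party `websocket-client` package, and it assumes the default URL and the "EOS" end-of-stream marker as described in the project README; the helper names (`iter_chunks`, `final_hypothesis`, `recognize`) and the chunk size are made up for illustration. It sends the whole file before reading any replies, which is only reasonable for short recordings.

```python
import json

# Assumed default from the project README; adjust host, port and path as needed.
DEFAULT_URL = "ws://localhost:8888/client/ws/speech"

def iter_chunks(data, size):
    """Yield successive slices of `data`, each at most `size` bytes long."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

def final_hypothesis(raw_message):
    """Return the first hypothesis of a successful final result, else None."""
    msg = json.loads(raw_message)
    if msg.get("status") != 0:
        return None
    result = msg.get("result")
    if not result or not result.get("final"):
        return None  # partial result or a message without a result
    return result["hypotheses"][0]

def recognize(path, url=DEFAULT_URL, chunk_size=8000):
    """Send an audio file over the websocket and collect final hypotheses."""
    # Third-party dependency (pip install websocket-client), imported here so
    # the helpers above can be used without it.
    import websocket

    with open(path, "rb") as f:
        audio = f.read()

    ws = websocket.create_connection(url)
    finals = []
    try:
        for chunk in iter_chunks(audio, chunk_size):
            ws.send_binary(chunk)
        ws.send("EOS")  # end-of-stream marker, as described in the README
        while True:
            try:
                raw = ws.recv()
            except websocket.WebSocketConnectionClosedException:
                break
            if not raw:  # empty payload: the server closed the connection
                break
            hyp = final_hypothesis(raw)
            if hyp is not None:
                finals.append(hyp)
    finally:
        ws.close()
    return finals
```

With word-boundary-file set on the server, each returned hypothesis should carry the "word-alignment" list shown earlier in the thread, for example via `recognize("test.wav")[0]["word-alignment"]`, where `test.wav` is a placeholder file name.
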

mike-a-ellis commented 7 years ago

Can you offer some advice on how to implement this? Any advice would be appreciated!

On Thu, Oct 5, 2017 at 2:32 AM, Tanel Alumäe notifications@github.com wrote:

The extended results are only available through the websocket-based interface.
