Unify the different types of outputs under the same signal

alumae / gst-kaldi-nnet2-online

GStreamer plugin around Kaldi's online neural network decoder

Apache License 2.0

Unify the different types of outputs under the same signal #19

Closed rikrd closed 9 years ago

rikrd commented 9 years ago

Recently there have been a few merge requests (#14 and #18) that add new types of results (by emitting new signals).

It may be worth unifying the different types of results (phone alignments, word transcriptions, ...) in a single signal, since they are all part of the same output. This would spare users of the plugin from having to keep track of which signals belong to the same decoded output.

I don't have any concrete proposal currently, but I would like to start the discussion.

amitbeka commented 9 years ago

I tend to agree with this. I think there should be two signals: one with only the best transcription (final-result), and one with detailed data:

{ 'results': [
    {'transcription': 'hello world', 'likelihood': 0.9, 'alignments': ... }
    ...
  ]
  'processing_time': 5.3,
  'other_general_result_data': ...
}

alumae commented 9 years ago

I agree. Generating JSON is easy, so we don't even need a JSON library. I'll try to merge Amit's n-best PR to a separate branch and implement it there.

alumae commented 9 years ago

Richard, do you really need phone alignments for partial results? I cannot think of a practical use for them. I want to merge the final phone alignments into the full results signal and scrap the partial phone alignments.

rikrd commented 9 years ago

Well, currently there is no practical application, other than having feedback of what the decoding is doing.

I guess, for now it is ok to leave it out.

But if there is no technical impediment to having them in the future, I'll be happy to add them back later, together with a parameter that controls whether we want them or not.

alumae commented 9 years ago

I implemented the JSON results in the branch full-final-result-branch. Contrary to what I wrote before, I decided to use a JSON library (Jansson). That's because otherwise we would have to do the character escaping ourselves.
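The escaping concern can be illustrated with a small sketch. It is written in Python rather than in the plugin's own code, and `naive_json`, `safe_json` and `is_valid_json` are hypothetical helpers made up for the illustration. A transcript containing a quote or a newline breaks hand-built JSON, while a library serializer escapes it:

```python
import json


def naive_json(text):
    # Hand-rolled serialization: quotes, backslashes and control
    # characters in `text` are copied into the output unescaped.
    return '{"text": "' + text + '"}'


def safe_json(text):
    # A JSON library takes care of the escaping.
    return json.dumps({"text": text})


def is_valid_json(payload):
    try:
        json.loads(payload)
        return True
    except json.JSONDecodeError:
        return False


transcript = 'he said "hello"\n'
print(is_valid_json(naive_json(transcript)))  # False
print(is_valid_json(safe_json(transcript)))   # True
```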

Also, got rid of the partial phone alignments signal.

So, the full-final-result signal now outputs something like:

{
  "text": "one two three four five six seven eight",
  "num_frames": 615,
  "likelihood": 149.29747009277,
  "likelihood_per_frame": 0.24276011397199,
  "phone_alignment": "SIL 1.21\nHH_B 0.09\nW_I 0.08\nAH_I 0.09\nN_E 0.14\nSIL 0.04\nT_B 0.13\nUW_E 0.18\nTH_B 0.17\nR_I 0.06\nIY_E 0.18\nF_B 0.2\nAO_I 0.27\nR_E 0.06\nF_B 0.19\nAY_I 0.15\nV_E 0.05\nS_B 0.18\nIH_I 0.06\nK_I 0.12\nS_E 0.05\nS_B 0.1\nEH_I 0.07\nV_I 0.05\nAH_I 0.06\nN_E 0.14\nSIL 0.16\nEY_B 0.2\nT_E 0.04\nSIL 1.63\n",
  "nbest_results": [
    {
      "text": "one two three four five six seven eight",
      "likelihood": 149.29747009277
    },
    {
      "text": "one two three four five six seven eight it",
      "likelihood": 148.27366638184
    },
    {
      "text": "one two three four five six seven eight and",
      "likelihood": 148.16448974609
    },
    {
      "text": "one two three four five six seven eight ten",
      "likelihood": 147.72589111328
    },
    {
      "text": "one two three four five six seven eight two",
      "likelihood": 147.65536499023
    },
    {
      "text": "one two three four five six seven a m",
      "likelihood": 147.52960205078
    },
    {
      "text": "one two three four or five six seven eight",
      "likelihood": 147.37252807617
    },
    {
      "text": "one two three four five six seven a day",
      "likelihood": 147.34791564941
    },
    {
      "text": "one two three four five six seven eight to",
      "likelihood": 147.19779968262
    },
    {
      "text": "one two three four five six seven eight the",
      "likelihood": 146.8786315918
    }
  ]
}

Do you have any suggestions?
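For illustration, a client could consume this output roughly as follows. This is only a sketch in Python: the payload is an abbreviated copy of the example above (two of the ten n-best entries, no phone alignment), and the "margin" between the two best hypotheses is just an example of something a consumer might compute, not something the plugin provides.

```python
import json

# Abbreviated copy of the example output above.
payload = """
{
  "text": "one two three four five six seven eight",
  "num_frames": 615,
  "likelihood": 149.29747009277,
  "nbest_results": [
    {"text": "one two three four five six seven eight",
     "likelihood": 149.29747009277},
    {"text": "one two three four five six seven eight it",
     "likelihood": 148.27366638184}
  ]
}
"""

result = json.loads(payload)

# Sort the hypotheses by likelihood, best first.
ranked = sorted(result["nbest_results"],
                key=lambda h: h["likelihood"], reverse=True)
best, runner_up = ranked[0], ranked[1]

# How far ahead the best hypothesis is of the second best.
margin = best["likelihood"] - runner_up["likelihood"]
print(best["text"])
print(round(margin, 3))
```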

amitbeka commented 9 years ago

looks very good to me, thanks!

On Thu, May 28, 2015 at 4:19 PM, Tanel Alumäe notifications@github.com wrote:

I implemented the JSON results in the branch full-final-result-branch. Contrary to what I wrote before, I decided to use JSON library (Jansson). That's because otherwise we would have to do the character escaping ourselves.

Also, got rid of the partial phone alignments signal.

So, the full-final-result signal now outputs something like:

{ "text": "one two three four five six seven eight", "num_frames": 615, "likelihood": 149.29747009277, "likelihood_per_frame": 0.24276011397199, "phone_alignment": "SIL 1.21\nHH_B 0.09\nW_I 0.08\nAH_I 0.09\nN_E 0.14\nSIL 0.04\nT_B 0.13\nUW_E 0.18\nTH_B 0.17\nR_I 0.06\nIY_E 0.18\nF_B 0.2\nAO_I 0.27\nR_E 0.06\nF_B 0.19\nAY_I 0.15\nV_E 0.05\nS_B 0.18\nIH_I 0.06\nK_I 0.12\nS_E 0.05\nS_B 0.1\nEH_I 0.07\nV_I 0.05\nAH_I 0.06\nN_E 0.14\nSIL 0.16\nEY_B 0.2\nT_E 0.04\nSIL 1.63\n", "nbest_results": [ { "text": "one two three four five six seven eight", "likelihood": 149.29747009277 }, { "text": "one two three four five six seven eight it", "likelihood": 148.27366638184 }, { "text": "one two three four five six seven eight and", "likelihood": 148.16448974609 }, { "text": "one two three four five six seven eight ten", "likelihood": 147.72589111328 }, { "text": "one two three four five six seven eight two", "likelihood": 147.65536499023 }, { "text": "one two three four five six seven a m", "likelihood": 147.52960205078 }, { "text": "one two three four or five six seven eight", "likelihood": 147.37252807617 }, { "text": "one two three four five six seven a day", "likelihood": 147.34791564941 }, { "text": "one two three four five six seven eight to", "likelihood": 147.19779968262 }, { "text": "one two three four five six seven eight the", "likelihood": 146.8786315918 } ] }

Do you have any suggestions?

— Reply to this email directly or view it on GitHub https://github.com/alumae/gst-kaldi-nnet2-online/issues/19#issuecomment-106306674 .

rikrd commented 9 years ago

Sorry for the late response. Would it make sense to have the nbest_results as the actual results? The text item in the root dictionary is basically the first of the nbest_results.

Would it also make sense to have the phone alignment be part of each of the nbest_results (currently only of the first)? In the future we may want to have the phone alignments of multiple hypotheses.

We can then remove the likelihood item from the root dictionary, and probably the likelihood_per_frame that can be easily computed.

Here is how the results would be:

{
  "num_frames": 615,
  "results": [
    {
      "text": "one two three four five six seven eight",
      "likelihood": 149.29747009277,
      "phone_alignment": "SIL 1.21\nHH_B 0.09\nW_I 0.08\nAH_I 0.09\nN_E 0.14\nSIL 0.04\nT_B 0.13\nUW_E 0.18\nTH_B 0.17\nR_I 0.06\nIY_E 0.18\nF_B 0.2\nAO_I 0.27\nR_E 0.06\nF_B 0.19\nAY_I 0.15\nV_E 0.05\nS_B 0.18\nIH_I 0.06\nK_I 0.12\nS_E 0.05\nS_B 0.1\nEH_I 0.07\nV_I 0.05\nAH_I 0.06\nN_E 0.14\nSIL 0.16\nEY_B 0.2\nT_E 0.04\nSIL 1.63\n"
    },
    {
      "text": "one two three four five six seven eight it",
      "likelihood": 148.27366638184
    },
    {
      "text": "one two three four five six seven eight and",
      "likelihood": 148.16448974609
    },
    {
      "text": "one two three four five six seven eight ten",
      "likelihood": 147.72589111328
    },
    {
      "text": "one two three four five six seven eight two",
      "likelihood": 147.65536499023
    },
    {
      "text": "one two three four five six seven a m",
      "likelihood": 147.52960205078
    },
    {
      "text": "one two three four or five six seven eight",
      "likelihood": 147.37252807617
    },
    {
      "text": "one two three four five six seven a day",
      "likelihood": 147.34791564941
    },
    {
      "text": "one two three four five six seven eight to",
      "likelihood": 147.19779968262
    },
    {
      "text": "one two three four five six seven eight the",
      "likelihood": 146.8786315918
    }
  ]
}
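The mapping from the earlier layout to this one is mechanical. As a rough sketch (Python, with a hypothetical `restructure` helper that is not part of the plugin): the root-level `text` and `likelihood` are dropped because they duplicate the first n-best entry, and the root-level `phone_alignment` moves into that first entry.

```python
def restructure(old):
    # Copy the hypotheses so the input dictionary is left untouched.
    hypotheses = [dict(h) for h in old["nbest_results"]]
    # The root-level alignment belongs to the best (first) hypothesis.
    if hypotheses and "phone_alignment" in old:
        hypotheses[0]["phone_alignment"] = old["phone_alignment"]
    return {"num_frames": old["num_frames"], "results": hypotheses}


old = {
    "text": "one two",
    "num_frames": 615,
    "likelihood": 149.3,
    "likelihood_per_frame": 0.24,
    "phone_alignment": "SIL 1.21\nHH_B 0.09\n",
    "nbest_results": [
        {"text": "one two", "likelihood": 149.3},
        {"text": "one to", "likelihood": 148.2},
    ],
}
new = restructure(old)
print(sorted(new))
```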

Furthermore the phone alignments can also be made into JSON. This has the disadvantage of being quite redundant, but it does make it more extensible in the future.

"phone_alignment": [{"duration": "1.21", "phone": "SIL"}, {"duration": "0.09", "phone": "HH_B"}, {"duration": "0.08", "phone": "W_I"}, {"duration": "0.09", "phone": "AH_I"}, {"duration": "0.14", "phone": "N_E"}, {"duration": "0.04", "phone": "SIL"}, {"duration": "0.13", "phone": "T_B"}, {"duration": "0.18", "phone": "UW_E"}, {"duration": "0.17", "phone": "TH_B"}, {"duration": "0.06", "phone": "R_I"}, {"duration": "0.18", "phone": "IY_E"}, {"duration": "0.2", "phone": "F_B"}, {"duration": "0.27", "phone": "AO_I"}, {"duration": "0.06", "phone": "R_E"}, {"duration": "0.19", "phone": "F_B"}, {"duration": "0.15", "phone": "AY_I"}, {"duration": "0.05", "phone": "V_E"}, {"duration": "0.18", "phone": "S_B"}, {"duration": "0.06", "phone": "IH_I"}, {"duration": "0.12", "phone": "K_I"}, {"duration": "0.05", "phone": "S_E"}, {"duration": "0.1", "phone": "S_B"}, {"duration": "0.07", "phone": "EH_I"}, {"duration": "0.05", "phone": "V_I"}, {"duration": "0.06", "phone": "AH_I"}, {"duration": "0.14", "phone": "N_E"}, {"duration": "0.16", "phone": "SIL"}, {"duration": "0.2", "phone": "EY_B"}, {"duration": "0.04", "phone": "T_E"}, {"duration": "1.63", "phone": "SIL"}],

rikrd commented 9 years ago

Sorry, the phone alignments duration should of course be a float:

"phone_alignment": [{"duration": 1.21, "phone": "SIL"}, {"duration": 0.09, "phone": "HH_B"}, {"duration": 0.08, "phone": "W_I"}, {"duration": 0.09, "phone": "AH_I"}, {"duration": 0.14, "phone": "N_E"}, {"duration": 0.04, "phone": "SIL"}, {"duration": 0.13, "phone": "T_B"}, {"duration": 0.18, "phone": "UW_E"}, {"duration": 0.17, "phone": "TH_B"}, {"duration": 0.06, "phone": "R_I"}, {"duration": 0.18, "phone": "IY_E"}, {"duration": 0.2, "phone": "F_B"}, {"duration": 0.27, "phone": "AO_I"}, {"duration": 0.06, "phone": "R_E"}, {"duration": 0.19, "phone": "F_B"}, {"duration": 0.15, "phone": "AY_I"}, {"duration": 0.05, "phone": "V_E"}, {"duration": 0.18, "phone": "S_B"}, {"duration": 0.06, "phone": "IH_I"}, {"duration": 0.12, "phone": "K_I"}, {"duration": 0.05, "phone": "S_E"}, {"duration": 0.1, "phone": "S_B"}, {"duration": 0.07, "phone": "EH_I"}, {"duration": 0.05, "phone": "V_I"}, {"duration": 0.06, "phone": "AH_I"}, {"duration": 0.14, "phone": "N_E"}, {"duration": 0.16, "phone": "SIL"}, {"duration": 0.2, "phone": "EY_B"}, {"duration": 0.04, "phone": "T_E"}, {"duration": 1.63, "phone": "SIL"}]
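Until the plugin emits this structure itself, a client could derive it from the newline-separated string format. The following is a minimal Python sketch; `parse_phone_alignment` is a hypothetical helper, and it assumes every non-empty line has the form "PHONE DURATION" as in the examples above.

```python
def parse_phone_alignment(raw):
    entries = []
    for line in raw.splitlines():
        if not line.strip():
            continue  # skip blank lines
        phone, duration = line.split()
        entries.append({"duration": float(duration), "phone": phone})
    return entries


raw = "SIL 1.21\nHH_B 0.09\nW_I 0.08\n"
print(parse_phone_alignment(raw))
```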

alumae commented 9 years ago

Richard, I realized it already myself and I'm working on it :)

rikrd commented 9 years ago

Good, sorry again for the delay on the initial reply, was quite busy this week. If you need a hand on the implementation side, just let me know.

alumae commented 9 years ago

Now the JSON looks like this:

{
  "num-frames": 615,
  "status": 0,
  "result": {
    "hypotheses": [
      {
        "transcript": "one two three four five six seven eight",
        "likelihood": 149.29747009277,
        "likelihood-per-frame": 0.24276011397199,
        "phone-alignment": [
          {
            "phone": "SIL",
            "start": 0,
            "length": 1.2099999189377
          },
          {
            "phone": "HH_B",
            "start": 1.2099999189377,
            "length": 0.089999996125698
          },
          {
            "phone": "W_I",
            "start": 1.2999999523163,
            "length": 0.079999998211861
          },
          {
            "phone": "AH_I",
            "start": 1.3799999952316,
            "length": 0.089999996125698
          },
          {
            "phone": "N_E",
            "start": 1.4699999094009,
            "length": 0.14000000059605
          },
          {
            "phone": "SIL",
            "start": 1.6100000143051,
            "length": 0.03999999910593
          },
          {
            "phone": "T_B",
            "start": 1.6499999761581,
            "length": 0.12999999523163
          },
          {
            "phone": "UW_E",
            "start": 1.7799999713898,
            "length": 0.1799999922514
          },
          {
            "phone": "TH_B",
            "start": 1.9599999189377,
            "length": 0.17000000178814
          },
          {
            "phone": "R_I",
            "start": 2.1299998760223,
            "length": 0.059999998658895
          },
          {
            "phone": "IY_E",
            "start": 2.1900000572205,
            "length": 0.1799999922514
          },
          {
            "phone": "F_B",
            "start": 2.3699998855591,
            "length": 0.19999998807907
          },
          {
            "phone": "AO_I",
            "start": 2.5699999332428,
            "length": 0.26999998092651
          },
          {
            "phone": "R_E",
            "start": 2.8399999141693,
            "length": 0.059999998658895
          },
          {
            "phone": "F_B",
            "start": 2.8999998569489,
            "length": 0.18999999761581
          },
          {
            "phone": "AY_I",
            "start": 3.0899999141693,
            "length": 0.1499999910593
          },
          {
            "phone": "V_E",
            "start": 3.2400000095367,
            "length": 0.049999997019768
          },
          {
            "phone": "S_B",
            "start": 3.289999961853,
            "length": 0.1799999922514
          },
          {
            "phone": "IH_I",
            "start": 3.4700000286102,
            "length": 0.059999998658895
          },
          {
            "phone": "K_I",
            "start": 3.5299999713898,
            "length": 0.11999999731779
          },
          {
            "phone": "S_E",
            "start": 3.6499998569489,
            "length": 0.049999997019768
          },
          {
            "phone": "S_B",
            "start": 3.6999998092651,
            "length": 0.099999994039536
          },
          {
            "phone": "EH_I",
            "start": 3.7999999523163,
            "length": 0.070000000298023
          },
          {
            "phone": "V_I",
            "start": 3.8699998855591,
            "length": 0.049999997019768
          },
          {
            "phone": "AH_I",
            "start": 3.9199998378754,
            "length": 0.059999998658895
          },
          {
            "phone": "N_E",
            "start": 3.9800000190735,
            "length": 0.14000000059605
          },
          {
            "phone": "SIL",
            "start": 4.1199998855591,
            "length": 0.15999999642372
          },
          {
            "phone": "EY_B",
            "start": 4.2799997329712,
            "length": 0.19999998807907
          },
          {
            "phone": "T_E",
            "start": 4.4800000190735,
            "length": 0.03999999910593
          },
          {
            "phone": "SIL",
            "start": 4.5199999809265,
            "length": 1.6299999952316
          }
        ]
      },
      {
        "transcript": "one two three four five six seven eight it",
        "likelihood": 148.27366638184,
        "likelihood-per-frame": 0.24109539249079
      },
      {
        "transcript": "one two three four five six seven eight and",
        "likelihood": 148.16448974609,
        "likelihood-per-frame": 0.24091786950584
      },
      {
        "transcript": "one two three four five six seven eight ten",
        "likelihood": 147.72589111328,
        "likelihood-per-frame": 0.24020470099721
      },
      {
        "transcript": "one two three four five six seven eight two",
        "likelihood": 147.65536499023,
        "likelihood-per-frame": 0.24009002437436
      },
      {
        "transcript": "one two three four five six seven a m",
        "likelihood": 147.52960205078,
        "likelihood-per-frame": 0.23988553178989
      },
      {
        "transcript": "one two three four or five six seven eight",
        "likelihood": 147.37251281738,
        "likelihood-per-frame": 0.23963010214209
      },
      {
        "transcript": "one two three four five six seven a day",
        "likelihood": 147.34791564941,
        "likelihood-per-frame": 0.23959010674701
      },
      {
        "transcript": "one two three four five six seven eight to",
        "likelihood": 147.19779968262,
        "likelihood-per-frame": 0.23934601574409
      },
      {
        "transcript": "one two three four five six seven eight the",
        "likelihood": 146.8786315918,
        "likelihood-per-frame": 0.23882704323869
      }
    ]
  }
}

I have to figure out how to format the floats of fractional times properly.

rikrd commented 9 years ago

This is looking better. I like the idea of having 'hypotheses' as one of the possible results. In the future we may want to add other results (Confusion Word Networks, Lattices, etc...).

Sorry for being picky, but I have been thinking a bit about this and here are a few concerns that arise:

Is the likelihood-per-frame really needed (or could it be done in the consumer as simply likelihood/num_frames)?
In the phone alignments, I don't know if the start is really needed either, since the accumulation of previous durations would be enough.
Seconds (float) may not be the best representation of the duration. Maybe we should use samples (int) and then add the samplerate used by the model to the root dictionary (GST may have resampled the audio to adapt it to the model's). This could be a solution to the problem with fractional times that you raise.
Frames is a concept internal to the speech recognition modelling system, maybe it is better to talk in terms of audio samples.

Below is an example (assuming 100 samples per frame).

{
  "num-samples": 61500,
  "samplerate": 16000,
  "status": 0,
  "result": {
    "hypotheses": [
      {
        "transcript": "one two three four five six seven eight",
        "likelihood": 149.29747009277,
        "phone-alignment": [
          {
            "phone": "SIL",
            "length": 19200
          },
          {
            "phone": "HH_B",
            "length": 1440
          },
...

What do you think about this?
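The first two concerns come down to small computations on the consumer side. As a sketch in Python (`add_start_times` is a hypothetical helper; the numbers are taken from the earlier example output), start times can be recovered by accumulating the preceding lengths, and the per-frame likelihood is a single division:

```python
from itertools import accumulate


def add_start_times(alignment):
    # Each phone starts where the previous ones end.
    lengths = [p["length"] for p in alignment]
    starts = [0.0] + list(accumulate(lengths))[:-1]
    return [dict(p, start=s) for p, s in zip(alignment, starts)]


alignment = [
    {"phone": "SIL", "length": 1.21},
    {"phone": "HH_B", "length": 0.09},
    {"phone": "W_I", "length": 0.08},
]
with_starts = add_start_times(alignment)
print([p["start"] for p in with_starts])

# likelihood / num_frames, using the values from the example output
likelihood_per_frame = 149.29747009277 / 615
print(likelihood_per_frame)
```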

alumae commented 9 years ago

Yes, likelihood-per-frame is pretty pointless. However, I would like the phone alignments to also include start times. It makes manually debugging the alignment results much easier. I kind of fixed the time representation problem by limiting the float precision in JSON results to 6. Of course, it also limits the other float values (e.g., likelihood becomes 149.297 instead of 149.29747009277) but I think it's OK. Another option would be to use milliseconds. I don't really like the samples-based representation; it also exposes internal details. Also, I think I'll replace num-frames with a simple length.
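The effect of limiting the precision to 6 significant digits can be reproduced with a short Python sketch. It only mimics the resulting numbers, not how the plugin configures its JSON output; `limit_precision` and `to_milliseconds` are hypothetical helpers, the latter showing the integer-milliseconds alternative.

```python
def limit_precision(value, digits=6):
    # Keep `digits` significant digits, as the "%g" format does.
    return float(f"{value:.{digits}g}")


def to_milliseconds(seconds):
    # Alternative representation: whole milliseconds.
    return round(seconds * 1000)


print(limit_precision(149.29747009277))    # 149.297
print(limit_precision(1.2099999189377))    # 1.21
print(limit_precision(0.089999996125698))  # 0.09
print(to_milliseconds(1.2099999189377))    # 1210
```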

amitbeka commented 9 years ago

I tend to agree with Tanel. I think milliseconds can be a nicer representation (or limited floats, but I prefer integers there). Limiting the precision of the likelihood is not a problem, because the extra digits are rarely significant. If we see that someone needs a more fine-grained likelihood, we can make the precision a parameter without breaking the API.

As for phone alignments, I think keeping the start time is easier for the user, as many users will need to compute it anyway when receiving the results.

rikrd commented 9 years ago

Regarding the start parameter of the phones, I don't have any strong opinions on it, and I agree with you that it will make things easier for users.

The only potential problem with using milliseconds for timing results is drift over time. In a streaming setting where we transcribe long segments of uninterrupted audio, these precision errors will become a problem if the client tries to synchronise the results with the audio. One solution is to make the times relative to the beginning of the audio (not relative to the result); however, this value then needs to be able to grow without bound. Another solution is to use integer sample counts, since then there are no precision errors, and keep the timings relative to the current result (as is currently the case). A final solution is to represent times as rational numbers (with a numerator and denominator), but this makes the handling much more complex (I believe, though I'm not sure, that this is the solution used in FFmpeg and other AV libs).

amitbeka commented 9 years ago

I got a little lost here (audio isn't my strong suit, talk LM to me :)), but I guess we've surfaced most of the potential problems with the different representations, so you guys can decide on whatever seems best to you.

alumae commented 9 years ago

I added segment-start, segment-length and total-length (total time processed so far) to JSON, now it looks like:

{
  "segment-start": 58.57,
  "status": 0,
  "result": {
    "hypotheses": [
      {
        "transcript": "we're not ready for the next epidemic",
        "likelihood": 120.148,
        "phone-alignment": [
          {
            "phone": "SIL",
            "length": 0.39,
            "start": 0
          },
          {
            "phone": "W_B",
            "length": 0.18,
            "start": 0.39
          },
          {
            "phone": "ER_E",
            "length": 0.06,
            "start": 0.57
          },
          {
            "phone": "N_B",
            "length": 0.06,
            "start": 0.63
          },
          {
            "phone": "AA_I",
            "length": 0.19,
            "start": 0.69
          },
          {
            "phone": "T_E",
            "length": 0.11,
            "start": 0.88
          },
          {
            "phone": "R_B",
            "length": 0.07,
            "start": 0.99
          },
          {
            "phone": "EH_I",
            "length": 0.1,
            "start": 1.06
          },
          {
            "phone": "D_I",
            "length": 0.05,
            "start": 1.16
          },
          {
            "phone": "IY_E",
            "length": 0.22,
            "start": 1.21
          },
          {
            "phone": "SIL",
            "length": 0.46,
            "start": 1.43
          },
          {
            "phone": "F_B",
            "length": 0.1,
            "start": 1.89
          },
          {
            "phone": "ER_E",
            "length": 0.05,
            "start": 1.99
          },
          {
            "phone": "DH_B",
            "length": 0.05,
            "start": 2.04
          },
          {
            "phone": "AH_E",
            "length": 0.05,
            "start": 2.09
          },
          {
            "phone": "N_B",
            "length": 0.06,
            "start": 2.14
          },
          {
            "phone": "EH_I",
            "length": 0.11,
            "start": 2.2
          },
          {
            "phone": "K_I",
            "length": 0.08,
            "start": 2.31
          },
          {
            "phone": "S_I",
            "length": 0.05,
            "start": 2.39
          },
          {
            "phone": "T_E",
            "length": 0.07,
            "start": 2.44
          },
          {
            "phone": "EH_B",
            "length": 0.08,
            "start": 2.51
          },
          {
            "phone": "P_I",
            "length": 0.09,
            "start": 2.59
          },
          {
            "phone": "AH_I",
            "length": 0.04,
            "start": 2.68
          },
          {
            "phone": "D_I",
            "length": 0.08,
            "start": 2.72
          },
          {
            "phone": "EH_I",
            "length": 0.1,
            "start": 2.8
          },
          {
            "phone": "M_I",
            "length": 0.08,
            "start": 2.9
          },
          {
            "phone": "IH_I",
            "length": 0.08,
            "start": 2.98
          },
          {
            "phone": "K_E",
            "length": 0.18,
            "start": 3.06
          },
          {
            "phone": "SIL",
            "length": 0.13,
            "start": 3.24
          }
        ]
      },
      {
        "transcript": "were not ready for the next epidemic",
        "likelihood": 117.297
      }
    ]
  },
  "segment-length": 3.37,
  "total-length": 61.94
}

I do the length bookkeeping in seconds (floats), rather than samples. I don't think there should be any systematic drift: the floating-point imprecision resulting from many sums shouldn't drift in either direction.
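With these fields a client can place phones on the timeline of the whole audio stream. Below is a rough Python sketch; the payload is an abbreviated copy of the example above (first two phones only), and `absolute_phone_times` is a hypothetical helper that assumes `start` is relative to `segment-start`, as the example values suggest.

```python
import json
import math

# Abbreviated copy of the example output above.
payload = """
{
  "segment-start": 58.57,
  "status": 0,
  "result": {
    "hypotheses": [
      {
        "transcript": "we're not ready for the next epidemic",
        "likelihood": 120.148,
        "phone-alignment": [
          {"phone": "SIL", "length": 0.39, "start": 0},
          {"phone": "W_B", "length": 0.18, "start": 0.39}
        ]
      }
    ]
  },
  "segment-length": 3.37,
  "total-length": 61.94
}
"""


def absolute_phone_times(result_json):
    result = json.loads(result_json)
    offset = result["segment-start"]
    best = result["result"]["hypotheses"][0]
    # (phone, absolute start, absolute end) in seconds
    return [
        (p["phone"], offset + p["start"], offset + p["start"] + p["length"])
        for p in best.get("phone-alignment", [])
    ]


times = absolute_phone_times(payload)
print(times)

# In the example, segment-start + segment-length equals total-length.
data = json.loads(payload)
print(math.isclose(data["segment-start"] + data["segment-length"],
                   data["total-length"]))  # True
```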

rikrd commented 9 years ago

Ok, you may be right about the fact that there should be no drift. In any case having the bookkeeping in the gst plugin allows us to solve it later if it were a problem, without needing to change the clients. I like this solution.