I am trying to transcribe a long (over 30 minutes) stereo audio file (2 channels). Because I do not see an option to recognize speakers, or any indicator of which channel a word came from, I split the audio into a left and a right channel, transcribe them separately, and then parse the results to join them back together. I write the transcription results to a file and parse them later.

There appears to be a timing issue in the results I receive. Below is a snippet of what is returned: notice that the "start" time for the final word "training" is 1203.15 seconds, and the "start" time for the next word, "this", is 1808.0 seconds. I can guarantee the audio files I am transcribing do not contain a 10 minute gap or silence. The jump seems to occur at 10 minute intervals starting around the 20 minute mark (1200 seconds, 1800 seconds, 2400 seconds, etc.). This really messes up parsing, because I rely on the timestamps to join both audio channels back together into a conversation.
    {
      "result": [
        { "conf": 1.000000, "end": 1199.850000, "start": 1199.550000, "word": "training" },
        { "conf": 0.999999, "end": 1202.610000, "start": 1202.490000, "word": "for" },
        { "conf": 0.999999, "end": 1203.060000, "start": 1202.610000, "word": "quality" },
        { "conf": 0.999994, "end": 1203.150000, "start": 1203.060000, "word": "and" },
        { "conf": 1.000000, "end": 1203.480000, "start": 1203.150000, "word": "training" }
      ],
      "text": "training for quality and training"
    },
    {
      "result": [
        { "conf": 1.000000, "end": 1808.120000, "start": 1808.000000, "word": "this" },
        { "conf": 1.000000, "end": 1808.360000, "start": 1808.120000, "word": "call" },
        { "conf": 1.000000, "end": 1808.480000, "start": 1808.360000, "word": "may" },
        { "conf": 1.000000, "end": 1808.600000, "start": 1808.480000, "word": "be" }
      ]
    }
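For context, the joining step described above can be sketched roughly as follows. This is a minimal Python sketch under assumptions, not the exact code in use: it assumes each channel's results have been loaded into a list of objects shaped like the snippet (a "result" list of words with "start" and "end"), and the names `merge_channels`, `find_gaps`, and the `max_gap` threshold are hypothetical, chosen only for illustration. The second function is a simple way to flag suspicious timestamp jumps such as the one between 1203.48 and 1808.0.

```python
def tag_words(utterances, speaker):
    """Flatten [{"result": [...]}, ...] into (start, speaker, word) tuples."""
    return [
        (w["start"], speaker, w["word"])
        for utt in utterances
        for w in utt.get("result", [])
    ]


def merge_channels(left, right):
    """Interleave two channels by word start time and group them into turns.

    Relies entirely on the "start" timestamps, so a jump in one channel's
    timestamps will misplace that channel's words in the conversation.
    """
    words = tag_words(left, "left") + tag_words(right, "right")
    # sorted() is stable, so on equal start times "left" words stay first.
    words.sort(key=lambda item: item[0])

    turns = []
    for start, speaker, word in words:
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["words"].append(word)
        else:
            turns.append({"speaker": speaker, "start": start, "words": [word]})
    return turns


def find_gaps(utterances, max_gap=60.0):
    """Return (previous_end, next_start) pairs whose gap exceeds max_gap seconds.

    max_gap is an arbitrary illustrative threshold, not a recommended value.
    """
    words = [w for utt in utterances for w in utt.get("result", [])]
    return [
        (prev["end"], cur["start"])
        for prev, cur in zip(words, words[1:])
        if cur["start"] - prev["end"] > max_gap
    ]
```

Running `find_gaps` on each channel before calling `merge_channels` would at least show where the timestamps jump, so the affected stretches can be identified before the two channels are joined.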