Open rexendevar opened 4 months ago
I'm seeing the same problem here, on both a CPU-only build and a CUDA build. The cause appears to be a time mismatch (shift) in how tokens are associated with timestamps. In the following example, produced with -ojf, the word "aber" is reported at 00:02:51,940, while its real position in the audio file is 00:02:34,200. A few words later the decoder believes it is at 00:03:00, skips the audio between the real position and that point, and continues from there, leaving a gap of missing text. This happens repeatedly until the end of the file.
Example with -nt
{
  "text": " aber",
  "timestamps": {
    "from": "00:02:51,940",
    "to": "00:02:53,910"
  },
  "offsets": {
    "from": 171940,
    "to": 173910
  },
  "id": 4340,
  "p": 0.998843,
  "t_dtw": -1
},
{
  "text": " nicht",
  "timestamps": {
    "from": "00:02:53,930",
    "to": "00:02:56,430"
  },
  "offsets": {
    "from": 173930,
    "to": 176430
  },
  "id": 1979,
  "p": 0.999786,
  "t_dtw": -1
},
{
  "text": " mehr",
  "timestamps": {
    "from": "00:02:56,440",
    "to": "00:02:58,430"
  },
  "offsets": {
    "from": 176440,
    "to": 178430
  },
  "id": 5417,
  "p": 0.999682,
  "t_dtw": -1
},
{
  "text": ".",
  "timestamps": {
    "from": "00:02:58,430",
    "to": "00:02:59,980"
  },
  "offsets": {
    "from": 178430,
    "to": 179980
  },
  "id": 13,
  "p": 0.999716,
  "t_dtw": -1
},
{
  "text": "[_EOT_]",
  "timestamps": {
    "from": "00:03:00,000",
    "to": "00:03:00,000"
  },
  "offsets": {
    "from": 180000,
    "to": 180000
  },
  "id": 50257,
  "p": 0.244065,
  "t_dtw": -1
}
Without -nt, the timestamps are correct:
{
  "text": " aber",
  "timestamps": {
    "from": "00:02:34,200",
    "to": "00:02:34,400"
  },
  "offsets": {
    "from": 154200,
    "to": 154400
  },
  "id": 4340,
  "p": 0.999746,
  "t_dtw": -1
},
{
  "text": " nicht",
  "timestamps": {
    "from": "00:02:34,430",
    "to": "00:02:34,710"
  },
  "offsets": {
    "from": 154430,
    "to": 154710
  },
  "id": 1979,
  "p": 0.999973,
  "t_dtw": -1
},
{
  "text": " mehr",
  "timestamps": {
    "from": "00:02:34,710",
    "to": "00:02:34,920"
  },
  "offsets": {
    "from": 154710,
    "to": 154920
  },
  "id": 5417,
  "p": 0.999939,
  "t_dtw": -1
},
{
  "text": ".",
  "timestamps": {
    "from": "00:02:34,930",
    "to": "00:02:35,140"
  },
  "offsets": {
    "from": 154930,
    "to": 155140
  },
  "id": 13,
  "p": 0.999998,
  "t_dtw": -1
},
{
  "text": "[_TT_1127]",
  "timestamps": {
    "from": "00:02:35,140",
    "to": "00:02:35,140"
  },
  "offsets": {
    "from": 155140,
    "to": 155140
  },
  "id": 51492,
  "p": 0.711062,
  "t_dtw": -1
}
(Both JSON excerpts are from the same input file. Sorry about the German-language content.)
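The drift can be quantified directly from the "offsets" fields in the two excerpts. Below is a minimal Python sketch, not part of whisper.cpp: it pairs tokens from the two runs by position and reports the shift of each token's start, plus any token whose duration looks implausibly long. The helper names start_shifts and long_tokens and the 1000 ms threshold are assumptions chosen for illustration; the token values are copied from the excerpts.

```python
# Tokens copied from the two excerpts: (text, from_ms, to_ms).
with_nt = [
    (" aber", 171940, 173910),
    (" nicht", 173930, 176430),
    (" mehr", 176440, 178430),
    (".", 178430, 179980),
]
without_nt = [
    (" aber", 154200, 154400),
    (" nicht", 154430, 154710),
    (" mehr", 154710, 154920),
    (".", 154930, 155140),
]


def start_shifts(shifted, reference):
    """Return (text, shift_ms) for tokens paired by position with equal text."""
    return [
        (s_text, s_from - r_from)
        for (s_text, s_from, _), (r_text, r_from, _) in zip(shifted, reference)
        if s_text == r_text
    ]


def long_tokens(tokens, max_ms=1000):
    """Return (text, duration_ms) for tokens longer than max_ms.

    The 1000 ms default is an illustrative assumption, not a whisper.cpp value.
    """
    return [(text, to - frm) for text, frm, to in tokens if to - frm > max_ms]


for text, shift in start_shifts(with_nt, without_nt):
    print(f"{text!r}: start shifted by {shift} ms")
print("long tokens with -nt:", long_tokens(with_nt))
print("long tokens without -nt:", long_tokens(without_nt))
```

On these four tokens the start shift grows from 17740 ms to 23500 ms, and every token in the -nt run lasts more than 1.5 s, whereas none in the other run exceeds 300 ms. That is consistent with the timestamps being stretched rather than merely offset by a constant.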
It seems to produce different transcription output depending on whether the no-timestamps flag (-nt) is enabled. You can see this even on the JFK.wav file, where commas are removed.
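One way to check for text differences like this is to diff the transcripts of the two runs word by word. Here is a minimal Python sketch using the standard difflib module. word_diff is a hypothetical helper name, and the two strings are made-up placeholders, not real whisper.cpp output; substitute the text from a run with -nt and a run without it.

```python
import difflib


def word_diff(a: str, b: str):
    """Return (op, words_in_a, words_in_b) for every region that differs."""
    wa, wb = a.split(), b.split()
    matcher = difflib.SequenceMatcher(a=wa, b=wb, autojunk=False)
    return [
        (op, wa[i1:i2], wb[j1:j2])
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Placeholder strings for illustration only (not actual transcripts).
run_a = "one, two three"
run_b = "one two three"

for op, words_a, words_b in word_diff(run_a, run_b):
    print(op, words_a, "->", words_b)
```

Because the comparison is by whitespace-separated word, a dropped comma shows up as a "replace" of the word it was attached to, which makes punctuation-only differences easy to spot.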