Wrong timestamps when recognizing phonemes

Hi, I'm trying to recognize phonemes in Mandarin Chinese speech using the en-us phonetic language model.

I feed in a 20-second audio file and process it with pocketsphinx.Decoder.process_raw().

Something seems to be wrong with the phoneme timestamps: some of them are greater than 20 seconds, which is longer than the audio itself.
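For reference, this is the upper bound I would expect on any timestamp. It is only a minimal sketch, assuming headerless 16-bit mono PCM at 16 kHz and 100 frames per second; the helper names `raw_duration_seconds` and `max_frame_index` are made up for illustration and are not part of pocketsphinx.

```python
def raw_duration_seconds(num_bytes, sample_rate=16000, sample_width=2, channels=1):
    """Duration of headerless PCM audio, derived from its size in bytes.

    Assumes 16-bit (2-byte) mono samples at 16 kHz unless told otherwise.
    """
    bytes_per_second = sample_rate * sample_width * channels
    return num_bytes / bytes_per_second


def max_frame_index(duration_s, fps=100):
    """Largest frame index that can fall inside the audio at the given frame rate."""
    return int(duration_s * fps)


# A 20-second file at 16 kHz, 16-bit mono is 20 * 16000 * 2 = 640000 bytes.
duration = raw_duration_seconds(640000)
frames = max_frame_index(duration)
print(duration, frames)  # 20.0 2000
```

With a real file, the byte count could come from `os.path.getsize()` on the '.raw' file. Any segment whose end frame is above this bound cannot correspond to a position in the input.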

Here is my test code:

import os
import soundfile as sf
import librosa
from pocketsphinx import pocketsphinx

MODEL_DIR = './pocketsphinx/model'

DATA_PATH = './mandarin_chinese_20s.wav'
TEMP_RAW_PATH = './temp.raw'

# Convert to a 16 kHz mono, 16-bit PCM '.raw' file
y, sr = librosa.load(path=DATA_PATH, sr=16000, mono=True)
sf.write(file=TEMP_RAW_PATH, data=y, samplerate=sr, subtype='PCM_16', format='RAW')

# Create a decoder with the en-us acoustic model and phonetic language model
config = pocketsphinx.Decoder.default_config()
config.set_string('-hmm', os.path.join(MODEL_DIR, 'en-us/en-us'))
config.set_string('-allphone', os.path.join(MODEL_DIR, 'en-us/en-us-phone.lm.bin'))
config.set_float('-lw', 2.0)
config.set_float('-pip', 0.3)
config.set_float('-beam', 1e-10)
config.set_float('-pbeam', 1e-10)
config.set_boolean('-mmap', False)

# Decode streaming data
decoder = pocketsphinx.Decoder(config)

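decoder.start_utt()
# Use a context manager so the file is closed after decoding
with open(TEMP_RAW_PATH, 'rb') as stream:
    while True:
        buf = stream.read(1024)
        if not buf:
            break
        # no_search=False, full_utt=False: decode the stream chunk by chunk
        decoder.process_raw(buf, False, False)
decoder.end_utt()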

# Frames per second (100 is the pocketsphinx default frame rate)
fps = 100

for seg in decoder.seg():
    print(seg.word, seg.start_frame/fps, seg.end_frame/fps)

Output:

INFO: cmn_live.c(88): Update from < 41.00 -5.29 -0.12  5.09  2.48 -4.07 -1.37 -1.78 -5.08 -2.05 -6.45 -1.42  1.17 >
INFO: cmn_live.c(105): Update to   < 60.57 -2.12 -28.96  4.03 -19.24  3.96 -11.54 -11.56 -8.87  3.39 13.46 -0.38  3.14 >
INFO: cmn_live.c(88): Update from < 60.57 -2.12 -28.96  4.03 -19.24  3.96 -11.54 -11.56 -8.87  3.39 13.46 -0.38  3.14 >
INFO: cmn_live.c(105): Update to   < 58.56 -2.86 -25.46  4.64 -20.59  2.39 -7.33 -12.13 -9.60  4.42  9.54 -2.47  2.93 >
SIL 18.39 18.49
T 18.5 18.61
EH 18.62 18.72
...
...
...
AA 29.62 29.74
L 29.75 29.77
HH 29.78 29.86
ER 29.87 30.0
D 30.01 30.1
AE 30.11 30.28
IY 30.29 30.45
HH 30.46 30.52
OW 30.53 30.6
S 30.61 30.7
L 30.71 30.76
AY 30.77 30.85
UW 30.86 30.91
EY 30.92 31.07
INFO: cmn_live.c(120): Update from < 58.56 -2.86 -25.46  4.64 -20.59  2.39 -7.33 -12.13 -9.60  4.42  9.54 -2.47  2.93 >
INFO: cmn_live.c(138): Update to   < 60.67 -3.52 -26.13  6.71 -22.00  3.12 -8.69 -12.51 -8.42  2.04  8.68 -2.51  3.32 >
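To make the problem in the output above explicit, here is a small sketch that flags segments ending after the audio does. It assumes the audio is 20 seconds long and that frames are 10 ms each (100 per second). The frame numbers in the example are just the printed times from my output multiplied by 100, and `find_out_of_range` is a hypothetical helper, not a pocketsphinx function.

```python
def find_out_of_range(segments, duration_s, fps=100):
    """Return the (word, start_frame, end_frame) tuples that end past the audio.

    The comparison is done in frames to avoid rounding surprises in seconds.
    """
    last_valid_frame = duration_s * fps
    return [seg for seg in segments if seg[2] > last_valid_frame]


# Three segments taken from the output above, converted back to frames.
segments = [
    ("SIL", 1839, 1849),  # 18.39 s to 18.49 s, inside the audio
    ("AA", 2962, 2974),   # 29.62 s to 29.74 s, past the end
    ("EY", 3092, 3107),   # 30.92 s to 31.07 s, past the end
]

for word, start, end in find_out_of_range(segments, duration_s=20):
    print(word, start / 100, end / 100)
```

In the real script the tuples could be built from `decoder.seg()` as `(seg.word, seg.start_frame, seg.end_frame)`.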

And here is the test audio file:

mandarin_chinese_20s.zip

Is this a bug, or am I missing something? Thanks.

bambocher / pocketsphinx-python, issue #43