bambocher / pocketsphinx-python

Python interface to CMU Sphinxbase and Pocketsphinx libraries
https://pypi.python.org/pypi/pocketsphinx

Question. How to convert frame # into time coordinates? #17

Closed: ghost closed 6 years ago

ghost commented 8 years ago

I'm in need of getting the time coordinates of every word sphinx thinks it identified.

I found out about segments() and realized that it's exactly what I need:

from pocketsphinx import AudioFile

for phrase in AudioFile(audio_file="output.wav"):
    print(phrase.segments(detailed=True))

However it is not clear to me what the frame # represents. It was immediately apparent that it is neither seconds nor sample #s.

I require a method to convert these frame #s into seconds.

Edit: Also, some words have a (#) following them, e.g. 'and(2)'. What does it represent?

bambocher commented 8 years ago

One solution:

from pocketsphinx import AudioFile

# Frames per Second
fps = 100

for phrase in AudioFile(frate=fps):  # frate (default=100)
    print('-' * 28)
    print('| %5s |  %3s  |   %4s   |' % ('start', 'end', 'word'))
    print('-' * 28)
    for s in phrase.segments(detailed=True):
        print('| %4ss | %4ss | %8s |' % (s[2] / fps, s[3] / fps, s[0]))
    print('-' * 28)

# ----------------------------
# | start |  end  |   word   |
# ----------------------------
# |  0.0s | 0.24s | <s>      |
# | 0.25s | 0.45s | <sil>    |
# | 0.46s | 0.63s | go       |
# | 0.64s | 1.16s | forward  |
# | 1.17s | 1.52s | ten      |
# | 1.53s | 2.11s | meters   |
# | 2.12s |  2.6s | </s>     |
# ----------------------------

Another solution:

from pocketsphinx import AudioFile

# Frames per Second
fps = 100

for phrase in AudioFile(frate=fps):  # frate (default=100)
    print('-' * 28)
    print('| %5s |  %3s  |   %4s   |' % ('start', 'end', 'word'))
    print('-' * 28)
    for s in phrase.seg():
        print('| %4ss | %4ss | %8s |' % (s.start_frame / fps, s.end_frame / fps, s.word))
    print('-' * 28)

# ----------------------------
# | start |  end  |   word   |
# ----------------------------
# |  0.0s | 0.24s | <s>      |
# | 0.25s | 0.45s | <sil>    |
# | 0.46s | 0.63s | go       |
# | 0.64s | 1.16s | forward  |
# | 1.17s | 1.52s | ten      |
# | 1.53s | 2.11s | meters   |
# | 2.12s |  2.6s | </s>     |
# ----------------------------

And the last solution:

from pocketsphinx import Pocketsphinx

ps = Pocketsphinx() # frate (default=100)
ps.decode()

print('-' * 28)
print('| %5s |  %3s  |   %4s   |' % ('start', 'end', 'word'))
print('-' * 28)
for s in ps.seg():
    print('| %4ss | %4ss | %8s |' % (s.start_frame / 100, s.end_frame / 100, s.word))
print('-' * 28)

# ----------------------------
# | start |  end  |   word   |
# ----------------------------
# |  0.0s | 0.24s | <s>      |
# | 0.25s | 0.45s | <sil>    |
# | 0.46s | 0.63s | go       |
# | 0.64s | 1.16s | forward  |
# | 1.17s | 1.52s | ten      |
# | 1.53s | 2.11s | meters   |
# | 2.12s |  2.6s | </s>     |
# ----------------------------
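
All three snippets rely on the same arithmetic: a frame index divided by the frame rate gives a time in seconds. Here is a minimal sketch of that conversion as a standalone helper. `frames_to_seconds` is a hypothetical name, not part of pocketsphinx, and the default of 100 frames per second is taken from the `frate` comment above. The tuple values are illustrative, and its second field is not used here.

```python
def frames_to_seconds(frame, frate=100):
    """Convert a frame index to seconds (hypothetical helper, not a pocketsphinx API)."""
    # float() keeps true division under Python 2 as well as Python 3.
    return frame / float(frate)


# A tuple shaped like the ones indexed above: s[0] is the word,
# s[2] the start frame and s[3] the end frame. Values are illustrative.
segment = ('forward', 0, 64, 116)
start = frames_to_seconds(segment[2])
end = frames_to_seconds(segment[3])
print('%s: %.2fs to %.2fs' % (segment[0], start, end))
```

If the decoder was created with a different `frate`, pass the same value to the helper so the two stay in sync.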

Words such as and(2) are alternative transcriptions (pronunciation variants) of the same word; you can see them in the dictionary.
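
If only the plain word is needed, the variant marker can be stripped after decoding. This is a small sketch, assuming the marker is always a trailing number in parentheses, as in 'and(2)'. `base_word` is a hypothetical helper, not a pocketsphinx function.

```python
import re

# Matches a trailing variant marker such as "(2)" at the end of a word.
_VARIANT_SUFFIX = re.compile(r'\(\d+\)$')


def base_word(word):
    """Strip a trailing "(N)" variant marker (hypothetical helper)."""
    return _VARIANT_SUFFIX.sub('', word)


for w in ['and(2)', 'and', '<sil>']:
    print(base_word(w))
```

Markers such as `<s>` and `<sil>` pass through unchanged because they do not end in a parenthesized number.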

hasanian01 commented 4 years ago

Hi, I have tried to get the time coordinates, but they are not correct. I tried all the code above, but the reported times seem to be longer than the duration of the audio file. Is there any other factor or attribute that must be included? What is Decoder.n_frames()?

Thank you
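
One thing that may be worth checking when the reported times exceed the file's duration is the sample rate of the audio. This is an assumption about the cause, not a confirmed diagnosis: if the decoder expects one sample rate and the file was recorded at a higher one, the decoder would see more frames than the real duration implies, and every timestamp would be stretched by the same ratio. The sketch below uses only the standard-library `wave` module to report a file's sample rate, its duration, and roughly how many frames to expect at a given frame rate. `wav_info` is a hypothetical helper.

```python
import wave


def wav_info(path, frate=100):
    """Return (sample_rate_hz, duration_s, expected_frames) for a PCM WAV file.

    Hypothetical helper. expected_frames is only an approximation of how
    many decoder frames the file should produce at `frate` frames per second.
    """
    wf = wave.open(path, 'rb')
    try:
        rate = wf.getframerate()
        # getnframes() counts audio samples per channel, not decoder frames.
        duration = wf.getnframes() / float(rate)
    finally:
        wf.close()
    return rate, duration, int(duration * frate)
```

For example, `rate, duration, frames = wav_info('output.wav')` gives values to compare against the last `end_frame` in the segment list. `Decoder.n_frames()` appears to report the number of frames the decoder has processed, so it could be compared against `frames` in the same way.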