cmusphinx / sphinx4

Pure Java speech recognition library
cmusphinx.sourceforge.net

Detect speech pauses; Out of Memory Crash #14

Open jonathanglasmeyer opened 10 years ago

jonathanglasmeyer commented 10 years ago

I'm not entirely sure if this is the best place to ask this kind of question, so please point me to a better place if there is one.

We are currently using the Sphinx4 Long Aligner with some success for a subtitling project at the University of Hamburg.

Today was the first time that I tried it successfully "in the field". I took the transcription and this video from the CCC Congress and aligned the 35 min video (that is, the WAV file converted according to your instructions) in ~88 min with the Sphinx Long Aligner, which is pretty good, I think. (You can see the manually optimized results on the linked video page.)

Right now the biggest problem for this application is pauses in speech. The words are always placed directly next to each other, even if there are long pauses between them. This means a lot of manual dragging around of the results. Long story short: is there an option to turn on speech pause detection?

Also, a second, smaller problem: when trying the Aligner with an audio file longer than 50 min, it fails with an Out of Memory error at the liveCMN stage after about 2 h (the Java VM has a 7G heap limit). Is there a way to change this?
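When a run dies with an Out of Memory error, one quick sanity check is whether the intended heap limit is really in effect for the process. The following is a minimal sketch using only the standard `Runtime` API. The class name `HeapCheck` and the helper `toMiB` are hypothetical names chosen for illustration, and nothing here is part of Sphinx4.

```java
public class HeapCheck {
    /** Converts a byte count to whole mebibytes (hypothetical helper). */
    static long toMiB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        // maxMemory() is the most memory the JVM will attempt to use (governed by -Xmx).
        System.out.println("max heap MiB:   " + toMiB(rt.maxMemory()));
        // totalMemory() is what the JVM has currently reserved; freeMemory() is the unused part of it.
        System.out.println("total heap MiB: " + toMiB(rt.totalMemory()));
        System.out.println("free heap MiB:  " + toMiB(rt.freeMemory()));
    }
}
```

Running it with the same options as the aligner, for example `java -Xmx7g HeapCheck`, should report a maximum close to the configured limit. If the number is much smaller, the option is probably not reaching the JVM.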

Thanks for your help and your great work, which enables us to subtitle the CCC videos an order of magnitude faster.

nshmyrev commented 10 years ago

Hi Jonathan

Thanks for using CMUSphinx

Could you please elaborate more on this problem with pauses? I'm not sure I get it.

Also, please share the problematic files where you have issues with the aligner.

Thank you.

jonathanglasmeyer commented 10 years ago

Hi, so say the speaker makes a longer pause. Then this pause isn't represented in the timing information of the last word before the pause and the first word after the pause -- they are aligned as though they were directly next to each other.

An example where it failed with the same error on two PCs is this audio (https://transfer.sh/fd21Z/datamining.wav) with this transcription.

The Aligner runs for ~45 min and then hangs at the same position in the logging output (it just stands still for more than 60 min):

.
.
.

INFO: Skipping text range due to a high density [and]
Oct 25, 2014 8:55:49 PM edu.cmu.sphinx.api.SpeechAligner align
INFO: Aligning frame 0:15580 to text [id, like, to, introduce, our, speaker, here, patrick, here, has, made, a, carrer, of, datamining, for, good, prosecuting, war, crimes, got, a, conviction, in, his, own, country, gouatemala, thank] range edu.cmu.sphinx.util.Range@61dfee2f
20:55:49.086 INFO dictionary           Loading dictionary from: jar:file:/home/jwerner/dev/prosub/modules/aligner/sphinx4-samples.jar!/edu/cmu/sphinx/models/acoustic/wsj/dict/cmudict.0.6d
20:55:49.175 INFO dictionary           Loading filler dictionary from: jar:file:/home/jwerner/dev/prosub/modules/aligner/sphinx4-samples.jar!/edu/cmu/sphinx/models/acoustic/wsj/noisedict
20:55:49.176 INFO dictionary           The dictionary is missing a phonetic transcription for the word 'carrer'
20:55:49.176 INFO dictionary           The dictionary is missing a phonetic transcription for the word 'datamining'
20:55:49.177 INFO dictionary           The dictionary is missing a phonetic transcription for the word 'gouatemala'
20:55:49.597 INFO dictionary           The dictionary is missing a phonetic transcription for the word 'carrer'
20:55:49.598 INFO dictionary           The dictionary is missing a phonetic transcription for the word 'datamining'
20:55:49.598 INFO dictionary           The dictionary is missing a phonetic transcription for the word 'gouatemala'
20:55:49.608 INFO lexTreeLinguist      Max CI Units 50
20:55:49.608 INFO lexTreeLinguist      Unit table size 125000
20:55:49.640 INFO liveCMN              15.56 -0.79 -1.05 -0.39 -0.27 -0.12 -0.13 -0.16 -0.17 -0.15 -0.19 -0.16 -0.16 
20:55:49.684 INFO liveCMN              13.81 -0.78 -0.88 -0.39 -0.23 -0.10 -0.10 -0.12 -0.16 -0.15 -0.16 -0.15 -0.14 
20:55:49.823 INFO liveCMN              11.58 -0.66 -0.60 -0.30 -0.12 -0.03 -0.02 -0.07 -0.12 -0.12 -0.12 -0.11 -0.13 
20:55:50.114 INFO liveCMN              11.15 -0.74 -0.72 -0.33 -0.12 0.00 -0.01 -0.05 -0.11 -0.06 -0.10 -0.11 -0.10 
20:55:50.729 INFO liveCMN              12.25 -0.87 -0.85 -0.39 -0.17 -0.03 -0.06 -0.06 -0.11 -0.07 -0.10 -0.11 -0.12 
20:55:51.236 INFO liveCMN              13.46 -0.75 -0.88 -0.39 -0.15 -0.05 -0.07 -0.06 -0.11 -0.10 -0.12 -0.13 -0.13

So here it is probably not an Out of Memory problem, but some other kind.

Could this be related to the poor quality of the transcription?

mbait commented 10 years ago

Hi Jonathan,

"Then this pause isn't represented in the timing information of the last word before the pause and the first word after the pause -- they are aligned as though they were directly next to each other."

It still isn't clear what your expected and actual outputs are.


Sincerely, Alexander

jonathanglasmeyer commented 10 years ago

OK, let me rephrase it with an example. Say you have two words, A and B, with the following real start and stop times (in seconds):

A start=2, stop=2.2
B start=4, stop=4.2

So there is a speech pause between 2.2 and 4. We would like to have this pause represented in the alignment.

But the actual alignment looks, for example, like this:

A start=2, stop=2.2
B start=2.2, stop=4.2
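The difference between the two alignments can be stated as the gap between the end of one word and the start of the next. The following is a minimal sketch of that calculation. The `Word` class and the `gapsMillis` method are hypothetical names for illustration, not part of the Sphinx4 API, and times are held in milliseconds to avoid floating-point rounding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PauseGaps {
    /** Hypothetical container for one aligned word; times are in milliseconds. */
    static final class Word {
        final String text;
        final long startMs;
        final long endMs;

        Word(String text, long startMs, long endMs) {
            this.text = text;
            this.startMs = startMs;
            this.endMs = endMs;
        }
    }

    /** Returns the silence (in ms) between each pair of consecutive words. */
    static List<Long> gapsMillis(List<Word> words) {
        List<Long> gaps = new ArrayList<>();
        for (int i = 1; i < words.size(); i++) {
            gaps.add(words.get(i).startMs - words.get(i - 1).endMs);
        }
        return gaps;
    }

    public static void main(String[] args) {
        // The real timings from the example: A ends at 2.2 s, B starts at 4 s.
        List<Word> expected = Arrays.asList(new Word("A", 2000, 2200), new Word("B", 4000, 4200));
        // The alignment as reported: B is stretched back to start at 2.2 s.
        List<Word> actual = Arrays.asList(new Word("A", 2000, 2200), new Word("B", 2200, 4200));

        System.out.println("expected gaps (ms): " + gapsMillis(expected)); // prints [1800]
        System.out.println("actual gaps (ms):   " + gapsMillis(actual));   // prints [0]
    }
}
```

In the expected alignment the pause shows up as a 1800 ms gap. In the actual alignment the gap is 0, because the pause has been absorbed into the duration of B.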

nshmyrev commented 10 years ago

I can take a look

By the way, for better alignment quality you should use the en-us generic acoustic model:

http://sourceforge.net/projects/cmusphinx/files/Acoustic%20and%20Language%20Models/US%20English%20Generic%20Acoustic%20Model/en-us.tar.gz/download