mike-nelson opened 9 years ago
I haven't exactly given up, but as you can see I haven't done anything for a couple of years.
It's just a matter of getting 150 hours of precisely transcribed NZE speech.
Wow, it is amazing that it is so hard to come by...
I was thinking the find-and-replace method could be workable (replacing phonemes in the Sphinx file), although you would need some well-thought-out patterns.
Cheers -Mike
Mike Nelson // CEO, Beweb // 09 3077042 // 027 4403757
On 28 April 2015 at 22:22, Douglas Bagnall notifications@github.com wrote:
OK, you've reminded me there's more to it than the speech corpus.
Pocketsphinx also needs a dictionary mapping phoneme sequences to words (and a language model that evaluates the probability of word sequences, but that is probably less difficult). The US dialect described in the CMU pronouncing dictionary uses 39 phonemes. New Zealanders use 42 or 43, and they are not a simple superset. There is no automated way to tease them apart. For examples, see this page: https://github.com/douglasbagnall/nze-vox/wiki/Cmudict%27s-dialect-empirically-determined#father-bother-merged -- in General American English, "father" rhymes with "bother", and there is no automated search-and-replace way of fixing this up for NZE -- you pretty much have to tear all the vowels apart and arbitrarily reassemble them.
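To illustrate why a search-and-replace can't work -- a sketch, where the "NZE" symbols are hypothetical placeholders (PALM/LOT in lexical-set terms), not a real inventory: in cmudict both words carry the same AA vowel, but NZE splits them by word, so no context-free substitution can recover the distinction.

```python
# cmudict (US) pronunciations: both words use the same AA vowel.
cmudict = {
    "father": ["F", "AA1", "DH", "ER0"],
    "bother": ["B", "AA1", "DH", "ER0"],
}

# In NZE these words take different vowels. The symbols here are
# hypothetical placeholders, not a real NZE phoneme set.
nze = {
    "father": ["F", "PALM", "DH", "AH"],
    "bother": ["B", "LOT", "DH", "AH"],
}

def naive_substitute(phones, mapping):
    """Context-free find-and-replace over a phoneme list."""
    return [mapping.get(p, p) for p in phones]

# Whichever target you pick for AA1, at least one word comes out wrong:
mapping = {"AA1": "PALM", "ER0": "AH"}
print(naive_substitute(cmudict["father"], mapping))  # vowel right for "father"
print(naive_substitute(cmudict["bother"], mapping))  # wrong vowel for "bother"
```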
For the speech corpus you need to transcribe every little dysfluency -- the ums and ahs and false starts -- and the speech needs to be in short chunks (10-30 seconds if I recall correctly), and well recorded. I don't believe there is any short cut.
Recently people have been doing end-to-end transcription using recurrent neural networks, which cuts out some of these fiddly steps. At the same time it cuts out all the infrastructure that pocketsphinx has built up over the years. There isn't going to be an easy solution soon.
It does seem that if they are missing some sounds we use, it could never be perfect -- I wonder how limiting that is. I imagine it would at least be better than using straight US English.
You were saying we need 150 hours of speech with accurate transcription. It occurred to me that some text-to-speech engines, e.g. the system voice on iPhone or Mac, have pretty reasonable NZ English. I wonder if you could run the CMU dictionary or a standard corpus file through one of those and get the computer to record itself in correctly sliced-up and named files?? That would give very accurate transcriptions! (That's as far as I got, though...!)
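A rough sketch of that idea on macOS, where the built-in `say` command can render text straight to an audio file. The prompt list, utterance IDs, and voice choice are all hypothetical (Karen is Apple's Australian voice, used here only as a stand-in for something NZ-ish); this just builds the commands rather than running them.

```python
import shlex

# Hypothetical prompt list; in practice this would come from a corpus
# transcript file, one utterance per line.
prompts = [
    ("utt0001", "she had your dark suit in greasy wash water all year"),
    ("utt0002", "don't ask me to carry an oily rag like that"),
]

def tts_command(utt_id, text, voice="Karen"):
    """Build a macOS `say` invocation that writes one named AIFF file
    per utterance, so file names line up with the transcript."""
    return "say -v {} -o {}.aiff {}".format(voice, utt_id, shlex.quote(text))

for utt_id, text in prompts:
    print(tts_command(utt_id, text))
```

One catch: a synthetic voice gives you only one "speaker" with no natural variation, so it may not train a model that generalises to real people.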
For my particular app I want to make sure it recognises words in any accent. For example, I want to recognise the word "archive", so I am thinking I may be able to create a limited language model of just the words I want to recognise, including both forms, something like this:
AA R K AY V   <-- US
AA K AY V     <-- NZ? Just US without the 'R'?
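In the cmudict/pocketsphinx dictionary format, alternate pronunciations of the same word are listed with a parenthesised index, so both accents could sit side by side in a custom `.dict` file -- a sketch only, where the second variant is just the guess above, not a verified NZE transcription:

```
archive     AA R K AY V
archive(2)  AA K AY V
```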
From reading:
http://www.speech.cs.cmu.edu/cgi-bin/cmudict?in=archive
http://cmusphinx.sourceforge.net/wiki/tutoriallm
Cheers -Mike
On 29 April 2015 at 23:09, Douglas Bagnall notifications@github.com wrote:
Yes, if you have a limited vocabulary it is all quite a lot simpler. You can probably just adapt an existing model using MAP (http://cmusphinx.sourceforge.net/wiki/tutorialadapt) and use a grammar instead of a language model.
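For a small fixed vocabulary, the grammar can be written in JSGF, which pocketsphinx accepts in place of a statistical language model -- a minimal sketch, assuming the app only needs a handful of command words (the word list here is made up):

```
#JSGF V1.0;

grammar commands;

public <command> = archive | delete | play | stop;
```

Each word in the grammar still needs an entry (or several, for accent variants) in the pronunciation dictionary.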
Hi, just wondering if there has been any progress on this? I am writing an app using pocketsphinx.