Closed redsteakraw closed 8 years ago
Mimic (and flite, on which it is based) uses the CMU pronunciation dictionary version 0.4 to derive its pronunciations for American English. This dictionary contains a large number of pronunciation errors, inconsistencies and mixed accents. As such, the pronunciations vary in accuracy.
For the words you highlighted, cmudict 0.4 (https://github.com/rhdunn/cmudict/tree/cmudict-0.4) contains:
ATHEISM AH0 TH AY1 S AH0 M
ATHEIST EY1 TH IY0 AH0 S T
ATHEISTIC EY2 TH IY0 IH1 S T IH0 K
ATHEISTS EY1 TH IY0 AH0 S T S
PENIS P EH1 N IH0 S
This highlights what I mentioned above and explains why mimic/flite are pronouncing those words incorrectly.
In cmudict 0.6d (https://github.com/rhdunn/cmudict/tree/cmudict-0.6d), these are:
ATHEISM AH0 TH AY1 S AH0 M
ATHEISM(2) EY1 TH IY0 IH2 Z AH0 M
ATHEIST EY1 TH IY0 AH0 S T
ATHEISTIC EY2 TH IY0 IH1 S T IH0 K
ATHEISTS EY1 TH IY0 AH0 S T S
ATHEISTS(2) EY1 TH IY0 AH0 S S
ATHEISTS(3) EY1 TH IY0 AH0 S
PENIS P IY1 N IH0 S
So penis has been corrected in that version, but atheism is only correct in the alternate pronunciation.
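The comparison above can be reproduced mechanically. The following is a minimal sketch, assuming the plain cmudict line format shown in this comment (a word, an optional variant marker such as (2), then the phonemes); the function names are made up for illustration and only a subset of the entries is included.

```python
def parse_cmudict(text):
    """Parse cmudict-style lines into {word: [pronunciation, ...]}."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";;;"):  # skip blanks and comment lines
            continue
        head, *phones = line.split()
        word = head.split("(", 1)[0]  # ATHEISM(2) -> ATHEISM
        entries.setdefault(word, []).append(" ".join(phones))
    return entries


def changed_entries(old, new):
    """Return {word: (old_prons, new_prons)} for words whose entries differ."""
    return {w: (old.get(w, []), p) for w, p in new.items() if old.get(w, []) != p}


# Subset of the entries quoted above.
CMUDICT_04 = """\
ATHEISM AH0 TH AY1 S AH0 M
ATHEIST EY1 TH IY0 AH0 S T
PENIS P EH1 N IH0 S
"""
CMUDICT_06D = """\
ATHEISM AH0 TH AY1 S AH0 M
ATHEISM(2) EY1 TH IY0 IH2 Z AH0 M
ATHEIST EY1 TH IY0 AH0 S T
PENIS P IY1 N IH0 S
"""

diff = changed_entries(parse_cmudict(CMUDICT_04), parse_cmudict(CMUDICT_06D))
for word, (before, after) in sorted(diff.items()):
    print(word, before, "->", after)
```

ATHEIST is unchanged between the two versions, so only ATHEISM (which gained a variant) and PENIS are reported.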
Can we easily update the dictionary version then?
On Sat, Feb 27, 2016, 3:17 AM Reece H. Dunn notifications@github.com wrote:
Ryan Sipes CTO, Mycroft A.I. https://mycroft.ai 785-979-6091
There is a make_cmulex script in the lang/cmulex directory. I'm not sure how easy this is to run, though, as I have not tried it. It requires festival (which requires speech-tools) and the festlex_CMU.tar.gz file from http://www.cstr.ed.ac.uk/downloads/festival/2.4/. That script hard-codes various references, so it may need some work.
I have github repositories of both festival and speech-tools that include various back-ported build fixes. Versions 1.95 and earlier require older systems with gcc 2.95 -- I have built these in a Debian Woody chroot.
Once #17 is merged, we will be able to address this issue by adding to and correcting the lexicon.
The lexicon we are currently using has part of speech (POS) information for some words. This POS information can be used to disambiguate the pronunciation of words. For instance, "live" as a verb ("I live here") versus "live" as a noun ("The post office will not ship live animals."). More recent versions of the cmudict do not have POS information:
LIVE L AY1 V
LIVE(2) L IH1 V
I am a bit concerned about how this lack of POS information can affect mimic's ability to resolve homograph ambiguities.
My current idea is to add all the new cmudict words using a 'missing' POS field (that poses no problem). Alternative pronunciations will be automatically discarded, and if bugs arise we will see how to deal with them. Our "base dictionary" will still be our current dictionary, so any word that already has POS information (such as "live") will not lose it.
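A minimal sketch of that merge strategy, assuming the lexicons are held as plain Python structures. The function name, the data layout and the POS tags in the example are hypothetical and are not taken from mimic's build scripts.

```python
def merge_lexicons(base, cmudict_entries):
    """Add words from a newer cmudict to a POS-tagged base lexicon.

    base: {word: [(pos, phones), ...]} (the current dictionary; always wins)
    cmudict_entries: list of (word, phones) in file order, variants included
    """
    merged = {word: list(entries) for word, entries in base.items()}
    for word, phones in cmudict_entries:
        if word in merged:
            # Either the base already has this word (keep its POS entries),
            # or this is an alternative pronunciation of a word just added.
            continue
        merged[word] = [(None, phones)]  # None stands for the 'missing' POS
    return merged


# Hypothetical base entries with POS tags.
base = {"live": [("v", "L IH1 V"), ("n", "L AY1 V")]}
newer = [
    ("live", "L AY1 V"),
    ("live", "L IH1 V"),
    ("atheism", "AH0 TH AY1 S AH0 M"),
    ("atheism", "EY1 TH IY0 IH2 Z AH0 M"),
]
merged = merge_lexicons(base, newer)
print(merged["live"])
print(merged["atheism"])
```

Note that discarding alternatives keeps only the first listed pronunciation, which for ATHEISM in cmudict 0.6d is the one reported as wrong in this issue, so such words would still need a manual fix.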
If anyone is aware of a free lexicon with POS information and phonetic transcriptions, suggestions are welcome.
Regarding updating the dictionary:
cmudict-0.4.diff contains the changes made to the base cmudict-0.4.scm generated from the cmudict source file.
I have a cmudict-tools project that can be used to help maintain the pronunciations. This has the ability to generate festival format dictionaries (e.g. cmudict-tools --format festival print cmudict). This uses the value in brackets ((2) in your LIVE example) in the POS field, so in cmudict markup this would be:
LIVE(n) L AY1 V
LIVE(v) L IH1 V
There is a potential license conflict between the festlex_CMU changes and the changes made to the cmudict file after version 0.6d. This is because the COPYING file in festlex_CMU contains the requirement:
3. Original authors' names are not deleted.
and the current maintainer of the cmudict (Alex Rudnicky) removed the original header, first added in version 0.2, that credited authorship to Bob Weide (the original maintainer). Additionally, the cmudict-0.4.scm file from festlex_CMU preserves that header (albeit with the text converted to lower case), whereas cmudict-0.4.out does not.
cmudict versions 0.1 to 0.7 are available in the Public Domain. Versions 0.5 and 0.7 don't have an official release, but Alex Rudnicky created a reconstructed version in cmusphinx commit 7825, which I have tagged as cmudict-0.7. Versions after this have been released under a 2-clause BSD license (source and binary distributions must retain the copyright notice and license text). I don't know how compatible these are with the changes made in the festlex_CMU files (POS tags and additional words).
@rhdunn pretty interesting stuff, and it will probably be useful. I've tested it very briefly and I might be using it wrong, but it didn't accept the command line you gave. I used ./cmudict-tools --format festlex print [dict]; with --format festival I got the following message:
cmudict-tools: error: argument --format: invalid choice: 'festival' (choose from 'festlex', 'cmudict-weide', 'sphinx', 'cmudict', 'cmudict-new', 'json')
To get it running I used the festlex option. This in turn seems to have made the output format differ slightly from the cmudict-0.4.out found in festlex_CMU.tar.gz and used with make_lex.
For example chair in 0.4:
("chair" nil (((ch eh r) 1)))
and generated from your cmudict repo with cmudict-tools:
("chair" nil (ch eh1 r))
This is what mimic produces after make_cmulex, so it might be all right. It's just a bit confusing for people like me who generally don't know what's going on =) (I need to find a good write-up of all this and read through it).
Is festlex the flag you meant or is there another flag that I'm missing?
@rhdunn using your dictionary looks great!
@forslund, when make_cmulex calls the python script I wrote, the syllable structure is flattened following what festival did.
@forslund Yes, festlex is the flag I meant. I also meant that it generates the cmudict-0.4.scm format. Both have the form:
("word" pos (pronunciation))
The .scm version (which cmudict-tools generates) is a direct phoneme replacement for phonemes in cmudict (with the addition of using ax for ah0). The .out version groups phonemes based on the syllables, and pronunciation has the form:
((pronunciation) stress) ... ((pronunciation) stress)
with the vowel stress number moving to the syllable group.
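To make the relationship between the two forms concrete, here is a rough sketch that flattens the syllable-grouped .out pronunciation back into the flat .scm form, using the "chair" entries quoted earlier in this thread. The vowel list and the rule of appending the stress digit to every vowel except ax are assumptions made for illustration; this is not the code from festival or from make_cmulex.

```python
# Assumed set of vowel phonemes (lower-case, festlex style).
VOWELS = {
    "aa", "ae", "ah", "ao", "aw", "ax", "ay", "eh",
    "er", "ey", "ih", "iy", "ow", "oy", "uh", "uw",
}


def flatten(syllables):
    """Turn [(phones, stress), ...] into a flat list of phonemes.

    The stress number of each syllable group is moved back onto its vowel.
    Assumption: ax never carries a stress digit.
    """
    flat = []
    for phones, stress in syllables:
        for phone in phones:
            if phone in VOWELS and phone != "ax":
                flat.append(f"{phone}{stress}")
            else:
                flat.append(phone)
    return flat


# ("chair" nil (((ch eh r) 1)))  ->  ("chair" nil (ch eh1 r))
chair_out = [(["ch", "eh", "r"], 1)]
print(flatten(chair_out))
```

Going the other way (from .scm to .out) needs syllabification, which is what festival's lex.compile step takes care of.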
If you look in the Makefile for festlex_CMU.tar.gz (festival/lib/dicts/cmu/Makefile) the scm to out conversion is done by:
cmudict-0.4.out: cmudict-0.4.scm cmudict_extensions.scm
	cat cmudict-0.4.scm cmudict_extensions.scm >all.scm
	${ESTDIR}/../festival/bin/festival -b cmudict_compile.scm
	rm -f all.scm
The cmudict_compile.scm script is doing:
(load "cmulex.scm")
(lex.compile "all.scm" "cmudict-0.4.out")
which is what part of make_cmulex is doing during the build, so you can run something similar if the .out file is missing. Something like:
if [ ! -e cmudict-0.4.out ]
then
    cat cmudict-0.4.scm cmudict_extensions.scm >all.scm
    $FESTIVAL --heap 10000000 -b '(begin (load "cmulex.scm") (lex.compile "all.scm" "cmudict-0.4.out"))'
fi
Regarding documentation, the available information about the process is sparse and disjointed. I have built up my experience from trying to understand the code and searching for material online.
Hi,
I have created an American English Pronunciation Dictionary (AmEPD) based on cmudict 0.7 (the last Public Domain version of the dictionary). This includes:
[...] CAT-1), spelling based initialisms (IBM) and hyphenated words (there are too many hyphenated word variants and hyphenated words will primarily only vary by stress);
[...] AX for COMMA and AXR for LETTER unstressed vowels.
[...] ATHEIST noted in this issue.
There is still a lot of cleanup and consistency checking to do, but this should be a useful starting point.
NOTE: The part of speech tags used here are different to the ones used by festival. The tags for AmEPD are described in the amepd.ttl file in the amepd project, while the festlex-CMU tags are described in the festlex.ttl file of my pos-tags project. The festlex tags are different to the wp39 and wp20 tags (also described in pos-tags) used by the festival TTS program.
Hi @rhdunn!
Sounds interesting, I'll try to convert it for mimic.
Meant to come back to you about the cmudict-0.7 but forgot. I created a branch using your cmu-dict repo and tool (see rhdunn-cmudict).
When testing we found that the change from ah0 to ax makes the pronunciation slightly different, so we kept the old dict for now. Is the difference intended, or do we need to update the voices for this to sound OK?
Also, some of the emphasis levels aren't supported by mimic (I reduced those to levels that mimic includes). Do you have an opinion on how this should be handled?
@forslund Do you mean changing /AX/ to /AH0/? The festlex dictionary replaces /AH0/ with /AX/ (see the cmu2ft script in festlex-CMU). Thus, "the" is DH AX in festlex and DH AH0 in cmudict. The cmudict does not have /AX/ and /AXR/, while my amepd does. NOTE: festlex does not use /AXR/.
The cmudict uses the /AH/ vowel for STRUT and commA words, and /ER/ for NURSE and lettER words. When festlex converts /AH0/ to /AX/ (and as transcribed in the cmudict), contrast in several words is lost (esp. for um-, un- and up- words).
For the stress levels, 2 is used for secondary stress in the cmudict and is not present in festlex. From the cmu2ft script, festlex is using stress level 1 for these phonemes. This can currently be done using tr 2 1 on the output of the conversion process. To be more robust, I should modify cmudict-tools so the festvox phoneset does not have secondary stress and maps it to primary stress (2 -> 1).
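The two rules described in this comment (AH0 becomes ax, secondary stress becomes primary stress) can be sketched per phoneme as below. This is only an illustration of those two rules as stated in the thread; it is not the cmu2ft script or the cmudict-tools implementation, and the function name is made up.

```python
def to_festlex_phones(pronunciation):
    """Map a cmudict pronunciation string to festlex-style phonemes.

    Rules taken from the discussion above:
      * AH0 is written as ax
      * secondary stress (2) is mapped to primary stress (1)
    Everything else is only lower-cased.
    """
    result = []
    for phone in pronunciation.split():
        if phone == "AH0":
            result.append("ax")
        elif phone.endswith("2"):
            result.append(phone[:-1].lower() + "1")
        else:
            result.append(phone.lower())
    return result


print(to_festlex_phones("DH AH0"))
print(to_festlex_phones("EY1 TH IY0 IH2 Z AH0 M"))
```

Working per phoneme avoids the weakness of tr 2 1, which rewrites every 2 in the stream, including any that appear in a word or a variant marker rather than in a stress digit.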
NOTE: I will also need to modify cmudict-tools to handle part-of-speech. The "remove variants" command will currently strip the words containing POS information :(. It needs to be intelligent about which entry to select -- the way I have set up the amepd is for the first entries to be the common ones and the ones that should be used if no additional disambiguation is supported.
I should also add support for mapping between vocabularies, making it easier to map from the amepd context vocabulary to the festlex one.
Thanks @rhdunn! I had not seen the cmu2ft script! Your cmudict-tool makes our lives easier!
@rhdunn, oh dear... I mixed them up! That explains it... Rebuilding the cmulex using the cmudict 0.7 and cmu2ft instead of cmudict-tool + my manual conversion sounds better.
I'm gonna throw fortune at it to test more strings but so far so good.
I have updated the cmudict-tools program so that the festlex phonemes work as they do in the cmu2ft script. Things still to support:
mapping the cainteoir tagset used in my amepd to the festlex tagset used in the festival cmudict (e.g. mapping det to dt). I am looking into this at the moment.
I have the above working now with the latest cmudict-tools, so you can run:
git clone git@github.com:rhdunn/amepd.git
cd amepd
git checkout amepd-0.1-1
cmudict-tools --format=festlex --output-context=festlex --remove-duplicate-contexts print cmudict > cmudict.scm
This will give you a cmudict.scm file that is in the same format as cmudict-0.4.scm, so it should be usable by the mimic dictionary build process.
NOTE: Some entries cannot be disambiguated by part of speech alone, e.g.:
AXES(noun) AE1 K S IH0 Z #@@{ "root": "AXE" }@@
AXES(verb) AE1 K S IH0 Z #@@{ "root": "AXE" }@@
AXES(noun) AE1 K S IY0 Z #@@{ "root": "AXIS" }@@
so will look odd when in the festival format as only the first two of those entries will be included.
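The effect on the AXES entries can be illustrated with a small sketch that keeps only the first entry for each word and context pair. This mirrors the behaviour described above, not the actual cmudict-tools code; the function name and tuple layout are assumptions.

```python
def remove_duplicate_contexts(entries):
    """Keep the first entry for each (word, context) pair, in file order."""
    seen = set()
    kept = []
    for word, context, phones in entries:
        key = (word, context)
        if key in seen:
            continue  # a later entry with the same word and POS is dropped
        seen.add(key)
        kept.append((word, context, phones))
    return kept


axes = [
    ("AXES", "noun", "AE1 K S IH0 Z"),  # root: AXE
    ("AXES", "verb", "AE1 K S IH0 Z"),  # root: AXE
    ("AXES", "noun", "AE1 K S IY0 Z"),  # root: AXIS
]
for entry in remove_duplicate_contexts(axes):
    print(entry)
```

The plural of AXIS shares its word and part of speech with the plural of AXE, so it is the entry that disappears.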
The dictionary contains fixes for the words reported in the initial summary of the issue above.
Cool! I'll try it out as soon as I get time (my best guess: tomorrow). Getting an updated dict and closing this issue would be great.
Also, I'll see if I can make a guide on how to update the dictionary using @zeehio's scripts together with your dict and tool.
Hopefully I will find some time for mimic this weekend.
Yesterday I realized that "mycroft" needs to be added to the dictionary.
Thanks for working hard on this!
Yeah, it might be a good idea to add Mycroft =)
I tested the amepd dictionary and make_cmulex_helper.py seems to stumble on ' characters.
I'm not sure what the correct way to handle these characters is. @zeehio do you have any suggestions? (mimic may even strip all special characters from the input text, making these entries hard to use without some serious rewriting.)
Removing all lines that use these characters produces nice results: both penis and atheism are pronounced correctly. I need to test some more, though.
@rhdunn is the upstream dictionary interested in keeping the pronunciation of "Mycroft", or should we keep that as a local patch? And thanks for the hard work!
@forslund I will be adding Mycroft shortly as part of the updates I am making to the dictionary post 0.1.
I have checked and ' entries are not in the festival dictionaries. You can use:
grep -vF "'"
to filter out ' characters, i.e.:
cmudict-tools --format=festlex --output-context=festlex --remove-duplicate-contexts print cmudict | grep -vF "'" > cmudict.scm
Mimic/flite are handling this via 's being classed as a possessive ending part of speech class. For example, using -pw (print words):
$ bin/flite -pw -t "How is Sarah's dog?"
how is sarah 's dog
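The word list printed above suggests why entries containing ' are not needed in the lexicon: the possessive ending arrives as its own token. The following is a rough illustration of that splitting only. It is not flite's tokenizer, and the function name and regular expression are assumptions.

```python
import re


def split_possessives(text):
    """Lower-case the text and split a trailing 's into its own token."""
    words = []
    for token in re.findall(r"[A-Za-z']+", text.lower()):
        if token.endswith("'s") and len(token) > 2:
            words.extend([token[:-2], "'s"])  # sarah's -> sarah, 's
        else:
            words.append(token)
    return words


print(" ".join(split_possessives("How is Sarah's dog?")))
```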
I used a similar but more complicated grep-line =)
Thanks for clearing up the 's issue, I'll just remove the lines involved using grep. I'm going to make a clean rebuild tomorrow and create a proper pull request so people can start trying out your dict!
Given that this has been merged already this issue can be closed.
Huge thanks to @rhdunn and @forslund for doing all the hard work!
Some words are pronounced incorrectly.
The two that come to mind from my testing are Atheism and Penis.
Mimic pronounces Atheism as
A thigh ism
Theism, Theist and Atheist are pronounced correctly, so it is a bit puzzling that atheism is pronounced differently.
Penis is pronounced by mimic like the words
pen is
it should be pronounced like
pee nis
Now I tested this out with a few voices and had identical results.