Open aparrish opened 5 years ago
I see @davidlday has fixed cmudict in https://github.com/prosegrinder/python-cmudict/pull/14, released in the latest cmudict 0.4.3. Now, calling cmudict.entries() returns the data without comments. 👍
However, the problem remains in Pronouncing because it's loading the data file directly and doesn't post-process out the comments.
Here's a quick hack to fix Pronouncing:
diff --git a/pronouncing/__init__.py b/pronouncing/__init__.py
index b5f8d0e..20ceddb 100755
--- a/pronouncing/__init__.py
+++ b/pronouncing/__init__.py
@@ -47,15 +47,14 @@ def init_cmu(filehandle=None):
     """
     global pronunciations, lookup, rhyme_lookup
     if pronunciations is None:
-        if filehandle is None:
-            filehandle = cmudict.dict_stream()
-        pronunciations = parse_cmu(filehandle)
-        filehandle.close()
+        pronunciations = cmudict.entries()
         lookup = collections.defaultdict(list)
         for word, phones in pronunciations:
+            phones = " ".join(phones)
             lookup[word].append(phones)
         rhyme_lookup = collections.defaultdict(list)
         for word, phones in pronunciations:
+            phones = " ".join(phones)
             rp = rhyming_part(phones)
             if rp is not None:
                 rhyme_lookup[rp].append(word)
A couple of issues with this quick hack:
It changes the API of init_cmu. Or rather, it ignores the filehandle parameter completely. Perhaps that's fine if the file duties are delegated to cmudict. (See also parse_cmu, which takes a file handle.)
The different return types mean extra processing, possibly a performance hit:
cmudict.entries() returns a (str, list) tuple (e.g. 'bout ['B', 'AW1', 'T'])
parse_cmu() returns a (str, str) tuple (e.g. 'bout B AW1 T)
Hmm, I think I'd prefer a solution that retains the ability to load custom cmudict-formatted data directly. I have used this feature a handful of times in my own projects.
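On the return-type difference noted above: if the list-shaped entries were used, the join could be done once up front rather than in both loops. A minimal sketch, where join_phones is a hypothetical helper (not part of Pronouncing or cmudict) and the sample entry is hand-written in the shape described in this thread:

```python
def join_phones(entries):
    """Convert (word, [phone, ...]) pairs into (word, "phone ...") pairs.

    Hypothetical helper for illustration; not part of Pronouncing or cmudict.
    """
    return [(word, " ".join(phones)) for word, phones in entries]


# Hand-written sample in the (str, list) shape described for cmudict.entries().
entries = [("'bout", ["B", "AW1", "T"])]
print(join_phones(entries))  # [("'bout", 'B AW1 T')]
```

This still walks the whole dictionary one extra time, so it doesn't remove the possible performance hit, it only avoids paying for the join twice.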
Please see PR #53 to strip comments and retain the API.
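For anyone reading along, here is roughly what "strip comments and retain the API" could look like. This is only a sketch under assumptions, not the code in the PR: parse_cmu_stripping_comments is a hypothetical name, it reads text lines (the real parse_cmu may handle bytes), and the sample lines are made up. The word is split off before the comment is removed, so a headword that itself starts with # would not be truncated.

```python
import io
import re


def parse_cmu_stripping_comments(filehandle):
    """Parse cmudict-formatted text lines into (word, phones) string tuples.

    Hypothetical sketch; not the actual parse_cmu from Pronouncing.
    """
    pronunciations = []
    for line in filehandle:
        # Whole-line comments start with ';' in cmudict-formatted files.
        if line.startswith(";"):
            continue
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # blank line or a word with no phones
        word, rest = parts
        # Drop a trailing '#' comment from the phones only.
        phones = rest.split("#", 1)[0].strip()
        # Remove a variant marker such as 'read(2)' -> 'read'.
        word = re.sub(r"\(\d+\)$", "", word.lower())
        pronunciations.append((word, phones))
    return pronunciations


# Made-up sample data in cmudict format, including a trailing comment.
sample = io.StringIO(
    ";;; a whole-line comment\n"
    "adverse AE0 D V ER1 S\n"
    "example IH0 G Z AE1 M P AH0 L # made-up trailing comment\n"
)
print(parse_cmu_stripping_comments(sample))
```

Because it still takes a file handle and returns (str, str) tuples, callers that pass their own cmudict-formatted file would not need to change.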
A handful of words have extra non-phone content in their pronunciations, e.g.
[(k, v) for k, v in pr.pronunciations if '#' in v]
evaluates to...
This should obviously not be the case! There may be other instances like this; I haven't had time to check. I imagine it's a problem with the upstream module providing the pronunciations.
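To look for other instances beyond '#', one option is to flag any pronunciation containing a token that doesn't look like a phone. A rough sketch: find_non_phone_entries and PHONE_RE are hypothetical names, the pattern (uppercase letters plus an optional stress digit 0-2) is my assumption about what a valid phone looks like, and the sample data is made up rather than taken from the real dictionary.

```python
import re

# Assumed shape of a phone: uppercase letters, optional stress digit 0-2.
PHONE_RE = re.compile(r"^[A-Z]+[0-2]?$")


def find_non_phone_entries(pronunciations):
    """Return the (word, phones) pairs where some token is not phone-shaped.

    Hypothetical helper for illustration; not part of Pronouncing.
    """
    return [
        (word, phones)
        for word, phones in pronunciations
        if not all(PHONE_RE.match(token) for token in phones.split())
    ]


# Made-up sample in the (str, str) shape of pr.pronunciations.
sample = [
    ("adverse", "AE0 D V ER1 S"),
    ("example", "IH0 G Z AE1 M P AH0 L # made-up comment"),
]
print(find_non_phone_entries(sample))
```

Running the same function over pr.pronunciations after init_cmu() should list every affected word, not just the ones containing '#'.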