aparrish / pronouncingpy

A simple interface for the CMU pronouncing dictionary
BSD 3-Clause "New" or "Revised" License

extra non-phones in phones for several words #49

Open aparrish opened 5 years ago

aparrish commented 5 years ago

A handful of words have extra non-phone content in their pronunciations, e.g. [(k, v) for k, v in pr.pronunciations if '#' in v] evaluates to...

[("d'artagnan", 'D AH0 R T AE1 NG Y AH0 N # foreign french'),
 ('danglar', 'D AH0 NG L AA1 R # foreign french'),
 ('danglars', 'D AH0 NG L AA1 R Z # foreign french'),
 ('gdp', 'G IY1 D IY1 P IY1 # abbrev'),
 ('hiv', 'EY1 CH AY1 V IY1 # abbrev'),
 ('porthos', 'P AO0 R T AO1 S # foreign french'),
 ('spieth', 'S P IY1 TH # name'),
 ('spieth', 'S P AY1 AH0 TH # old')]

This should obviously not be the case! There may be other instances like this—I haven't had time to check. I imagine it's a problem with the upstream module providing the pronunciations.
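
One rough way to check for other instances, beyond searching for '#', is to flag any pronunciation containing a token that does not look like a phone. This is only a sketch: the pattern below assumes a phone is a run of uppercase letters with an optional stress digit (0, 1 or 2), and the names PHONE_RE and has_non_phones are made up for illustration. The 'porthos' entry in the sample has its comment removed by hand to act as a clean control.

```python
import re

# Assumption: a phone is uppercase letters plus an optional stress digit 0-2.
PHONE_RE = re.compile(r"[A-Z]+[012]?")


def has_non_phones(phones):
    """Return True if any space-separated token does not look like a phone."""
    return any(PHONE_RE.fullmatch(token) is None for token in phones.split())


# Small inline sample in the same (word, phones) shape as the entries above.
sample = [
    ("gdp", "G IY1 D IY1 P IY1 # abbrev"),
    ("spieth", "S P IY1 TH # name"),
    ("porthos", "P AO0 R T AO1 S"),  # hand-cleaned control entry
]

# Keep only the entries with at least one non-phone token.
suspect = [(word, phones) for word, phones in sample if has_non_phones(phones)]
```

The same filter could be run over pr.pronunciations in place of the inline sample.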

hugovk commented 4 years ago

I see @davidlday has fixed cmudict in https://github.com/prosegrinder/python-cmudict/pull/14, released in the latest cmudict 0.4.3. Now, calling cmudict.entries() returns the data without comments. 👍

However, the problem remains in Pronouncing because it loads the data file directly and doesn't strip out the comments.

Here's a quick hack to fix Pronouncing:

diff --git a/pronouncing/__init__.py b/pronouncing/__init__.py
index b5f8d0e..20ceddb 100755
--- a/pronouncing/__init__.py
+++ b/pronouncing/__init__.py
@@ -47,15 +47,14 @@ def init_cmu(filehandle=None):
     """
     global pronunciations, lookup, rhyme_lookup
     if pronunciations is None:
-        if filehandle is None:
-            filehandle = cmudict.dict_stream()
-        pronunciations = parse_cmu(filehandle)
-        filehandle.close()
+        pronunciations = cmudict.entries()
         lookup = collections.defaultdict(list)
         for word, phones in pronunciations:
+            phones = " ".join(phones)
             lookup[word].append(phones)
         rhyme_lookup = collections.defaultdict(list)
         for word, phones in pronunciations:
+            phones = " ".join(phones)
             rp = rhyming_part(phones)
             if rp is not None:
                 rhyme_lookup[rp].append(word)

There are a couple of issues with this quick hack:

  1. It changes the API of init_cmu. Or rather, it ignores the filehandle parameter completely. Perhaps that's fine if the file duties are delegated to cmudict. (See also parse_cmu which takes a file handle.)

  2. Different return types mean extra processing, and possibly a performance hit:

    • cmudict.entries() returns (str, list) tuples (e.g. ("'bout", ['B', 'AW1', 'T']))
    • parse_cmu() returns (str, str) tuples (e.g. ("'bout", 'B AW1 T'))
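
To make the shape difference concrete, here is a minimal sketch of converting (str, list) entries into the (str, str) form in a single pass, so that each phone list is joined once rather than once per loop as in the diff above. The helper name join_phones is hypothetical and is not part of either library.

```python
def join_phones(entries):
    """Yield (word, phones) pairs with phones joined into one string."""
    for word, phones in entries:
        yield word, " ".join(phones)


# One entry in the (str, list) shape described above.
list_entries = [("'bout", ["B", "AW1", "T"])]

# The same entry in the (str, str) shape.
str_entries = list(join_phones(list_entries))
```

Whether the extra join matters for performance would need measuring; this only shows the conversion itself.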

aparrish commented 4 years ago

Hmm, I think I'd prefer a solution that retains the ability to load custom cmudict-formatted data directly—I have used this feature a handful of times in my own projects.

hugovk commented 4 years ago

Please see PR #53 to strip comments and retain the API.
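
For readers following along, this is a rough sketch of what stripping comments while still accepting a file handle might look like. It is not the code from the PR. The function name parse_cmu_sketch is hypothetical, and the handling of ';' comment lines and '(2)'-style variant markers is an assumption about the data format.

```python
import io
import re


def parse_cmu_sketch(filehandle):
    """Parse cmudict-formatted lines into (word, phones) pairs, dropping comments."""
    entries = []
    for line in filehandle:
        if line.startswith(";"):
            continue  # assumption: lines starting with ';' are whole-line comments
        parts = line.strip().split(None, 1)
        if len(parts) != 2:
            continue  # blank line, or a word with no phones
        word, phones = parts
        # Split the word off first so a '#' inside a word is left alone,
        # then drop anything after '#' in the phones.
        phones = phones.split("#", 1)[0].strip()
        # Assumption: alternate pronunciations are marked like 'spieth(2)'.
        word = re.sub(r"\(\d+\)$", "", word)
        entries.append((word, phones))
    return entries


# Any file-like object works, so custom cmudict-formatted data can still be loaded.
data = io.StringIO(
    "gdp G IY1 D IY1 P IY1 # abbrev\n"
    "spieth S P IY1 TH # name\n"
    "spieth(2) S P AY1 AH0 TH # old\n"
)
entries = parse_cmu_sketch(data)
```

Because the function takes a file handle, custom cmudict-formatted data can still be loaded directly.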