Thanks for your information. All sources above look great.
My focus is to build free pronunciation dictionaries for multiple languages, not just English. The reason I prefer IPA to Arpabet is simply that the former works across languages. I'm not particularly attached to Wiktionary.
Does CMU provide pronouncing dictionaries for languages other than English? If so, please let us know.
I haven't thought about differentiating between dialects of English. It's definitely important for some use cases, though.
I used this one: http://ftp.acc.umu.se/mirror/wikimedia.org/dumps/enwiktionary/20170220/enwiktionary-20170220-pages-articles-multistream.xml.bz2. It's not small, but it's not too big either. We can handle it because my code processes one entry at a time (and discards it before moving on to the next entry).
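In case it helps anyone reproducing this, here is a minimal sketch of that one-entry-at-a-time approach in Python, iterating over the compressed dump with `iterparse`. The file name and the MediaWiki export namespace string are assumptions based on the 20170220 dump, so adjust them to whatever file you actually download:

```python
import bz2
import xml.etree.ElementTree as ET

# Namespace used by the 2017-era dumps; newer dumps may use a different export version.
NS = "{http://www.mediawiki.org/xml/export-0.10/}"
DUMP = "enwiktionary-20170220-pages-articles-multistream.xml.bz2"

def iter_entries(path):
    """Yield (title, wikitext) pairs one page at a time, so the whole
    dump never has to sit in memory."""
    with bz2.open(path, "rb") as f:
        for _, elem in ET.iterparse(f, events=("end",)):
            if elem.tag == NS + "page":
                title = elem.findtext(NS + "title")
                text = elem.findtext(NS + "revision/" + NS + "text") or ""
                yield title, text
                elem.clear()  # discard this entry before moving on to the next

for title, text in iter_entries(DUMP):
    if "{{IPA" in text:       # crude filter: only pages carrying IPA templates
        pass                  # ...extract pronunciations here
```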
FYI, my spin-off project focuses on disambiguating heteronyms such as project, minute, object, and segment in context. We will be happy to make the results/data publicly available. Stay tuned if you're interested.
The CMU dictionary uses Arpabet, but it maps directly to IPA. Is that of interest for this project? Or is the scope really the specific focus of representing Wiktionary pronunciations in IPA?
If it's to produce many different pronouncing dictionaries in IPA, I might suggest some of the following:
- https://github.com/wwesantos/arpabet-to-ipa
- http://people.umass.edu/nconstan/CMU-IPA/
- https://github.com/matthewmorrone1/cmudict-ipa
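For context, the core of those converters is basically a phone-to-symbol table plus stress handling. A rough sketch of the idea (the table below is deliberately abbreviated to a handful of phones, so it's an illustration rather than a complete converter) might be:

```python
# Minimal Arpabet-to-IPA sketch: a phone table plus stress handling.
# The table is deliberately abbreviated; a full converter covers all ~39 phones
# (and handles details such as unstressed AH usually being rendered as ə).
ARPABET_TO_IPA = {
    "AA": "ɑ", "AE": "æ", "AH": "ʌ", "AO": "ɔ", "AY": "aɪ",
    "EH": "ɛ", "ER": "ɝ", "IH": "ɪ", "IY": "i", "OW": "oʊ",
    "B": "b", "D": "d", "JH": "dʒ", "K": "k", "L": "l",
    "M": "m", "N": "n", "P": "p", "R": "ɹ", "S": "s",
    "T": "t", "V": "v", "Z": "z",
}
STRESS = {"1": "ˈ", "2": "ˌ", "0": ""}

def arpabet_to_ipa(phones):
    """Convert a CMUdict pronunciation such as ['P', 'R', 'AA1', 'JH', 'EH0', 'K', 'T']
    into an IPA string, turning each stress digit into a mark before its vowel."""
    out = []
    for phone in phones:
        if phone[-1].isdigit():   # vowels carry a stress digit in CMUdict
            out.append(STRESS[phone[-1]] + ARPABET_TO_IPA[phone[:-1]])
        else:
            out.append(ARPABET_TO_IPA[phone])
    return "".join(out)

print(arpabet_to_ipa("P R AA1 JH EH0 K T".split()))
# -> pɹˈɑdʒɛkt (the noun "project"; the stress mark lands before the vowel, not the syllable onset)
```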
Alternatively, if the scope is restricted to Wiktionary, then I'd also expect dictionaries for different Englishes. How do you deal with the different accents (e.g. US vs. British), and which files do you actually use, e.g. from http://ftp.acc.umu.se/mirror/wikimedia.org/dumps/enwiktionary/20170220/? I would hope to use one of the smaller files rather than the entire 2.5 GB XML file. Is that possible?
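On the accent question, one rough approach is to keep the accent label attached to each transcription when pulling IPA out of the Pronunciation sections. The sketch below assumes lines of the form `* {{a|US}} {{IPA|/.../|lang=en}}`, which is how many 2017-era English entries are marked, though template usage isn't fully consistent across pages:

```python
import re

# Rough sketch for extracting accent-labelled IPA from Wiktionary wikitext.
# Assumes "* {{a|US}} {{IPA|/ˈwɔtɚ/|lang=en}}"-style lines; a real extractor
# needs more cases (nested templates, {{IPA}} without an {{a}} label, etc.).
ACCENT_RE = re.compile(r"\{\{a\|([^}]+)\}\}")
IPA_RE = re.compile(r"\{\{IPA\|([^}]*)\}\}")

def extract_pronunciations(wikitext):
    """Return a list of (accent, transcriptions) pairs; accent is None when unmarked."""
    results = []
    for line in wikitext.splitlines():
        ipa_match = IPA_RE.search(line)
        if not ipa_match:
            continue
        accent_match = ACCENT_RE.search(line)
        accent = accent_match.group(1) if accent_match else None
        # keep only the slash- or bracket-delimited transcriptions among the template args
        ipa = [p for p in ipa_match.group(1).split("|")
               if p.startswith(("/", "["))]
        results.append((accent, ipa))
    return results

sample = "* {{a|US}} {{IPA|/ˈwɔtɚ/|lang=en}}\n* {{a|UK}} {{IPA|/ˈwɔːtə/|lang=en}}"
print(extract_pronunciations(sample))
# -> [('US', ['/ˈwɔtɚ/']), ('UK', ['/ˈwɔːtə/'])]
```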