bootphon / phonemizer

Simple text to phones converter for multiple languages
https://bootphon.github.io/phonemizer/
GNU General Public License v3.0

Does phonemizer support mixed-language input? #156

JohnHerry closed this issue 12 months ago

JohnHerry commented 12 months ago

Is your feature request related to a problem? Please describe. Does phonemizer support mixed-language input, e.g. "我想买一部iphone。" ("I want to buy an iPhone.")?

Describe the solution you'd like. The desired output is the IPA phonemes of this sentence, with a guarantee that there are no syllable conflicts.

Describe alternatives you've considered

Additional context. We would also like a mapping between each input token and its IPA phones, e.g. {"我": [IPA list of 我], "iphone": [IPA list of iphone]}.

mmmaat commented 12 months ago

Hi, phonemizer (with the espeak backend) can detect language switches, mostly to English. This is quite limited, though: you cannot specify which languages are involved, or which part of the text is in which language. See the language_switch option at https://bootphon.github.io/phonemizer/api_reference.html.

$ echo '我想买一部iphone。' | phonemize -l cmn -b espeak -w '; '
[WARNING] 1 utterances containing language switches on lines 1
[WARNING] extra phones may appear in the "cmn" phoneset
[WARNING] language switch flags have been kept (applying "keep-flags" policy)
[WARNING] words count mismatch on 100.0% of the lines (1/1)
wo2; ɕiɑ2ŋ; mai2; ji5; pu5; (en)aɪfəʊn(zh);
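
Since the "keep-flags" policy leaves markers such as (en) and (zh) in the output, the phone string can be split into per-language segments afterwards. The following is a minimal sketch that assumes the flags always look like the ones in the output above (a language code in parentheses) and that phones never contain parentheses themselves. `split_by_language` and its parameters are hypothetical helper names, not part of phonemizer.

```python
import re

# Assumed flag shape: a lowercase language code in parentheses, e.g. "(en)" or "(zh)".
FLAG = re.compile(r"\(([a-z]{2,3}(?:-[a-z0-9]+)*)\)")


def split_by_language(phonemized, default_lang, strip_chars=" ;"):
    """Split a phonemized line into (language, phones) segments.

    `default_lang` labels the text before the first flag. `strip_chars`
    removes the word separator (here "; ") from the edges of each segment.
    """
    segments = []
    lang = default_lang
    pos = 0
    for match in FLAG.finditer(phonemized):
        chunk = phonemized[pos:match.start()].strip(strip_chars)
        if chunk:
            segments.append((lang, chunk))
        lang = match.group(1)  # language in effect after this flag
        pos = match.end()
    tail = phonemized[pos:].strip(strip_chars)
    if tail:
        segments.append((lang, tail))
    return segments


line = "wo2; ɕiɑ2ŋ; mai2; ji5; pu5; (en)aɪfəʊn(zh);"
for lang, phones in split_by_language(line, default_lang="cmn"):
    print(lang, phones)
# cmn wo2; ɕiɑ2ŋ; mai2; ji5; pu5
# en aɪfəʊn
```

The trailing (zh) flag produces no segment here because only the separator follows it.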

The word -> IPA mapping is not implemented, but there is already a feature request for it; see #96.
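
Until such a mapping exists, one possible workaround is to split the input into script runs yourself and phonemize each run separately, so you know which output belongs to which piece of input. The sketch below covers only the splitting step. `script_runs` is a hypothetical helper, and the character ranges (the basic CJK ideograph block and ASCII letters and digits) are a simplifying assumption.

```python
import re

# A run of CJK ideographs (basic block only) or a run of ASCII letters/digits.
# Punctuation such as "。" matches neither alternative and is skipped.
RUN = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z0-9]+")


def script_runs(text):
    """Return a list of (script, run) pairs in input order."""
    runs = []
    for match in RUN.finditer(text):
        run = match.group()
        script = "han" if "\u4e00" <= run[0] <= "\u9fff" else "latin"
        runs.append((script, run))
    return runs


print(script_runs("我想买一部iphone。"))
# [('han', '我想买一部'), ('latin', 'iphone')]
```

Each run could then be passed to phonemizer with the matching language (for example `cmn` for the Han runs). Going further and phonemizing each Han character on its own would give a per-character map, but the backend would then lose the context it may need to pick the right reading.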

JohnHerry commented 12 months ago

Thanks for the help. By the way, I guess the IPA output in the example contains tone symbols, but it looks strange: the two characters 一部 (ji5; pu5;) get the same tone "5", whereas as a native Mandarin speaker I think they should differ. Is there a bug in the relevant module?

I have another question: is there a way to get the full IPA alphabet? We would like to design an IPA phone set that supports multilingual text.

A third question: how does phonemizer handle polyphonic characters? Many Mandarin characters have several Pinyin readings, and the correct one is determined by the surrounding text. For example, the character "着" is read "zhe" in the context of "走" but "zhao" in the context of "火", and I think the IPA transcriptions should differ as well. How does phonemizer handle this? With LM-based prediction?

mmmaat commented 12 months ago

Your questions all concern the espeak-ng backend, not phonemizer itself, which is a "simple" wrapper around it. Please look for answers there, for example at https://github.com/espeak-ng/espeak-ng/issues?q=mandarin and https://github.com/espeak-ng/espeak-ng/blob/master/dictsource/cmn_list. Best.
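
For readers wondering how a word list can resolve polyphonic characters without a language model, here is a toy sketch of greedy longest-match lookup, one common dictionary-based approach. It is not espeak-ng's code or data. The lexicon, the `to_pinyin` name, and the choice of "zhe5" as the default reading of "着" are all assumptions made for illustration.

```python
# Toy lexicon (hypothetical): multi-character entries take priority over
# single-character defaults because longer matches are tried first.
LEXICON = {
    "走着": ["zou3", "zhe5"],
    "着火": ["zhao2", "huo3"],
    "着": ["zhe5"],
    "走": ["zou3"],
    "火": ["huo3"],
}


def to_pinyin(text, lexicon=LEXICON):
    """Transcribe `text` by greedy longest-match lookup, left to right."""
    max_len = max(len(key) for key in lexicon)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in lexicon:
                out.extend(lexicon[piece])
                i += n
                break
        else:
            # Unknown character: pass it through unchanged.
            out.append(text[i])
            i += 1
    return out


print(to_pinyin("走着"))  # ['zou3', 'zhe5']
print(to_pinyin("着火"))  # ['zhao2', 'huo3']
```

With this scheme the reading of "着" depends on whether a longer entry containing it matches, which is why the coverage of the word list matters so much.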

JohnHerry commented 12 months ago

Thank you very much