Azure-Samples / cognitive-services-speech-sdk

Sample code for the Microsoft Cognitive Services Speech SDK
MIT License

Custom Chinese lexicon is not adopted by SpeakSsmlAsync #1981

Closed AndrewLang closed 1 year ago

AndrewLang commented 1 year ago

Describe the bug In Chinese, a word can have different pronunciations depending on context, so I created a custom lexicon to correct the pronunciation. The lexicon file is stored in Azure Storage and is publicly accessible, and its URL is embedded correctly in the SSML content. When the audio is generated with the SpeakSsmlAsync method, there is no change to the pronunciation.

To Reproduce Steps to reproduce the behavior:

  1. Create a SpeechSynthesizer instance with the following configuration: VoiceName = "zh-CN-YunzeNeural", Language = "zh-CN"

  2. Call SpeakSsmlAsync using var result = await synthesizer.SpeakSsmlAsync(ssml); with the following SSML content.

    <speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="zh-CN" >
    <voice name="zh-CN-YunzeNeural">
    <lexicon uri="https://matrixreader.blob.core.windows.net/public/lexicon.xml" />
    任我行冷笑道, 剑指小腹,这个小姑娘。姊姊还好吗?
    任我行大声道:你们这些人,都是我的手下败将,还不束手就擒!
    </voice>
    </speak>
  3. Save the audio content to a file

  4. Listen to the audio: the word "任我行" still has the wrong pronunciation, as if the lexicon were not applied.

  5. I also tested my lexicon file in Speech Studio, and the pronunciation is correct there, so it could be something wrong in the SDK.

Expected behavior The lexicon is applied by the SDK and the pronunciation is correct.

Version of the Cognitive Services Speech SDK Version 1.27.0

Platform, Operating System, and Programming Language

Additional context

yulin-li commented 1 year ago

@jiajzhan could you help to check?

BrianMouncer commented 1 year ago

@AndrewLang Can you please provide an example of your lexicon and the SSML that references it? Thanks.

AndrewLang commented 1 year ago

@BrianMouncer here is my SSML content; the lexicon reference is in it.

<speak xmlns="http://www.w3.org/2001/10/synthesis" xmlns:mstts="http://www.w3.org/2001/mstts" xmlns:emo="http://www.w3.org/2009/10/emotionml" version="1.0" xml:lang="zh-CN" >
<voice name="zh-CN-YunzeNeural">
<lexicon uri="https://matrixreader.blob.core.windows.net/public/lexicon.xml" />
任我行冷笑道, 剑指小腹,这个小姑娘。姊姊还好吗?
任我行大声道:你们这些人,都是我的手下败将,还不束手就擒!
重重的击了一拳。
藏灵上人也真了得,受了内伤。
老人家有点不舒服,是什么病?请的是哪位大夫?
小姑娘,你是哪里人?你叫什么名字?你家里人知道你在这儿吗?

</voice>
</speak>

jiajzhan commented 1 year ago

I am investigating. Thanks

AndrewLang commented 1 year ago

@jiajzhan any updates?

jiajzhan commented 1 year ago

Update the alphabet to 'sapi', since you are using pinyin for zh-CN. Besides, we have a custom lexicon validation tool at https://github.com/Azure-Samples/Cognitive-Speech-TTS/tree/master/CustomLexiconValidation
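
For reference, a minimal zh-CN custom lexicon using the sapi alphabet might look like the sketch below. The pronunciations are taken from entries discussed later in this thread; the exact namespace and attribute layout should be double-checked against the custom lexicon documentation.

```xml
<?xml version="1.0" encoding="utf-8"?>
<lexicon version="1.0"
         xmlns="http://www.w3.org/2005/01/pronunciation-lexicon"
         alphabet="sapi" xml:lang="zh-CN">
  <!-- pinyin with tone numbers, per the 'sapi' alphabet suggestion above -->
  <lexeme>
    <grapheme>大夫</grapheme>
    <phoneme>dai 4 fu 1</phoneme>
  </lexeme>
  <lexeme>
    <grapheme>姊姊</grapheme>
    <phoneme>jie 3 jie 3</phoneme>
  </lexeme>
</lexicon>
```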

AndrewLang commented 1 year ago

@jiajzhan even when I set it to sapi, it doesn't work as expected. The pronunciation is the same as the audio generated without the lexicon.

jiajzhan commented 1 year ago

Try again, it should work now. When the lexicon content changes, it takes at least 15 minutes for the service to pick up the latest content.
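
While waiting for the cache to refresh, one possible workaround (an assumption on my part, not confirmed in this thread) is to version the lexicon URI with a query string so the service fetches it as a new resource:

```xml
<!-- bumping the query string makes the URI distinct from the cached one -->
<lexicon uri="https://matrixreader.blob.core.windows.net/public/lexicon.xml?v=2" />
```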

AndrewLang commented 1 year ago

@jiajzhan I gave it overnight and tried again; it doesn't look like the lexicon was picked up.

AndrewLang commented 1 year ago

@jiajzhan any suggestions to make it work?

jiajzhan commented 1 year ago

"任我行" has a different pronunciation now when using your latest lexicon; I tried it with YunzeNeural. So which words are still not working as you expect?

AndrewLang commented 1 year ago

@jiajzhan, with my lexicon, "大夫" should be "dai 4 fu 1", "姊姊" should be "jie 3 jie 3", "藏灵" should be "zhang 4 ling 2", "了得" should be "liao 3 de 2"...

The problem is that the lexicon is NOT applied.

I did a test in Speech Studio with the same content and the same lexicon, and the pronunciation is correct.

[screenshot: Speech Studio test with the same content and lexicon]

jiajzhan commented 1 year ago

"大夫" is not working with the custom lexicon; we are investigating. The others work well on my local machine with the Yunze voice.

jiajzhan commented 1 year ago

"姊姊" should be "jie 3 jie 3", "藏灵" should be "zhang 4 ling 2", "了得" should be "liao 3 de 2". These work OK on my local machine, so are you still using the SSML you shared above?

jiajzhan commented 1 year ago

Please set the pronunciation for '哪位大夫' to 'na 3 wei 4 dai 4 fu 1'; then it will work.
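
Expressed as a lexeme entry (assuming the sapi alphabet suggested earlier in the thread), that suggestion would look roughly like:

```xml
<lexeme>
  <grapheme>哪位大夫</grapheme>
  <phoneme>na 3 wei 4 dai 4 fu 1</phoneme>
</lexeme>
```

A longer grapheme such as this matches the whole phrase, which can help when a shorter entry like 大夫 alone is not picked up.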

AndrewLang commented 1 year ago

@jiajzhan interesting. Did you use the lexicon file URL https://matrixreader.blob.core.windows.net/public/lexicon.xml? Are there other settings I need to set? No matter how hard I try, it doesn't work on my side. Can you share your code?

AndrewLang commented 1 year ago

BTW, my service region is eastus, could that be a problem?

jiajzhan commented 1 year ago

Region should not be the problem, I will share my code soon.

jiajzhan commented 1 year ago

public static async Task CustomLexiconRequest()
{
    var speechConfig = SpeechConfig.FromSubscription(subscriptionKey, subscriptionRegion);
    speechConfig.SpeechSynthesisVoiceName = "Microsoft Server Speech Text to Speech Voice (zh-CN, YunyeNeural)";

    string fileName = "SpeechSynthesisOutputCustomLexicon.wav";
    var fileOutput = AudioConfig.FromWavFileOutput(fileName);
    Console.OutputEncoding = Encoding.UTF8;

    using (var speechSynthesizer = new SpeechSynthesizer(speechConfig, fileOutput))
    {
        string text = "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xmlns:mstts='https://www.w3.org/2001/mstts' xml:lang='zh-CN'><voice xml:lang='zh-CN' xml:gender='Male' name='Microsoft Server Speech Text to Speech Voice (zh-CN, YunyeNeural)'><lexicon uri='https://matrixreader.blob.core.windows.net/public/lexicon.xml'/>" +
            "任我行冷笑道, 剑指小腹,这个小姑娘。 姊姊还好吗?</voice></speak>";

        var speechSynthesisResult = await speechSynthesizer.SpeakSsmlAsync(text);
        OutputSpeechSynthesisResult(speechSynthesisResult, text);
    }

    Console.WriteLine("Press any key to exit...");
    Console.ReadKey();
}

Try this code?

AndrewLang commented 1 year ago

I see your code used a different voice name, YunyeNeural, instead of YunzeNeural. So it seems like YunzeNeural doesn't support custom lexicons?

jiajzhan commented 1 year ago

Oh, that's a mistake. Have you tried YunyeNeural for your case? Does it work?

AndrewLang commented 1 year ago

I checked; it doesn't work as expected. So far I have only seen the custom lexicon work in Speech Studio. There is no useful information for debugging it.

jiajzhan commented 1 year ago

Could you share your code/project as a zip with me?

AndrewLang commented 1 year ago

My project is relatively complex, with many services, so it wouldn't help with diagnosis. My test code is pretty straightforward; here is the main code.


class Program
{
    public static async Task Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        var subscriptionKey = "";
        var serviceRegion = "eastus";
        var voiceName = "zh-CN-YunzeNeural";//  "zh-CN-henan-YundengNeural"; // "zh-CN-liaoning-XiaobeiNeural"; // *** "zh-CN-YunzeNeural"; // ** "zh-CN-YunyeNeural"; //** "zh-CN-YunyangNeural";//"zh-CN-YunxiNeural"; // ** "zh-CN-YunjianNeural"; // "zh-CN-YunhaoNeural";// "zh-CN-YunfengNeural";// "zh-HK-DannyNeural"; 
        var language = "zh-CN";//  "zh-CN"; //"zh-CN-henan"; //  "zh-CN-liaoning"

        var config = SpeechConfig.FromSubscription(subscriptionKey, serviceRegion);
        config.SpeechSynthesisVoiceName = voiceName;
        config.SpeechRecognitionLanguage = language;
        config.SpeechSynthesisLanguage = language;
        config.EnableAudioLogging();
        //config.EnableDictation();
        //config.SetProfanity(ProfanityOption.Removed);

        using var synthesizer = new SpeechSynthesizer(config, null);

        var ssml = LoadTestSsml();
        Console.WriteLine(ssml);

        using var result = await synthesizer.SpeakSsmlAsync(ssml);

        if (result.Reason == ResultReason.SynthesizingAudioCompleted)
        {
            using var audioStream = AudioDataStream.FromResult(result);
            await audioStream.SaveToWaveFileAsync("output_with_lexicon.wav");
            Console.WriteLine("Audio file was written to file");
        }
        else
        {
            Console.WriteLine("");
            var detailed = SpeechSynthesisCancellationDetails.FromResult(result);
            Console.Write($"{result.Reason} by the service, {detailed.ErrorDetails}");
            Console.WriteLine("");
        }
    }

    private static string LoadTestSsml()
    {
        var file = "ssml.xml";

        if (File.Exists(file))
        {
            return File.ReadAllText(file);
        }

        return string.Empty;
    }
}

AndrewLang commented 1 year ago

@jiajzhan I created another test example with the Node.js SDK. With it the lexicon was applied, but there is another problem: it only generates about 11 seconds of audio, and I got error code 1007.

Since the JS SDK works, it is probably a problem with the C# SDK. Is the SDK open source? I can help check if it is.

jiajzhan commented 1 year ago

@AndrewLang I could not repro your issue using your code; on my local machine the custom lexicon works well. My SDK version is 1.26.0.

AndrewLang commented 1 year ago

@jiajzhan thanks for the info. I did more testing and found that if any words are not supported or not well recognized, the whole lexicon file is not applied, at least that is how it looks. Also, the error message is pretty confusing and not helpful for diagnosis. Any further insight would be appreciated.

jiajzhan commented 1 year ago

Hi @AndrewLang, I was on vacation the past few days. Regarding "if there are words that are not supported or well recognized, the whole lexicon file is not adopted": that's correct. If one word is given a wrong pronunciation, the whole lexicon won't work.
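
Since one bad entry silently disables the whole file, it helps to sanity-check the lexicon locally before uploading. Below is a hypothetical structural check (not the official validation tool linked earlier in the thread), sketched in Python with only the standard library; it verifies the XML shape, not whether the service accepts each pronunciation.

```python
import xml.etree.ElementTree as ET

# W3C Pronunciation Lexicon namespace used by custom lexicon files
PLS_NS = "{http://www.w3.org/2005/01/pronunciation-lexicon}"

def check_lexicon(xml_text):
    """Return a list of structural problems found in a custom lexicon document."""
    problems = []
    root = ET.fromstring(xml_text)  # raises ParseError if the XML is not well-formed
    if root.get("alphabet") != "sapi":
        # per the advice above, zh-CN pinyin entries need the 'sapi' alphabet
        problems.append("alphabet should be 'sapi' for zh-CN pinyin entries")
    for i, lexeme in enumerate(root.iter(PLS_NS + "lexeme")):
        if lexeme.find(PLS_NS + "grapheme") is None:
            problems.append(f"lexeme {i}: missing <grapheme>")
        if lexeme.find(PLS_NS + "phoneme") is None:
            problems.append(f"lexeme {i}: missing <phoneme>")
    return problems
```

A check like this catches malformed XML and incomplete lexemes early; whether an individual pinyin string is accepted still has to be verified with the validation tool or by listening to the output.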

jiajzhan commented 1 year ago

So are you testing with the lexicon https://matrixreader.blob.core.windows.net/public/lexicon.xml ? I tested this lexicon before and it was working.

AndrewLang commented 1 year ago

yes, that's the lexicon I used.

I tried removing some of the words and testing them one by one. From a human perspective those words are correct, so how does the service determine whether an entry is right or not? For example, 大夫 should in some contexts be read as "dai 4 fu 1", but it's never picked up.

jiajzhan commented 1 year ago

The lexicon works well on my local machine. Is this still an issue for you? If so, can we set up a quick meeting to discuss it?

AndrewLang commented 1 year ago

@jiajzhan thanks for your help. I think it works now. The lexicon is tricky, and there is not much documentation for it, especially for Chinese.

M-Hietala commented 1 year ago

Based on the latest comment, I understand this issue can be closed now.