VRCWizard / TTS-Voice-Wizard

Speech to Text to Speech. Song now playing. Sends text as OSC messages to VRChat to display on avatar. (STTTS) (Speech to TTS) (VRC STT System) (VTuber TTS)
https://TTSVoiceWizard.com
MIT License
574 stars, 66 forks

[Feature Request] STT (Whisper) #46

Closed: chrisk414 closed this issue 11 months ago

chrisk414 commented 1 year ago

Hi, I have a couple of feature requests that would make working with STT easier.

  1. I would like to change my chat language quickly so I can talk to people who speak different languages. Right now, changing the language requires restarting STT (clicking "Speech to Text to Speech"). This has a couple of problems: clicking "Speech to Text to Speech" in quick succession can cause crashes, and restarting STT takes a bit of time to load the ggml model and is a bit unstable at startup. It would therefore be nice to change the language without restarting STT. I think the ggml model should be loaded only once, or only when the model itself is changed.

  2. I would like to temporarily "pause" voice capture without stopping STT. Restarting STT can be problematic, as I mentioned above. Is it possible to add a toggle button right next to "Speech to Text to Speech" to temporarily stop capturing voice? That way I can keep STT from going crazy, and STT gets time to clear its buffer while it is not capturing.

Please let me know if you have a better idea.

Cheers!

VRCWizard commented 1 year ago
  1. I'll have to look into whether the Whisper parameters can be changed while it's running.

  2. Added (sort of). There does not seem to be an easy way to actually pause Whisper and mute audio capture, although the text output can be stopped. So your audio is still being captured and processed; the results are just not output.

https://github.com/VRCWizard/TTS-Voice-Wizard/releases/tag/v1.5.1
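
The "sort of" pause described in item 2 can be pictured as a gate on the output path: recognition keeps running, but its results are dropped while the gate is closed. The following is only a minimal sketch of that idea. `OutputGate`, `SetPaused`, and `TryForward` are hypothetical names chosen for illustration and are not taken from the TTS-Voice-Wizard code base.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of TTS-Voice-Wizard): drops recognized text
// while paused, without touching the recognizer or the audio capture.
public sealed class OutputGate
{
    private int paused; // 0 = forwarding, 1 = paused

    public bool IsPaused => Volatile.Read(ref paused) == 1;

    // Intended to be called from the UI thread, e.g. by a toggle button.
    public void SetPaused(bool value) => Volatile.Write(ref paused, value ? 1 : 0);

    // Intended to be called from the recognition thread for each segment.
    // Returns true if the text was forwarded, false if it was dropped.
    public bool TryForward(string text, Action<string> sink)
    {
        if (IsPaused || string.IsNullOrWhiteSpace(text))
            return false;

        sink(text);
        return true;
    }
}

public static class OutputGateDemo
{
    public static void Main()
    {
        var gate = new OutputGate();

        gate.TryForward("hello", Console.WriteLine);   // forwarded
        gate.SetPaused(true);
        gate.TryForward("ignored", Console.WriteLine); // dropped while paused
        gate.SetPaused(false);
        gate.TryForward("back", Console.WriteLine);    // forwarded again
    }
}
```

A gate like this keeps the pause toggle cheap because nothing is restarted. The trade-off, as noted above, is that audio is still captured and processed while paused.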

chrisk414 commented 1 year ago

1. Yup, that would be super. Theoretically, the ggml model should be loaded once unless it's changed. Hopefully that helps make the app super snappy. ^^

2. Is this a limitation of the Whisper lib itself? I hope we don't have to touch the Whisper lib itself to temporarily stop capturing.

BTW, there seem to be two versions of the Whisper libs: one Cpp and the other Net. I suppose Cpp is faster but Net is easier to work with. I did a quick test of the Net version and it seems to be quite fast for short sentences. Unless you are transcribing a movie, the perf difference will not matter much, in my opinion.
It might be worthwhile to consider the Net version if the Cpp version gives us headaches. My two cents. ^^

chrisk414 commented 1 year ago

Update:

1: I did a quick implementation for changing the STT language on the fly.

The trick is to recreate the CaptureThread. I'm not certain this is the right way to change the language, but hey, it works: I can change the language without restarting STT, and it's quite nice.

Please take a look at what I did. I hope it makes it into the next patch with a better implementation.

Search for "chrisk" to find where I made the changes. I only made changes for Whisper, as I don't know how it works for the others. FYI, it will need some safeguards around when the language can be changed: it throws an error if you try to change the language while STT is not started.
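
One possible shape for the safeguard mentioned above is to track whether STT is running and to defer the language change when it is not. This is only a sketch under that assumption: `LanguageSwitchGuard`, `OnStarted`, `OnStopped`, `TrySetLanguage`, and `TakePendingLanguage` are hypothetical names, and the `applyNow` callback stands in for whatever call actually switches the language.

```csharp
using System;

// Hypothetical guard (not part of TTS-Voice-Wizard): applies a language change
// only while recognition is running, and remembers it otherwise.
public static class LanguageSwitchGuard
{
    private static readonly object Sync = new object();
    private static bool running;
    private static string pendingLanguage = "";

    public static void OnStarted() { lock (Sync) { running = true; } }

    public static void OnStopped() { lock (Sync) { running = false; } }

    // Returns true if the change was applied immediately, false if it was
    // stored for the next start because recognition is not running.
    public static bool TrySetLanguage(string languageCode, Action<string> applyNow)
    {
        lock (Sync)
        {
            if (!running)
            {
                pendingLanguage = languageCode;
                return false;
            }
        }

        applyNow(languageCode);
        return true;
    }

    // Returns the deferred language (empty string if none) and clears it.
    public static string TakePendingLanguage()
    {
        lock (Sync)
        {
            string value = pendingLanguage;
            pendingLanguage = "";
            return value;
        }
    }
}

public static class LanguageSwitchGuardDemo
{
    public static void Main()
    {
        // Not started yet: the change is deferred instead of throwing.
        bool applied = LanguageSwitchGuard.TrySetLanguage("de", c => Console.WriteLine("switch to " + c));
        Console.WriteLine(applied);

        LanguageSwitchGuard.OnStarted();
        applied = LanguageSwitchGuard.TrySetLanguage("ja", c => Console.WriteLine("switch to " + c));
        Console.WriteLine(applied);
    }
}
```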

from VoiceWizardWindows.cs

  {
      if (comboBoxTranslationLanguage.SelectedItem != null && comboBoxSpokenLanguage.SelectedItem != null)
      {
          // Get the language code from the selected spoken language
          string spokenLanguageCode = comboBoxSpokenLanguage.SelectedItem.ToString().Substring(0, comboBoxSpokenLanguage.SelectedItem.ToString().IndexOf(' '));

          // Get the language code from the selected translation language
          string translationLanguageCode = comboBoxTranslationLanguage.SelectedItem.ToString().Substring(0, comboBoxTranslationLanguage.SelectedItem.ToString().IndexOf(' '));

          // Check if the selected spoken language is the same as the selected translation language
          if (spokenLanguageCode == translationLanguageCode)
          {
              // Set the translation language to position 0 (no translation)
              comboBoxTranslationLanguage.SelectedIndex = 0;
          }

          // added: change language -chrisk
          switch (comboBoxSTT.Text.ToString())
          {
              case "Whisper":
                  WhisperRecognition.setLanguage(comboBoxSpokenLanguage.SelectedItem.ToString());
                  break;
          }
      }
  }

from WhisperRecognition.cs

// changed to static here -chrisk
        static CommandLineArgs cla;
        static Whisper.Context context;
        static iAudioCapture captureDev;
        static CaptureThread thread;

        // added: set language -chrisk
        public static void setLanguage(string language)
        {
            fromLanguageID(language);
            eLanguage? elang = Library.languageFromCode(langcode);
            if (elang != null)
            {
                CaptureThread.stopWhisper();
                context.parameters.language = (eLanguage)elang;

                new CaptureThread(cla, context, captureDev);
                thread.join();
            }
        }

        public static int doWhisper(string[] args)
        {
            try
            {
                try
                {
                    cla = new CommandLineArgs(args); // modified -chrisk

                }
                catch (OperationCanceledException)
                {
                    return 1;
                }

                using iMediaFoundation mf = Library.initMediaFoundation();
                CaptureDeviceId[] devices = mf.listCaptureDevices() ??
                    throw new ApplicationException("This computer has no audio capture devices");

                if (cla.captureDeviceIndex < 0 || cla.captureDeviceIndex >= devices.Length)
                    throw new ApplicationException($"Capture device index is out of range; the valid range is [ 0 .. {devices.Length - 1} ]");

                sCaptureParams cp = new sCaptureParams();
                try
                {

                    // Parse with the invariant culture; otherwise, for some languages like German, 8.0 will be converted to 80 because they use "," instead of "."
                    cp.minDuration = (float)Convert.ToDouble(VoiceWizardWindow.MainFormGlobal.textBoxWhisperMinDuration.Text.ToString(), CultureInfo.InvariantCulture); // 1
                    cp.maxDuration = (float)Convert.ToDouble(VoiceWizardWindow.MainFormGlobal.textBoxWhisperMaxDuration.Text.ToString(), CultureInfo.InvariantCulture); // 8
                    cp.dropStartSilence = (float)Convert.ToDouble(VoiceWizardWindow.MainFormGlobal.textBoxWhisperDropSilence.Text.ToString(), CultureInfo.InvariantCulture); // 250 ms
                    cp.pauseDuration = (float)Convert.ToDouble(VoiceWizardWindow.MainFormGlobal.textBoxWhisperPauseDuration.Text.ToString(), CultureInfo.InvariantCulture); // 1
                }
                catch (Exception ex)
                {
                    cp.minDuration = 1.0f;
                    cp.maxDuration = 8.0f;
                    cp.dropStartSilence = 0.25f;
                    cp.pauseDuration = 1.0f;
                    if (WhisperError == false)
                    {
                        OutputText.outputLog("[WARNING: Error Occured loading Whisper custom values. Forcing defaults]", System.Drawing.Color.DarkOrange);
                    }
                    WhisperError = true;
                    VoiceWizardWindow.MainFormGlobal.Invoke((MethodInvoker)delegate ()
                    {
                        VoiceWizardWindow.MainFormGlobal.textBoxWhisperMinDuration.Text = "1.0";
                        VoiceWizardWindow.MainFormGlobal.textBoxWhisperMaxDuration.Text = "8.0";
                        VoiceWizardWindow.MainFormGlobal.textBoxWhisperDropSilence.Text = "0.25";
                        VoiceWizardWindow.MainFormGlobal.textBoxWhisperPauseDuration.Text = "1.0";
                    });

                }

                if (cla.diarize)
                    cp.flags |= eCaptureFlags.Stereo;
                captureDev = mf.openCaptureDevice(devices[cla.captureDeviceIndex], cp); // modified -chrisk

                using iModel model = Library.loadModel(cla.model);
                context = model.createContext();  // modified -chrisk

                cla.apply(ref context.parameters);
                thread = new CaptureThread(cla, context, captureDev);
                thread.join();

                //context.timingsPrint();
                Debug.WriteLine("Whisper finished");
                return 0;
            }
            catch (Exception ex)
            {
                OutputText.outputLog("[Whisper Error: " + ex.Message.ToString() + "]", System.Drawing.Color.Red);
                OutputText.outputLog("[Whisper Setup Guide: https://github.com/VRCWizard/TTS-Voice-Wizard/wiki/Whisper ", System.Drawing.Color.DarkOrange);

                WhisperEnabled = false;

                if (VoiceWizardWindow.MainFormGlobal.rjToggleButtonOSC.Checked == true || VoiceWizardWindow.MainFormGlobal.rjToggleButtonChatBox.Checked == true)
                {
                    var sttListening = new OscMessage("/avatar/parameters/stt_listening", false);
                    OSC.OSCSender.Send(sttListening);
                }
                DoSpeech.speechToTextOffSound();

                return ex.HResult;
            }
        }
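
The comment in doWhisper about culture-invariant parsing can be illustrated in isolation. The sketch below compares parsing "8.0" under a German culture with parsing it under the invariant culture, and shows a `TryParse`-based helper that falls back to a default instead of relying on an exception. `ParseOrDefault` is a hypothetical helper, not something from the project, and the exact result of the German-culture parse may depend on the runtime's globalization data.

```csharp
using System;
using System.Globalization;

public static class CultureParseDemo
{
    // Hypothetical helper: parses a float with the invariant culture and
    // returns the fallback when the text is not a valid number.
    public static float ParseOrDefault(string text, float fallback)
    {
        return float.TryParse(text, NumberStyles.Float, CultureInfo.InvariantCulture, out float value)
            ? value
            : fallback;
    }

    public static void Main()
    {
        CultureInfo german = CultureInfo.GetCultureInfo("de-DE");

        // Under de-DE, "." is a group separator rather than the decimal point,
        // which is the mis-parse the doWhisper comment warns about.
        double germanResult = Convert.ToDouble("8.0", german);
        double invariantResult = Convert.ToDouble("8.0", CultureInfo.InvariantCulture);

        Console.WriteLine(germanResult.ToString(CultureInfo.InvariantCulture));
        Console.WriteLine(invariantResult.ToString(CultureInfo.InvariantCulture));

        // Invalid input falls back to the supplied default.
        Console.WriteLine(ParseOrDefault("not a number", 1.0f).ToString(CultureInfo.InvariantCulture));
    }
}
```
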

chrisk414 commented 1 year ago

Hi, how's it going? I was wondering if this is something that you are looking into. I would appreciate it if we could change STT parameters without reloading the model. Reloading the model can take quite a while depending on the machine, and I think it should be loaded only once at startup (or when changing the model itself). I would appreciate it if this made it into the official tree so that I don't have to patch it every time I download new code. Many thanks. -chris

VRCWizard commented 1 year ago

The implementation is essentially the same as stopping and starting Whisper. The majority of the startup time is spent creating a new CaptureThread, and this implementation does not circumvent that. I've done timed tests, and the difference in startup time is marginal (less than a second).

That being said, the benefit of this feature is that you wouldn't need to manually restart Whisper after switching languages (it would be restarted automatically). I'll likely add it in the next update.

chrisk414 commented 1 year ago

Sure, it will definitely help make the experience better until we find out what makes it take so long to change seemingly simple parameters. Many thanks.