Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

Feature: Generate text to speech on demand #1892

Closed: TaylorN15 closed this issue 1 month ago

TaylorN15 commented 1 month ago

It seems that TTS is run for all answers automatically, and users are given the option to play the stream. Most of my users wouldn't use the feature, so we would be unnecessarily calling the Speech Service API in many cases. I'd like to be able to call this on demand only. I.e. a user clicks the play button on a message, and then it calls the API to synthesize the speech.
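For illustration, a minimal sketch of the on-demand flow being requested, where nothing is synthesized until the user clicks Play; the /speech endpoint name and request shape here are assumptions for the sketch, not necessarily this repo's API:

// Hypothetical on-demand flow: the Speech Service is called only when Play is clicked.
let cachedAudioUrl: string | null = null;

async function onPlayClicked(answerText: string): Promise<void> {
    // Synthesize at most once per answer, then reuse the cached audio.
    if (!cachedAudioUrl) {
        const response = await fetch("/speech", {
            method: "POST",
            headers: { "Content-Type": "application/json" },
            body: JSON.stringify({ text: answerText })
        });
        cachedAudioUrl = URL.createObjectURL(await response.blob());
    }
    await new Audio(cachedAudioUrl).play();
}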

pamelafox commented 1 month ago

I think we did this so that the experience feels fast when you actually click the button, but it's true that it's wasteful if most users aren't using it.

We need to decide if this is something all developers want, or if it's yet another option. I'm fine with making the change across the board, given that we should generally lean towards not wasting computational resources unnecessarily.

I don't think the speech icon has a proper loading state currently, just disabled/enabled states if I remember correctly, so it would need one.
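For reference, a rough sketch of what that three-state button could look like; SpeechButton, the synthesize prop, and the status names are hypothetical, not existing code in this repo:

import { useState } from "react";

type ButtonStatus = "idle" | "loading" | "playing";

// Hypothetical speech button with an explicit loading state: idle until clicked,
// "loading" while the audio is being synthesized, "playing" during playback.
export const SpeechButton = ({ synthesize }: { synthesize: () => Promise<HTMLAudioElement> }) => {
    const [status, setStatus] = useState<ButtonStatus>("idle");

    const onClick = async () => {
        if (status !== "idle") {
            return;
        }
        setStatus("loading"); // the state that's currently missing
        const audio = await synthesize();
        audio.onended = () => setStatus("idle");
        setStatus("playing");
        await audio.play();
    };

    return (
        <button onClick={onClick} disabled={status !== "idle"} title="Play">
            {status === "loading" ? "…" : "🔊"}
        </button>
    );
};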

@john0isaac Any interest in taking this on, given your familiarity with the TTS features?

john0isaac commented 1 month ago

Sure. @TaylorN15, I have a question to make sure I understand you correctly: when you enable the USE_SPEECH_OUTPUT_BROWSER feature, which relies on the Web Speech API, you don't want any request made to the Web Speech API until the speaker button is clicked, right? If that's what you want, I'll see what I can do.

TaylorN15 commented 1 month ago

@john0isaac - I mean in either case (Web Speech API or Azure Speech). I prefer using the Azure Speech SDK as I can use neural voices, but I don't want to call the API pre-emptively and generate TTS for all the answers if the user isn't even going to use the feature, as this incurs unnecessary costs. Right now, I have my own version implemented using react-text-to-speech and its useSpeech hook: when a user clicks a "Play" button under the response, it renders and plays using the Web Speech API. Basically, I'd like this same functionality, but using Azure Speech.

john0isaac commented 1 month ago

It was implemented like that because the answer card is a component; we pre-populate it with everything so that we don't have to maintain any state about it as the conversation moves forward. Let me see what I can do. Do you mind sharing your code here?

TaylorN15 commented 1 month ago

My code is highly customised. But here's the gist.

I've created an AnswerOptionsPanel component with some buttons/actions at the bottom of each Answer.

interface Props {
...
handleSpeechAction: (action: 'play' | 'pause' | 'stop' | 'restart') => void;
speechStatus: SpeechStatus;
...
}
...
{speechStatus === 'started' ? (
    <>
        <div
            className={isStreaming ? styles.disabled : ''}
            onClick={isStreaming ? undefined : () => handleSpeechAction("pause")}
            title="Pause">
            <Pause16Filled className={styles.adjustInputIcon} />
        </div>
        <div
            className={isStreaming ? styles.disabled : ''}
            onClick={isStreaming ? undefined : () => handleSpeechAction("stop")}
            title="Stop">
            <Stop16Filled className={styles.adjustInputIcon} />
        </div>
    </>
) : speechStatus === 'paused' ? (
    <>
        <div
            className={isStreaming ? styles.disabled : ''}
            onClick={isStreaming ? undefined : () => handleSpeechAction("play")}
            title="Play">
            <Play16Filled className={styles.adjustInputIcon} />
        </div>
        <div
            className={isStreaming ? styles.disabled : ''}
            onClick={isStreaming ? undefined : () => handleSpeechAction("stop")}
            title="Stop">
            <Stop16Filled className={styles.adjustInputIcon} />
        </div>
    </>
) : (
    <div
        className={isStreaming ? styles.disabled : ''}
        onClick={isStreaming ? undefined : () => handleSpeechAction("play")}
        title="Play">
        <Play16Filled className={styles.adjustInputIcon} />
    </div>
)}

Then in Answer.tsx

    // useSpeech is the hook from the react-text-to-speech package
    const [textForSpeech, setTextForSpeech] = useState<string>("");
    const [ttsAction, setTtsAction] = useState<string>("");
    const { speechStatus, start, stop, pause } = useSpeech({
        text: textForSpeech,
        pitch: 1,
        rate: 1,
        volume: 1,
        lang: "en-AU",
        voiceURI: "Microsoft James - English (Australia)",
    });

    const handleSpeechAction = (action: "play" | "pause" | "stop" | "restart") => {
        setTtsAction(action);
        const plainText = answerRef.current?.textContent || "";

        switch (action) {
            case "play":
                setTextForSpeech(plainText);
                break;
            case "pause":
                pause();
                break;
            case "stop":
                stop();
                break;
            case "restart":
                stop();
                setTtsAction("play");
                break;
            default:
                break;
        }
    };

    // "play" is deferred to this effect so that start() runs after textForSpeech has updated
    useEffect(() => {
        if (ttsAction === "play" && textForSpeech) {
            start();
            setTtsAction(""); // Reset ttsAction after starting
        }
    }, [textForSpeech, ttsAction]);
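To get the same on-demand behavior with Azure Speech, the handler could fetch synthesized audio from the backend only when Play is clicked and drive playback through an HTMLAudioElement instead of useSpeech. A sketch, assuming a useRef import and a backend endpoint like /speech that accepts { text } and returns audio (both assumptions, not a confirmed API):

    // Sketch: call the backend (Azure Speech) only on the first Play click,
    // then reuse the cached HTMLAudioElement for later actions.
    const azureAudioRef = useRef<HTMLAudioElement | null>(null);

    const playWithAzureSpeech = async () => {
        if (!azureAudioRef.current) {
            const plainText = answerRef.current?.textContent || "";
            const response = await fetch("/speech", {
                method: "POST",
                headers: { "Content-Type": "application/json" },
                body: JSON.stringify({ text: plainText })
            });
            const blob = await response.blob();
            azureAudioRef.current = new Audio(URL.createObjectURL(blob));
        }
        await azureAudioRef.current.play(); // pause() and currentTime = 0 cover pause/stop/restart
    };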

john0isaac commented 1 month ago

I traced the code, and I think you are definitely talking about the Azure Speech service, as the Web Speech API is never called unless you click the button. I will see what I can do for the Azure Speech option.

TaylorN15 commented 1 month ago

My code is quite different to this codebase. The sample I provided generates TTS only when you press the play button. I'm hoping for an option like that, but using Azure Speech instead.

john0isaac commented 1 month ago

I meant this repo's code, not your code. I was updating the issue's status: the only remaining change needed is to the Azure Speech service implementation.