WICG / speech-api

Web Speech API
https://wicg.github.io/speech-api/

Support SpeechSynthesis *to* a MediaStreamTrack #69

Open cwilso opened 4 years ago

cwilso commented 4 years ago

It would be very helpful to be able to get a stream of the output of SpeechSynthesis.

For explicit use cases, I would like to:

(This is similar/inverse/matching/related feature to #66.)

guest271314 commented 4 years ago

This is possible to an appreciable degree using the approach in https://github.com/guest271314/SpeechSynthesisRecorder. To make the change concrete and direct, adjustments can be made to the speechd socket connection (https://github.com/brailcom/speechd and, to the degree necessary, spd-conf in python3-speechd); see https://stackoverflow.com/questions/48219981/how-to-programmatically-send-a-unix-socket-command-to-a-system-server-autospawne.

The code

async function SSMLStream({ssml="", options=""}) {
  const fd = new FormData();
  fd.append("ssml", ssml);
  fd.append("options", options);

  // fetch() resolves to a Response; its body is the WAV audio written by espeak
  const response = await fetch("speak.php", {method:"POST", body:fd});
  return response.arrayBuffer();
}

let ssml = `<speak version="1.0" xml:lang="en-US"> 
             Here are <say-as interpret-as="characters">SSML</say-as> samples. 
             Hello universe, how are you today? 
             Try a date: <say-as interpret-as="date" format="dmy" detail="1">10-9-1960</say-as> 
             This is a <break time="2500ms" /> 2.5 second pause. 
             This is a <break /> sentence break <break />
             <voice name="us-en+f3" rate="x-slow" pitch="0.25">espeak using</voice> 
             PHP and <voice name="en-us+f2"> <sub alias="JavaScript">JS</sub></voice>
           </speak>`;

SSMLStream({ssml, options:"-v en-us+f1"})
.then(async(data) => {

    let context = new AudioContext();
    let source = context.createBufferSource();
    source.buffer = await context.decodeAudioData(data);
    source.connect(context.destination);
    source.start()

})
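// PHP
<?php 
  if(isset($_POST["ssml"])) {
    header("Content-Type: audio/x-wav");
    // escapeshellcmd() neutralizes shell metacharacters while still allowing several flags
    $options = escapeshellcmd(isset($_POST["options"]) ? $_POST["options"] : "");
    // escapeshellarg() quotes the SSML so an embedded single quote cannot break the command
    echo shell_exec("espeak -m --stdout " . $options . " " . escapeshellarg($_POST["ssml"]));
  };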

At the command line we can currently do

espeak-ng -m --stdout > output && ~/dataurl output

where dataurl is a bash script which converts a file to a data URL
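
The dataurl script itself is not shown in this thread. As a rough stand-in, the same conversion can be sketched in Node.js. The function name toDataURL and the default MIME type are assumptions made for illustration, not part of the original script.

```javascript
// Hypothetical stand-in for the `dataurl` bash script mentioned above.
// Encodes raw bytes (e.g. the WAV data written by espeak-ng) as a data URL.
function toDataURL(bytes, mimeType = "audio/wav") {
  const base64 = Buffer.from(bytes).toString("base64");
  return `data:${mimeType};base64,${base64}`;
}

// Example: four bytes stand in for the contents of the `output` file.
console.log(toDataURL(Buffer.from("RIFF")));
```

In practice the bytes would come from reading the file that espeak-ng wrote, for example with fs.readFileSync("output").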

guest271314 commented 4 years ago

Technically, using navigator.mediaDevices.enumerateDevices() and selecting "audiooutput" should achieve the requirement; see https://github.com/WebAudio/web-audio-api/issues/1764#issuecomment-531680541.

cwilso commented 4 years ago

1) I don't dispute that you could use other speech synthesis engines via web sockets or the like; this is explicitly about the Web Speech API interfaces.

2) You can't use "audiooutput" (it's an output device, not an input device - so this doesn't work in any implementation I know of)

3) That approach would include ALL the sounds currently being played through the output - which would explicitly defeat the scenarios I suggested.

guest271314 commented 4 years ago

@cwilso

  1. Web Speech API in fact uses the binary installed on the local machine. Meaning the API is calling espeak or espeak-ng via speechd anyway.

  2. Have you followed the instructions at https://github.com/WebAudio/web-audio-api/issues/1764#issuecomment-531680541? When you plug in the headphones there is no audio output to "speakers". When you check the system sound settings you will see that output is managed by the socket connection.

  3. Agree. The linked code is a workaround. To make the change in the browsers' source code (both Mozilla and Chrome/Chromium utilize speechd, i.e. speech-dispatcher), change the parameters set at the socket connection. And there SHOULD be a means to select only the output of the speech engine, instead of any and all audio that is potentially being output by the browser. That appears to be what the related issue is describing: for example, setting the kind of the MediaStreamTrack to "speech" for speech synthesis and speech recognition, for disambiguation.

Kindly compose the specification to do just that so that these workarounds can be retired.

guest271314 commented 4 years ago

@cwilso BTW the maintainers of speechd are very astute and helpful. Given your pedigree, I am relatively certain they would assist by suggesting the necessary changes to the source code. The specification part is straightforward: provide the option to pipe audio output to a MediaStream.

cwilso commented 4 years ago

I think we're talking at cross purposes. If you think Web Speech should be built differently, dive in to that discussion - I think you're fundamentally saying "Web Speech shouldn't exist, we can already do this with speechd" - but is that really true across different OSes and systems? I think it would be good to have one, relatively simple API to do TTS. I am additionally suggesting here in this issue that you should be able to get a Media Stream of that output (rather than have it piped to audio output).

"Plug in some headphones to avoid audio output" is not a usable expectation for users (just like "install a loopback driver" is not a realistic expectation either, for a bunch of scenarios people have asked for in Web Audio). I'm not entirely sure what your implication is here, because I'm saying precisely this - we need the ability to pipe the stream of audio data from a speech utterance to a Media Stream. Doing that through a Web Socket connection set up to speechd with required client code running in the UI thread creating buffersource nodes and decoding and start(0)'ing audio files as they come in seems like a roundabout way of doing this.

guest271314 commented 4 years ago

At *nix, neither the Chrome/Chromium nor the Mozilla Firefox/Nightly implementations write their own "speech synthesis engines"; no speech synthesis engine is included in the source code (Windows and Mac may be different here). Kindly read the above-linked SO question carefully while cross-referencing the source code of the respective browsers.

The implementations of Web Speech API at the former browsers rely entirely on there being "speech synthesis engines" already installed on the local machine which speech-dispatcher executes.

That means that when there is no speech synthesis engine installed locally, Web Speech API alone does not perform any speech synthesis. AFAICT the specification does not currently mandate that either speech synthesis or speech recognition MUST be performed locally. Web Speech API is not a speech engine itself.

The same is true for speech recognition, perhaps save for Android and iOS handheld devices.

To change which "speech synthesis engines" are used you can execute spd-conf to select Mary, Flite, espeak (usually shipped by default at *nix distributions) etc., whatever speech synthesis engines are installed locally.
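
From the page side, one way to see which engines the browser can actually reach is to inspect the voice list. The following is a minimal sketch, assuming the voices come from speechSynthesis.getVoices() and that the localService attribute reflects whether synthesis happens on the local machine; the helper name summarizeVoices is made up for illustration.

```javascript
// Groups voices by whether the browser reports them as local or remote.
// `voices` is expected to be the array returned by speechSynthesis.getVoices().
function summarizeVoices(voices) {
  const local = voices.filter((voice) => voice.localService);
  const remote = voices.filter((voice) => !voice.localService);
  return {
    available: voices.length > 0,
    local: local.map((voice) => voice.name),
    remote: remote.map((voice) => voice.name),
  };
}

// Browser-only usage; guarded so the snippet can also be loaded outside a page.
if (typeof speechSynthesis !== "undefined") {
  console.log(summarizeVoices(speechSynthesis.getVoices()));
}
```

Note that getVoices() may return an empty array until the voiceschanged event has fired, so an empty result on first call does not by itself prove that no engine is installed.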

"I am additionally suggesting here in this issue that you should be able to get a Media Stream of that output (rather than have it piped to audio output)."

Agree, that should be an option, or as you appear to suggest, default.

"Plug in some headphones to avoid audio output"

That was merely stated to verify that what is being recorded is not the microphone, but rather, audio output. If you open sound settings at *nix while speak() is being called you can observe that. If you close Chromium you might even observe that the socket is still open!

Yes, a roundabout way, though very possible. Native Messaging can also be utilized, to avoid having to interface with Web Speech API at all, as very few significant changes have been made since the specification was published. A WebSocket allows direct communication at any origin, e.g., at console and/or as a Snippet that can be run at any page.

A WebSocket approach which can be used to pipe output from calling the locally installed speech synthesis binary to a MediaStreamTrack https://medium.com/@martin.sikora/node-js-websocket-simple-chat-tutorial-2def3a841b61

Native Messaging requires loading the code at chrome: protocol.

If you are trying to actually fix the specification, write the words that will do just that.

If you are trying to achieve the requirement in spite of the current specification, options are available, e.g. at the front-end you can use meSpeak.js https://stackoverflow.com/questions/38727696/generate-audio-file-with-w3c-web-speech-api.

If you are trying to do both you can achieve the expected result while composing the PR.

guest271314 commented 4 years ago

You do not have to create a buffer source. You can connect the live captured MediaStreamTrack to a media stream destination and/or AudioWorkletNode. Again, that code was for demonstration purposes only.
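
A minimal sketch of that routing, using only existing Web Audio calls (createMediaStreamSource and createMediaStreamDestination). The function name routeToStreamDestination is an assumption for illustration, and the input is assumed to be the live captured MediaStream rather than a decoded buffer.

```javascript
// Routes a captured MediaStream into a MediaStreamAudioDestinationNode
// instead of decoding it into an AudioBufferSourceNode.
// `audioContext` is an AudioContext; `stream` is the live captured MediaStream.
function routeToStreamDestination(audioContext, stream) {
  const source = audioContext.createMediaStreamSource(stream);
  const destination = audioContext.createMediaStreamDestination();
  source.connect(destination);
  // Not connected to audioContext.destination, so nothing reaches the speakers.
  const [track] = destination.stream.getAudioTracks();
  return { source, destination, track };
}
```

The returned track can then be handed to a MediaRecorder or a peer connection; an AudioWorkletNode could be inserted between source and destination in the same way.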

Pehrsons commented 4 years ago

I'll just note it'd be more fitting to have a MediaStreamTrack be the output rather than a MediaStream, unless there are requirements that the stream must be exposed early on, and output (tracks) must come and go throughout its lifetime.

It seems to me that speak(utterance) could return a MediaStreamTrack. Or some variant on that to maintain backwards compatibility.
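
To make the shape of that suggestion concrete, here is a hedged sketch of how calling code might look. It is hypothetical: no specification or browser has speak() returning a MediaStreamTrack, and the wrapper name speakAndGetTrack is invented for this example. The check on the return value is one way such a change could stay backwards compatible, since speak() returns undefined today.

```javascript
// HYPOTHETICAL: assumes a future speak() that returns a MediaStreamTrack.
// Today speak() returns undefined, so current browsers always yield null here.
function speakAndGetTrack(synth, utterance) {
  const result = synth.speak(utterance);
  const isTrack =
    result != null && typeof result === "object" && result.kind === "audio";
  return isTrack ? result : null;
}
```

Code written against the current behavior ignores the return value, so it would be unaffected by such a change.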

cwilso commented 4 years ago

@Pehrsons you're right, it would probably be more fitting to use MediaStreamTrack.

For the use cases I listed, I think it would make more sense to have a more long-lasting MediaStreamTrack than a single Utterance, and also it's critically important to NOT send that output to the main audio output as well. (E.g. this should be maybe an optional parameter to speak(), or a mode you set up via SpeechSynthesis.getMediaStreamTrack() (and release somehow when you're done)). Creating and destroying MediaStreamTracks for every utterance would seem to be costly and prone to causing audio artifacting.
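
A sketch of the long-lived variant, purely to illustrate the idea. SpeechSynthesis.getMediaStreamTrack() is only the name floated above; it does not exist in any specification or implementation, and the helper createSpeechTrackSession is likewise made up. The only real call assumed here is MediaStreamTrack.stop() as the way to release the track.

```javascript
// HYPOTHETICAL: getMediaStreamTrack() is the mode floated above; no browser implements it.
// One track is acquired up front and reused for every utterance.
function createSpeechTrackSession(synth) {
  if (typeof synth.getMediaStreamTrack !== "function") {
    return null; // current implementations end up here
  }
  const track = synth.getMediaStreamTrack();
  return {
    track,
    speak(utterance) {
      synth.speak(utterance); // assumed to feed the track, not the speakers
    },
    release() {
      track.stop(); // MediaStreamTrack.stop() is the existing way to end a track
    },
  };
}
```

With this shape, several utterances share one track, and the caller decides when the track ends by calling release().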

cwilso commented 4 years ago

(Changed title to reflect @Pehrsons' suggestion.)

Pehrsons commented 4 years ago

I think a long-lived MediaStreamTrack is fine as long as there is something in SpeechSynthesis making it end eventually (i.e., so garbage collection of SpeechSynthesis cannot be observed through the track's ended event).

That said,

As an implementer of mediacapture APIs in Firefox I don't think having multiple tracks is prone to cause audio artifacting. If there was, that would be a bad implementation of playback of MediaStreamTracks.

Whether creating and destroying tracks for every utterance is costly depends on perspective I guess. How long would an utterance be? Are we talking one track per ten seconds or hundreds per second? I assume the former. Garbage collecting lots of objects can be noticeable, but "lots" might have to be fairly high for that, even for a mobile device. Note: this is anecdotal, I don't have data to back it up.

Let's also not forget what performance impact a muted MediaStreamTrack might have. In Firefox it means we keep an audio stream open towards the OS because the track can become unmuted at any time (in other MediaStreamTrack APIs muted tends to be shortlived, i.e., it will be unmuted as soon as the connection is set up, the decoder has finished seeking, etc.). If there's a reference to (an idle) SpeechSynthesis object keeping the muted track alive, that might cause quite the power drain.

guest271314 commented 4 years ago

Until a speech synthesis engine is shipped with the browser and provides a means to get a MediaStreamTrack from the speech synthesis engine, the following approach can be utilized at a Native Messaging host or using a WebSocket. At Chromium a WebSocket connection to the local file system allows client code to be saved in Sources => Snippets and run from any origin (e.g., chrome-search://local-ntp) by right-clicking the name of the snippet and then selecting Run. Native Messaging requires the code to be loaded as an extension and run as an "app" at the extension URL.

At Chrome or Chromium open DevTools, select Sources, then select Snippets, click New snippet, then write the code in the center window and give the snippet a name, e.g., "ws-speak-mst":

const connection = new WebSocket("ws://127.0.0.1:8080", "echo-protocol");
connection.onmessage = async message => {
 try {            
   // message.data is a data URL
   const response = await (await fetch(message.data)).arrayBuffer();
   const ac = new AudioContext();
   const destination = ac.createMediaStreamDestination();
   const ab = await ac.decodeAudioData(response);
   const source = ac.createBufferSource();
   source.buffer = ab;
   source.connect(destination);
   source.connect(ac.destination);
   // MediaStreamTrack with media source being output from espeak-ng
   const [track] = destination.stream.getAudioTracks();
   // just to verify the track is outputting only the TTS audio
   const recorder = new MediaRecorder(new MediaStream([track]));
   recorder.ondataavailable = e => console.log(URL.createObjectURL(e.data));
   source.start();
   recorder.start();
   // stop() track
   source.onended = _ => (track.stop(), track.enabled = false, recorder.stop());
   } catch (e) {
     console.error(e);
   }
}
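// usage: the -w option writes a WAV file instead of outputting audio to speakers
// wait for the open event; calling send() while the socket is still connecting throws
connection.onopen = () => connection.send("espeak-ng -w speak.wav 'testing media stream track from espeak-ng'");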

The local code can be PHP, Python, bash, or another preferred programming language. In general, the same code can be run using WebSocket, a local server, or Native Messaging. For this example Node.js is used; in pertinent part:

const { exec } = require("child_process");
let connection = request.accept("echo-protocol", request.origin);
connection.on("message", message => {
  // message.utf8Data: "espeak-ng -w speak.wav 'testing media stream track from espeak-ng'"
  // caution: this runs whatever command the client sends; keep the server bound to 127.0.0.1
  exec(message.utf8Data, err => {
    if (err) return console.error(err);
    // convert .wav to .ogg with Vorbis codec (playable at Chromium, Firefox)
    // use FFmpeg, speex, etc. to convert WAV to the required codec, container
    exec("oggenc speak.wav -o speak.ogg && base64 speak.ogg", (err, stdout) => {
      if (err) return console.error(err);
      // send a data URL to the browser
      connection.send(`data:audio/ogg;base64,${stdout.trim()}`);
    })
  })
})

guest271314 commented 4 years ago

@cwilso This should capture only audio output, precisely when speechSynthesis.speak() is executed, without capturing any microphone input.

(async() => {
  const sink = document.createElement("video");
  document.body.appendChild(sink);
  sink.controls = sink.autoplay = true;
  navigator.mediaDevices.ondevicechange = e => console.log(e);
  const devices = await navigator.mediaDevices.enumerateDevices();
  const {
    deviceId
  } = devices.find(({
    kind, label
  }) => kind === "audiooutput");
  console.log(devices);
  let stream = await navigator.mediaDevices.getUserMedia({
    audio: {
      deviceId: {
        exact: deviceId
      }
    }
  });
  sink.srcObject = stream;
  console.log(devices, deviceId);
  const text = [...Array(10).keys()].join(" ");
  const handleVoicesChanged = async e => {
    const voice = speechSynthesis.getVoices().find(({
      name
    }) => name.includes("English"));
    const utterance = new SpeechSynthesisUtterance(text);
    utterance.voice = voice;
    utterance.pitch = 0.33;
    utterance.rate = 0.1;
    const recorder = new MediaRecorder(stream);
    recorder.start();
    speechSynthesis.speak(utterance);
    recorder.ondataavailable = async({
      data
    }) => {
      console.log(URL.createObjectURL(data));
    }
    utterance.onend = e => (recorder.stop(), stream.getAudioTracks()[0].stop());
  }
  speechSynthesis.onvoiceschanged = handleVoicesChanged;
  let voices = speechSynthesis.getVoices();
  if (voices.length) {
    handleVoicesChanged();
    console.log(voices);
  }

})().catch(console.error);

Firefox throws an OverconstrainedError when exact is used. However, Firefox does list

Monitor of Built-in Audio Analog Stereo
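
One way to pick out such an entry from the enumerateDevices() result is sketched below. This is an illustration under assumptions, not a tested recipe: it assumes the monitor entry is reported with kind "audioinput" and a label beginning with "Monitor of", and that labels are populated (browsers typically leave them empty until permission has been granted). The helper name findMonitorInput is made up.

```javascript
// Looks for an input device whose label marks it as a monitor of an output.
// ASSUMPTION: the monitor is exposed as kind "audioinput" with a "Monitor of" label.
function findMonitorInput(devices) {
  const match = devices.find(
    ({ kind, label }) => kind === "audioinput" && label.startsWith("Monitor of")
  );
  return match ?? null;
}

// Browser-only usage; guarded so the snippet can also be loaded outside a page.
if (typeof navigator !== "undefined" && navigator.mediaDevices) {
  navigator.mediaDevices
    .enumerateDevices()
    .then((devices) => console.log(findMonitorInput(devices)))
    .catch(console.error);
}
```

If a match is found, its deviceId could be passed to getUserMedia() in place of the "audiooutput" deviceId used in the code above.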