WebAudio / web-audio-api

The Web Audio API v1.0, developed by the W3C Audio WG
https://webaudio.github.io/web-audio-api/

AudioWG virtual F2F: AudioOutputContext #2478

Closed guest271314 closed 1 year ago

guest271314 commented 2 years ago

AudioWG virtual F2F:

Originally posted by @padenot in https://github.com/WebAudio/web-audio-api-v2/issues/106#issuecomment-846093297

guest271314 commented 2 years ago

Describe the feature Route system or specific application audio output through a dedicated audio context intended to be utilized for this purpose.

Is there a prototype? Yes. https://github.com/guest271314/captureSystemAudio/tree/master/native_messaging/capture_system_audio.

In pertinent part re Notification

async function nativeMessageStream() {
  return new Promise(async (resolve) => {
    // permissions.query() resolves with a PermissionStatus object, while
    // Notification.requestPermission() resolves with a plain string,
    // hence the dual check below.
    let permission = await navigator.permissions.query({
      name: 'notifications',
    });
    if (permission.state !== 'granted') {
      permission = await Notification.requestPermission();
    }
    if (permission.state === 'granted' || permission === 'granted') {
      const captureSystemActionNotification = new Notification('Save file?', {
        body: `Click "Activate" to capture system audio.`,
      });
      captureSystemActionNotification.onclick = async (e) => {
        onmessage = (e) => {
          if (e.origin === this.src.origin) {
            if (!this.source) {
              this.source = e.source;
            }
            if (e.data === 1) {
              this.source.postMessage(
                { type: 'start', message: this.stdin },
                '*'
              );
            }
            if (e.data === 0) {
              document
                .querySelectorAll(`[src="${this.src.href}"]`)
                .forEach((iframe) => {
                  document.body.removeChild(iframe);
                });
              onmessage = null;
            }
            if (e.data instanceof ReadableStream) {
              this.stdout = e.data;
              resolve(this.captureSystemAudio());
            }
          }
        };
        this.transferableWindow = document.createElement('iframe');
        this.transferableWindow.style.display = 'none';
        this.transferableWindow.name = location.href;
        this.transferableWindow.src = this.src.href;
        document.body.appendChild(this.transferableWindow);
      };
    }
  }).catch((err) => {
    throw err;
  });
}
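The dual check in the snippet above exists because navigator.permissions.query() resolves with a PermissionStatus object while Notification.requestPermission() resolves with a plain string. A small helper can normalize both shapes (the name isGranted is hypothetical, not part of any API):

```javascript
// Normalize the two shapes a notification-permission result can take:
// a PermissionStatus-like object ({ state: 'granted' }) or a bare string.
function isGranted(permission) {
  if (typeof permission === 'string') return permission === 'granted';
  return permission?.state === 'granted';
}
```

With this helper, the condition in the snippet reduces to if (isGranted(permission)).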

Describe the feature in more detail User performs user action, e.g., clicking a Notification to capture system audio output. Entire system audio output is routed through a dedicated MediaStreamAudioDestinationNode or dedicated AudioWorkletProcessor.

User calls stop() on the MediaStreamAudioDestinationNode's stream track, or returns false from process(), to stop system audio capture.
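A minimal sketch of the second mechanism, an AudioWorkletProcessor that stops capture by returning false from process(). The processor name and the stop-message shape are assumptions for illustration; the fallback base class exists only so the logic can run outside a real AudioWorklet scope, where the browser provides AudioWorkletProcessor and registerProcessor:

```javascript
// Outside an AudioWorkletGlobalScope, stub the base class so the logic is testable.
const BaseProcessor =
  globalThis.AudioWorkletProcessor ??
  class {
    constructor() {
      this.port = { onmessage: null };
    }
  };

class SystemAudioProcessor extends BaseProcessor {
  constructor() {
    super();
    this.stopped = false;
    // The main thread posts { type: 'stop' } to end capture (assumed message shape).
    this.port.onmessage = (e) => {
      if (e.data?.type === 'stop') this.stopped = true;
    };
  }
  process(inputs, outputs) {
    // Returning false tells the audio engine this node can be torn down.
    if (this.stopped) return false;
    const input = inputs[0] ?? [];
    const output = outputs[0] ?? [];
    // Pass the captured audio through, channel by channel.
    for (let ch = 0; ch < input.length; ch++) output[ch]?.set(input[ch]);
    return true; // keep processing
  }
}

if (globalThis.registerProcessor) {
  registerProcessor('system-audio-processor', SystemAudioProcessor);
}
```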

Prior feature request in web-audio-api-v2: https://github.com/WebAudio/web-audio-api-v2/issues/106.

https://github.com/WebAudio/web-audio-api-v2/issues/106#issuecomment-904246686

@PaulFidika

capture user-audio on this machine, then stream or record it

let users select in their browser what audio device they want their music to play out of

These two requests are different; the latter is being tracked by #10, and the WG's response to the former request is laid out in https://github.com/WebAudio/web-audio-api-v2/issues/106#issuecomment-846093297. However, please feel free to open a new issue if you have different thoughts.

This is implementable on Linux, Windows, and macOS https://www.buildtoconnect.com/help/how-to-record-system-audio.

No movements elsewhere to implement this. Only piecemeal issues here and there that become circular references, e.g., https://github.com/w3c/mediacapture-output/issues/125.

I added the Notification to the beginning of the code to demonstrate such a requirement is trivial to implement at the implementer source code level.

Thus, there should be no issue specifying and implementing an AudioOutputContext. Whether or not browser source code authors implement the specification is a separate matter. The technology exists to achieve the requirement.

hoch commented 2 years ago

As pointed out above, the WG still believes this is out of the scope of Web Audio API.

Is there any specific reason to create a new issue instead of following up on the existing one?

guest271314 commented 2 years ago

As pointed out above, the WG still believes this is out of the scope of Web Audio API.

Currently this is out of scope for all W3C working groups. Nobody has specified nor implemented system audio output capture. Every group passes the buck to a different group and nothing ever gets done to realize this.

Is there any specific reason to create a new issue instead of following up on the existing one?

There is no existing issue that I am aware of. https://github.com/WebAudio/web-audio-api-v2/ is closed and archived.

guest271314 commented 2 years ago

As pointed out above, the WG still believes this is out of the scope of Web Audio API.

The Web Audio working group will not be stepping on the toes of the Media Capture and Streams working group by specifying and implementing this. I already filed multiple issues and PRs in Media Capture and Streams and nothing got done.

I am not certain what the problem is with just specifying and implementing this here.

guest271314 commented 2 years ago

This specification https://github.com/w3c/mediacapture-output does not capture speakers or headphones, and this PR https://github.com/w3c/mediacapture-output/pull/128 does not change that fact, because 'audiooutput' does not mean system audio output in Media Capture Output. Further, the quality of audio captured from monitor devices via WebRTC streams and getUserMedia() (see "Chromium does not support capture of monitor devices by default" #17) is subpar. We need the raw PCM to get quality streams and recordings.

Currently no W3C specification or recommendation, including https://github.com/w3c/mediacapture-screen-share, handles capturing system audio output. That is why this answer https://stackoverflow.com/a/70665493 at "How to capture generated audio from window.speechSynthesis.speak() call?" records only silence, and the Chrome bug/feature request filed https://bugs.chromium.org/p/chromium/issues/detail?id=1291146 is, unbeknownst to the OP of the issue, a duplicate of existing bugs https://bugs.chromium.org/p/chromium/issues/detail?id=1185527 and specification issues https://lists.w3.org/Archives/Public/public-speech-api/2017Jun/0000.html, https://github.com/WICG/speech-api/issues/69, https://github.com/WebAudio/web-audio-api/issues/1764.

Again, I just see people and groups passing the buck to each other and years go by without anything actually getting done for the use cases of capturing speech synthesis audio output and entire system audio output to speakers and headphones.

I demonstrated how trivial it is to just use Notification for prompt. I see no reason to not specify this here, in the domain of web audio. The technology exists to achieve the goal, the will to do so evidently does not.

guest271314 commented 2 years ago

A prudent step at this stage is to simply ask @jan-ivar if specifying system audio output capture here in Web Audio API would be intruding into the domains of Media Capture and Streams, Media Capture Screen Share, or Media Capture Output?

(I don't see how that could rationally be the case where none of those specifications have the clear goal of producing that deliverable).

padenot commented 2 years ago

Is there anything wrong with getDisplayMedia({audio: true}) ?

https://w3c.github.io/mediacapture-screen-share/#dom-mediadevices-getdisplaymedia

guest271314 commented 2 years ago

Is there anything wrong with getDisplayMedia({audio: true}) ?

https://w3c.github.io/mediacapture-screen-share/#dom-mediadevices-getdisplaymedia

Well, that alone won't work. You still need to request video and then use removeTrack(). Even then it will not capture the output of window.speechSynthesis.speak(), and the quality is subpar (on Chromium/Chrome 101) compared to the raw PCM that is fed to AudioWorkletProcessor.process().
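The video-then-removeTrack() workaround described above might look like the following sketch. It is browser-only, so it is shown as an uninvoked function; captureTabAudio is a hypothetical name, and the behavior assumes current Chromium, where getDisplayMedia() rejects an audio-only request:

```javascript
// Sketch: obtain an audio-only MediaStream from getDisplayMedia() by
// requesting video as well (required by Chromium), then dropping the
// video track. Browser-only; defined but not invoked here.
async function captureTabAudio() {
  const stream = await navigator.mediaDevices.getDisplayMedia({
    video: true,
    audio: true,
  });
  for (const videoTrack of stream.getVideoTracks()) {
    videoTrack.stop(); // stop capturing pixels
    stream.removeTrack(videoTrack); // keep only the audio track
  }
  return stream; // audio-only MediaStream (tab audio at best on Linux/macOS)
}
```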

I contacted the author of this answer https://stackoverflow.com/a/70665493 at "How to capture generated audio from window.speechSynthesis.speak() call?" to let them know their code only records silence. Try it for yourself.

guest271314 commented 2 years ago

I will point out Mozilla does not have this long-standing issue https://bugs.chromium.org/p/chromium/issues/detail?id=1185527, which has several root causes. Thus, for interoperability, we need to just specify and implement system audio output capture to avoid ambiguity and finally resolve this issue.

guest271314 commented 2 years ago

On Chromium/Chrome, speechSynthesis.speak() does not output audio through the "Tab" itself (Issue 1107210: "Speech Synthesis isn't wired up to 'Audio is playing' tab icons"). If Google voices are used, which rely on a Native Client executable to process text and voice, the playback source is Google voice; if speech-dispatcher is used with your own speech synthesis engine, e.g., eSpeak NG, then speech-dispatcher-<module> is used for playback; again, not output via the Tab itself.

With PulseAudio we can use --monitor-stream=N to capture specific application output, see https://bugs.chromium.org/p/chromium/issues/detail?id=1136480#c9, which is what I did here https://github.com/guest271314/setUserMediaAudioSource#usage using the deprecated QuicTransport (now WebTransport):

navigator.mediaDevices
  .getUserMedia({ audio: true })
  .then(async (stream) => {
    const [track] = stream.getAudioTracks();
    const _sources = await setUserMediaAudioSource(
      'get-audio-sources'
    );
    const _source_outputs = await setUserMediaAudioSource(
      'get-audio-source-outputs'
    );
    return { _sources, _source_outputs };
  })
  .then(async ({ _sources, _source_outputs }) => {
    const __source_outputs = _source_outputs.match(
      /(?<=Source\sOutput\s#)\d+|(?<=Sample\sSpecification:\s).*$|(?<=\s+(media|application)(\.name|\.process\.binary)\s=\s").*(?="$)/gm
    );
    const __sources = _sources.match(
      /(?<=Source\s#)\d+|(?<=(Name|Description):\s+).*$/gm
    );
    // Accumulators for parsed PulseAudio source outputs and sources.
    const source_outputs = [];
    const sources = [];
    do {
      const [
        index,
        sample_specification,
        media_name,
        application_name,
        application_process_binary,
      ] = __source_outputs.splice(0, 5);
      source_outputs.push({
        index,
        sample_specification,
        media_name,
        application_name,
        application_process_binary,
      });
    } while (__source_outputs.length);
    do {
      const [index, name, description] = __sources.splice(0, 3);
      sources.push({ index, name, description });
    } while (__sources.length);

    return setUserMediaAudioSource([
      source_outputs.find(
        ({ application_process_binary }) =>
          application_process_binary === 'chrome'
      ).index,
      sources.find(
        ({ description }) =>
          description === 'Monitor of Built-in Audio Analog Stereo'
      ).index,
    ]);
  })
  .then(console.log)
  .catch(console.error);

so, yes, we need the capability to select specific outputs to capture, and entire system audio capture - not just the output on the Tab.

This is all doable. I already did it, several ways. The most reliable, which is what I use, is https://github.com/guest271314/captureSystemAudio/tree/master/native_messaging/capture_system_audio. I substituted a C++ Native Messaging host for the Python host, which reduced memory use from 12-13 MB to 4-5 MB.
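The pactl output parsing in the snippet above can be exercised in isolation. The sample text below is illustrative, not captured from a real system; it shows how the splice(0, 5) loop groups the flat regex match list into records:

```javascript
// Canned `pactl list source-outputs`-style text (illustrative sample).
const sample = [
  'Source Output #72',
  '\tSample Specification: float32le 2ch 44100Hz',
  '\t\tmedia.name = "Playback"',
  '\t\tapplication.name = "Chromium"',
  '\t\tapplication.process.binary = "chrome"',
].join('\n');

// Same regex as the snippet above: one flat list of 5 matches per source output.
const matches = sample.match(
  /(?<=Source\sOutput\s#)\d+|(?<=Sample\sSpecification:\s).*$|(?<=\s+(media|application)(\.name|\.process\.binary)\s=\s").*(?="$)/gm
);

// Group every 5 consecutive matches into one record.
const source_outputs = [];
do {
  const [
    index,
    sample_specification,
    media_name,
    application_name,
    application_process_binary,
  ] = matches.splice(0, 5);
  source_outputs.push({
    index,
    sample_specification,
    media_name,
    application_name,
    application_process_binary,
  });
} while (matches.length);

console.log(source_outputs[0].application_process_binary); // logs "chrome"
```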

guest271314 commented 2 years ago

Additionally, getDisplayMedia({video: true, audio: true}) only captures the Tab. When I run mpv video.mkv, where both audio and video are output, getDisplayMedia({video: true, audio: true}) only captures the application window: on Chromium 101 a single FocusableMediaStreamTrack, on Firefox 98 a single MediaStreamTrack of kind "video". getDisplayMedia() doesn't meet the requirements of the use cases.

padenot commented 1 year ago

It only captures the tab on Linux and macOS, but captures the computer audio on Windows and Chrome OS. It's an OS limitation, there's not much spec work to do.

guest271314 commented 1 year ago

It only captures the tab on Linux and macOS, but captures the computer audio on Windows and Chrome OS. It's an OS limitation, there's not much spec work to do.

Can you clarify what you are talking about, getDisplayMedia()? There is no OS limitation for capturing entire system audio or a specific audio device, e.g., speech-dispatcher, on Linux. Any limitation in Chrome is self-imposed by the Chrome authors. Spec-wise, we certainly can spell out capabilities that exist, which will further expose that the Chrome authors simply refuse to implement them right now.

guest271314 commented 1 year ago

It's an OS limitation, there's not much spec work to do.

To demonstrate there is no limitation on Linux, rather a self-imposed Chrome limitation see https://github.com/guest271314/captureSystemAudio#pulseaudio-module-remap-source, https://aweirdimagination.net/2020/07/19/virtual-microphone-using-gstreamer-and-pulseaudio/

pactl load-module module-remap-source \
  master=@DEFAULT_MONITOR@ \
  source_name=virtmic source_properties=device.description=Virtual_Microphone

const devices = await navigator.mediaDevices.enumerateDevices();
const device = devices.find(({ label }) => label === 'Virtual_Microphone');
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    deviceId: {
      exact: device.deviceId,
    },
    echoCancellation: false,
    noiseSuppression: false,
    autoGainControl: false,
    channelCount: 2,
  },
});
const [track] = stream.getAudioTracks();
console.log(devices, track.label, track.getSettings(), track.getConstraints());
// do stuff with remapped monitor device
const recorder = new MediaRecorder(stream);
recorder.ondataavailable = (e) => console.log(URL.createObjectURL(e.data));
recorder.onstop = () => recorder.stream.getAudioTracks()[0].stop();
recorder.start();
setTimeout(() => recorder.stop(), 10000);
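Selecting the remapped device from an enumerateDevices() listing can be factored into a small pure helper. findAudioInputByLabel is a hypothetical name, and note that device labels are only populated after the page has been granted capture permission, so the helper may return undefined until a getUserMedia() call succeeds:

```javascript
// Find an audio input device by its label in an enumerateDevices()-shaped list.
// Returns undefined when no matching device (or no label) is available.
function findAudioInputByLabel(devices, label) {
  return devices.find((d) => d.kind === 'audioinput' && d.label === label);
}
```

Usage against a mock device list: findAudioInputByLabel(devices, 'Virtual_Microphone') yields the device whose deviceId can then be passed as the exact constraint shown above.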
guest271314 commented 1 year ago

I haven't tried this on Mac; others have faced this: https://github.com/edisionnano/Screenshare-with-audio-on-Discord-with-Linux#prologue

Screensharing with desktop audio was recently fixed on Mac OS on the official client only through a proprietary hack since getting desktop audio on Mac isn't easy and you need something like Soundflower to interact with the kernel and electron/chromium don't have such functionality.

padenot commented 1 year ago

Yes, it's possible on Linux desktop as well, but not possible without third-party software on macOS.

Anyway, there is nothing missing on the web platform for this feature, I'm closing this.

guest271314 commented 1 year ago

Anyway, there is nothing missing on the web platform for this feature

Really?

Then what Web API implements this?

padenot commented 1 year ago

getDisplayMedia({audio: true}) allows capturing the device's output, as you note.

guest271314 commented 1 year ago

No, it does not. It only captures the Tab, not entire system audio.

guest271314 commented 1 year ago

We cannot capture, for example, Web Speech API output on that same Tab.

There is plenty of third-party software in Firefox source code, and in Chrome.

I have no idea why you closed this.

guest271314 commented 1 year ago

getDisplayMedia({audio: true})

throws without video: true.

You are closing an issue where no Web API solution exists.

And citing macOS as the rationale. Well, write the specification and force macOS to answer the question and get on board. Happens all the time.

I don't understand your reasoning for closing this.