WICG / speech-api

Web Speech API
https://wicg.github.io/speech-api/

Client-side, Server-side and Third-party Speech Recognition, Synthesis and Translation #41

AdamSobieski opened this issue 5 years ago

AdamSobieski commented 5 years ago

Introduction

We can envision and consider client-side, server-side and third-party speech recognition, synthesis and translation scenarios for a next version of the Web Speech API.

Advancing the State of the Art

Speech Recognition

Beyond speech-to-text, speech recognition includes speech-to-SSML and speech-to-hypertext. With speech-to-SSML and speech-to-hypertext, a higher degree of fidelity becomes possible when round-tripping speech audio through speech recognition and synthesis components or services.
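
A minimal sketch of what a "post-text" recognition result and a round-tripping fidelity check might look like follows; nothing below exists in the current specification, and `recognizeToSsml`, `synthesizeFromSsml` and `acousticDistance` are purely hypothetical names.

```ts
// Hypothetical sketch only: today's SpeechRecognitionResult exposes plain
// transcripts. A richer result might additionally carry SSML and hypertext
// renderings of the same utterance, enabling higher-fidelity round-tripping.
interface RichRecognitionAlternative {
  transcript: string;   // plain text, as today
  ssml?: string;        // e.g. '<speak><prosody rate="fast">hello there</prosody></speak>'
  hypertext?: string;   // e.g. '<span lang="en" data-prosody-rate="fast">hello there</span>'
  confidence: number;
}

// Hypothetical components: recognize speech to SSML, synthesize it back, and
// compare the input and output audio with some acoustic metric.
declare function recognizeToSsml(audio: Blob): Promise<RichRecognitionAlternative>;
declare function synthesizeFromSsml(ssml: string): Promise<Blob>;
declare function acousticDistance(a: Blob, b: Blob): Promise<number>;

async function roundTripFidelity(input: Blob): Promise<number> {
  const result = await recognizeToSsml(input);
  const output = await synthesizeFromSsml(result.ssml ?? result.transcript);
  return acousticDistance(input, output); // lower distance = higher fidelity
}
```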

Speech Synthesis

Beyond text-to-speech, speech synthesis includes SSML-to-speech and hypertext-to-speech.

Translation

Translation scenarios include processing text, SSML, hypertext or audio in a source language into text, SSML, hypertext or audio in a target language.

Desirable features include interoperability between client-side, server-side and third-party translation and WebRTC with translations available as subtitles or audio tracks.
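
As a sketch of the subtitle side of this, translated captions can already be surfaced on a remote WebRTC video element with the standard TextTrack API; how the translated cues are produced (client-side, server-side or third-party) is left abstract here, and the `TranslationCue` shape is an assumption.

```ts
// Attach a subtitle track for translated captions to a (e.g. WebRTC) video element.
function attachTranslatedSubtitles(video: HTMLVideoElement, lang: string): TextTrack {
  const track = video.addTextTrack("subtitles", `Translated (${lang})`, lang);
  track.mode = "showing";
  return track;
}

// Assumed shape of a streamed translation result.
interface TranslationCue {
  startTime: number; // seconds, relative to the media timeline
  endTime: number;
  text: string;      // translated text in the target language
}

function addTranslatedCue(track: TextTrack, cue: TranslationCue): void {
  track.addCue(new VTTCue(cue.startTime, cue.endTime, cue.text));
}
```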

Multimodal Dialogue Systems

Interesting scenarios include Web-based multimodal dialogue systems which efficiently utilize client-side, server-side and third-party speech recognition, synthesis and translation.

Client-side Scenarios

Client-side Speech Recognition

These scenarios are considered in the current version of the Web Speech API.
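
For reference, client-side recognition with the current Web Speech API looks roughly like the following (Chrome exposes it under the `webkit` prefix, and its recognizer is server-backed, as noted in a comment further below).

```ts
// Client-side recognition with the current Web Speech API.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.lang = "en-US";
recognition.interimResults = true;
recognition.continuous = true;

recognition.onresult = (event: any) => {
  const result = event.results[event.results.length - 1];
  console.log(result[0].transcript, result.isFinal ? "(final)" : "(interim)");
};
recognition.onerror = (event: any) => console.error("recognition error:", event.error);

recognition.start();
```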

Client-side Speech Synthesis

These scenarios are considered in the current version of the Web Speech API.
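
For reference, client-side synthesis with the current Web Speech API takes plain text; SSML-to-speech and hypertext-to-speech, as discussed above, are not part of the current specification.

```ts
// Client-side synthesis with the current Web Speech API (plain text input only).
function speak(text: string, lang = "en-US"): void {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = lang;

  // Prefer a voice matching the requested language, if one is available.
  const voice = speechSynthesis.getVoices().find((v) => v.lang === lang);
  if (voice) utterance.voice = voice;

  utterance.onend = () => console.log("finished speaking");
  speechSynthesis.speak(utterance);
}

speak("Hello from the Web Speech API.");
```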

Client-side Translation

These scenarios are new to the Web Speech API and involve the client-side translation of text, SSML, hypertext or audio into text, SSML, hypertext or audio.
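
Since no client-side translation surface exists in the current specification, the following is a purely hypothetical interface sketch of what such an addition might expose; every name in it is an assumption.

```ts
// Purely hypothetical sketch; nothing below exists in any current specification.
type TranslatableFormat = "text" | "ssml" | "hypertext";

interface TranslationRequest {
  input: string;
  format: TranslatableFormat;
  sourceLanguage: string; // BCP 47 tag, e.g. "en-US"
  targetLanguage: string; // e.g. "de-DE"
}

interface TranslationResult {
  output: string;
  format: TranslatableFormat;
}

// Hypothetical entry point; a real proposal would need to define availability,
// permissions, and on-device versus remote behaviour.
declare function translate(request: TranslationRequest): Promise<TranslationResult>;
```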

Server-side Scenarios

Server-side Speech Recognition

These scenarios are new to the Web Speech API and involve one or more audio streams from a client being streamed to a server which performs speech recognition, optionally providing speech recognition results to the client.
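
This can be approximated today with standard APIs by forwarding encoded microphone chunks over a WebSocket; the endpoint URL and the JSON result format below are assumptions.

```ts
// Sketch: stream microphone audio to a recognition server with standard APIs.
async function streamToRecognizer(): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const socket = new WebSocket("wss://example.com/recognize"); // hypothetical endpoint

  socket.onmessage = (event) => {
    // Assumed server message shape: { transcript: string, isFinal: boolean }
    const result = JSON.parse(event.data);
    console.log(result.transcript, result.isFinal ? "(final)" : "(interim)");
  };

  const recorder = new MediaRecorder(stream, { mimeType: "audio/webm;codecs=opus" });
  recorder.ondataavailable = (event) => {
    if (event.data.size > 0 && socket.readyState === WebSocket.OPEN) {
      socket.send(event.data); // forward each encoded chunk to the server
    }
  };
  recorder.start(250); // emit a chunk roughly every 250 ms
}
```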

Server-side Speech Synthesis

These scenarios are new to the Web Speech API and involve a client sending text, SSML or hypertext to a server which performs speech synthesis and streams audio to the client.
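
A minimal sketch of this pattern with today's APIs: POST the SSML (or text) to a synthesis server and play the returned audio. The endpoint, the `X-Voice` header and the payload format are assumptions.

```ts
// Sketch: server-side synthesis, playing the audio returned by the server.
async function speakViaServer(ssml: string, voice: string): Promise<void> {
  const response = await fetch("https://example.com/synthesize", { // hypothetical endpoint
    method: "POST",
    headers: { "Content-Type": "application/ssml+xml", "X-Voice": voice },
    body: ssml,
  });
  if (!response.ok) throw new Error(`synthesis failed: ${response.status}`);

  const audioBlob = await response.blob();
  const audio = new Audio(URL.createObjectURL(audioBlob));
  await audio.play();
}
```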

Server-side Translation

These scenarios are new to the Web Speech API and involve a client sending text, SSML, hypertext or audio to a server for translation into text, SSML, hypertext or audio; a minimal sketch follows.
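
A minimal sketch of server-side translation of text, SSML or hypertext using fetch; the endpoint and the request/response shapes are assumptions. The third-party scenarios below follow the same pattern, differing mainly in the endpoint and credentials involved.

```ts
// Sketch: send content to a translation server and return the translated output.
async function translateOnServer(
  input: string,
  format: "text" | "ssml" | "hypertext",
  source: string,
  target: string
): Promise<string> {
  const response = await fetch("https://example.com/translate", { // hypothetical endpoint
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ input, format, source, target }),
  });
  const { output } = await response.json(); // assumed response: { output: string }
  return output;
}
```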

Third-party Scenarios

Third-party Speech Recognition

These scenarios are new to the Web Speech API and involve one or more audio streams from a client or server being streamed to a third-party service which performs speech recognition, providing speech recognition results to the client or server.

Third-party Speech Synthesis

These scenarios are new to the Web Speech API and involve a client or server sending text, SSML or hypertext to a third-party service which performs speech synthesis and streams audio to the client or server.

Third-party Translation

These scenarios are new to the Web Speech API and involve a client sending text, SSML, hypertext or audio to a third-party translation service for translation into text, SSML, hypertext or audio.

Hyperlinks

Amazon Web Services

Google Cloud AI

IBM Watson Products and Services

Microsoft Cognitive Services

Real Time Translation in WebRTC

guest271314 commented 5 years ago

One important item to note here, relevant to speech recognition, is that Chrome/Chromium currently records the user and sends the recording to a server, without any notification that clearly indicates that the user of the browser is being recorded and that their voice is being sent to an external server. It is also not clear whether the user's biometrics (their voice) are stored (forever) by the service; see https://bugs.chromium.org/p/chromium/issues/detail?id=816095

AdamSobieski commented 5 years ago

@guest271314 , thank you. By including these client-side, server-side and third-party scenarios in the design of standard APIs, and by more tightly integrating such standards and APIs with WebRTC, we can:

1. provide users with notifications and permissions with respect to which client-side, server-side and third-party components and services are accessing their microphones and their text, SSML, hypertext and audio streams;
2. produce efficient call graphs (see also: https://youtu.be/EPBWR_GNY9U?t=2m from 2:00 to 4:12);
3. reduce latency for real-time translation scenarios;
4. improve quality for real-time translation scenarios.

AdamSobieski commented 5 years ago

I’m hoping to inspire interest in post-text speech technology (speech-to-X1 and X2-to-speech) as well as interest in round-tripping where we can utilize acoustic measures and metrics to compare the audio input to and output from speech-to-X1-to-X2-to-speech.

X1, X2 could be SSML (1.0, 1.1 or 2.0), hypertext or new formats.

X1-to-X2 machine translation is also topical.

AdamSobieski commented 5 years ago

In the video Real Time Translation in WebRTC, the speaker indicates (at 7:48) that a major issue which he would like to see solved is that users have to pause their speech before speech recognition and translation occur.

Towards reducing latency, we can consider real-time online speech recognition algorithms which, instead of processing natural language sentence-by-sentence and outputting X1, process natural language lexeme-by-lexeme and produce event streams. In these low-latency approaches, speech recognition components and services process speech audio in real time and produce event streams, which are consumed by machine translation components; these in turn produce event streams consumed by speech synthesis components, which produce the resultant speech audio.
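
A sketch of such a pipeline, assuming WHATWG streams as the plumbing: the three stage implementations (`recognitionStream`, `translationStage`, `synthesisStage`) and the `LexemeEvent` shape are hypothetical; only the wiring is shown, and no stage waits for sentence boundaries.

```ts
// One recognized token, emitted as soon as it is available.
interface LexemeEvent {
  lexeme: string;      // the recognized token
  timestamp: number;   // seconds from the start of the audio
  language: string;    // BCP 47 tag
}

// Hypothetical stages: the recognizer emits LexemeEvents, the translator
// rewrites them into the target language, the synthesizer turns them into
// chunks of PCM samples.
declare function recognitionStream(audio: MediaStream): ReadableStream<LexemeEvent>;
declare function translationStage(target: string): TransformStream<LexemeEvent, LexemeEvent>;
declare function synthesisStage(): TransformStream<LexemeEvent, Float32Array>;

async function lowLatencyTranslation(mic: MediaStream, target: string): Promise<void> {
  const audioOut = recognitionStream(mic)
    .pipeThrough(translationStage(target))
    .pipeThrough(synthesisStage());

  // Consume synthesized audio chunks as soon as they are produced.
  const reader = audioOut.getReader();
  for (;;) {
    const { value, done } = await reader.read();
    if (done) break;
    // play or forward `value` (a chunk of PCM samples) here
  }
}
```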

AdamSobieski commented 5 years ago

This issue pertains to Issue 1 in the Web Speech API specification.

Issue 1: The group has discussed whether WebRTC might be used to specify selection of audio sources and remote recognizers. See Interacting with WebRTC, the Web Audio API and other external sources thread on public-speech-api@w3.org.