Open · rotemdan opened this issue 1 year ago
Hi. First of all: very cool project. I tried to experiment with it, and to use it in a UI I needed a client for the browser. This is what I ended up doing, in case anybody else needs it:
```ts
import { encode as encodeMsgPack, decode as decodeMsgPack } from "msgpack-lite";
import {
  RequestVoiceListResult,
  SynthesisOptions,
  SynthesisSegmentEvent,
  SynthesisResult,
  VoiceListRequestOptions,
} from "echogarden/dist/api/Synthesis.js";
import { OpenPromise } from "echogarden/dist/utilities/OpenPromise.js";
import {
  AudioSourceParam,
  RawAudio,
} from "echogarden/dist/audio/AudioUtilities.js";
import {
  AlignmentOptions,
  AlignmentResult,
} from "echogarden/dist/api/Alignment.js";
import {
  RecognitionOptions,
  RecognitionResult,
} from "echogarden/dist/api/Recognition.js";
import {
  SpeechTranslationOptions,
  SpeechTranslationResult,
} from "echogarden/dist/api/Translation.js";
import {
  SpeechLanguageDetectionOptions,
  SpeechLanguageDetectionResult,
  TextLanguageDetectionOptions,
  TextLanguageDetectionResult,
} from "echogarden/dist/api/LanguageDetection.js";
import {
  SynthesisResponseMessage,
  SynthesisSegmentEventMessage,
  SynthesisSentenceEventMessage,
  VoiceListRequestMessage,
  VoiceListResponseMessage,
  AlignmentRequestMessage,
  AlignmentResponseMessage,
  RecognitionRequestMessage,
  RecognitionResponseMessage,
  SpeechTranslationRequestMessage,
  SpeechTranslationResponseMessage,
  SpeechLanguageDetectionRequestMessage,
  SpeechLanguageDetectionResponseMessage,
  TextLanguageDetectionResponseMessage,
  TextLanguageDetectionRequestMessage,
  SynthesisRequestMessage,
} from "echogarden/dist/server/Worker.js";

// Generates a random hex string, used below as a request ID.
// Note: Math.random() is not cryptographically secure, which is fine for
// correlating requests but shouldn't be relied on for anything sensitive.
function getRandomHexString(charCount = 32, upperCase = false) {
  if (charCount % 2 !== 0) {
    throw new Error(`'charCount' must be an even number`);
  }

  const randomBytes = (size: number) =>
    [...Array(size)]
      .map(() => Math.floor(Math.random() * 16).toString(16))
      .join("");

  let hex = randomBytes(charCount / 2);

  if (upperCase) {
    hex = hex.toUpperCase();
  }

  return hex;
}

const log = console.log.bind(console);

// A browser-side client for the Echogarden WebSocket server.
// All messages are msgpack-encoded, and each request is correlated with
// its response messages via a random request ID.
export class BrowserClient {
  sendMessage: (message: any) => void;
  responseListeners = new Map<string, (message: any) => void>();

  constructor(sourceChannel: WebSocket) {
    // Incoming messages arrive as Blobs (the default WebSocket binaryType)
    // and are decoded from msgpack before being dispatched by request ID.
    sourceChannel.addEventListener(
      "message",
      async (messageData: MessageEvent<Blob>) => {
        try {
          const data = await messageData.data.arrayBuffer();
          const incomingMessage = decodeMsgPack(new Uint8Array(data));
          this.onMessage(incomingMessage);
        } catch (e) {
          log(`Failed to decode incoming message. Reason: ${e}`);
          return;
        }
      }
    );

    this.sendMessage = (outgoingMessage) => {
      const encodedMessage = encodeMsgPack(outgoingMessage);
      sourceChannel.send(encodedMessage);
    };
  }

  async synthesize(
    input: string | string[],
    options: SynthesisOptions,
    onSegment?: SynthesisSegmentEvent,
    onSentence?: SynthesisSegmentEvent
  ): Promise<SynthesisResult> {
    const requestOpenPromise = new OpenPromise<SynthesisResult>();

    const requestMessage: SynthesisRequestMessage = {
      messageType: "SynthesisRequest",
      input,
      options,
    };

    // Synthesis may emit intermediate segment/sentence events before the
    // final response arrives.
    function onResponse(
      responseMessage:
        | SynthesisResponseMessage
        | SynthesisSegmentEventMessage
        | SynthesisSentenceEventMessage
    ) {
      if (responseMessage.messageType == "SynthesisResponse") {
        requestOpenPromise.resolve(responseMessage);
      } else if (
        responseMessage.messageType == "SynthesisSegmentEvent" &&
        onSegment
      ) {
        onSegment(responseMessage);
      } else if (
        responseMessage.messageType == "SynthesisSentenceEvent" &&
        onSentence
      ) {
        onSentence(responseMessage);
      }
    }

    function onError(e: any) {
      requestOpenPromise.reject(e);
    }

    try {
      this.sendRequest(requestMessage, onResponse, onError);
    } catch (e) {
      onError(e);
    }

    return requestOpenPromise.promise;
  }

  async requestVoiceList(
    options: VoiceListRequestOptions
  ): Promise<RequestVoiceListResult> {
    const requestOpenPromise = new OpenPromise<RequestVoiceListResult>();

    const requestMessage: VoiceListRequestMessage = {
      messageType: "VoiceListRequest",
      options,
    };

    function onResponse(responseMessage: VoiceListResponseMessage) {
      if (responseMessage.messageType == "VoiceListResponse") {
        requestOpenPromise.resolve(responseMessage);
      }
    }

    function onError(e: any) {
      requestOpenPromise.reject(e);
    }

    try {
      this.sendRequest(requestMessage, onResponse, onError);
    } catch (e) {
      onError(e);
    }

    return requestOpenPromise.promise;
  }

  async recognize(
    input: AudioSourceParam,
    options: RecognitionOptions
  ): Promise<RecognitionResult> {
    const requestOpenPromise = new OpenPromise<RecognitionResult>();

    const requestMessage: RecognitionRequestMessage = {
      messageType: "RecognitionRequest",
      input,
      options,
    };

    function onResponse(responseMessage: RecognitionResponseMessage) {
      if (responseMessage.messageType == "RecognitionResponse") {
        requestOpenPromise.resolve(responseMessage);
      }
    }

    function onError(e: any) {
      requestOpenPromise.reject(e);
    }

    try {
      this.sendRequest(requestMessage, onResponse, onError);
    } catch (e) {
      onError(e);
    }

    return requestOpenPromise.promise;
  }

  async align(
    input: AudioSourceParam,
    transcript: string,
    options: AlignmentOptions
  ): Promise<AlignmentResult> {
    const requestOpenPromise = new OpenPromise<AlignmentResult>();

    const requestMessage: AlignmentRequestMessage = {
      messageType: "AlignmentRequest",
      input,
      transcript,
      options,
    };

    function onResponse(responseMessage: AlignmentResponseMessage) {
      if (responseMessage.messageType == "AlignmentResponse") {
        requestOpenPromise.resolve(responseMessage);
      }
    }

    function onError(e: any) {
      requestOpenPromise.reject(e);
    }

    try {
      this.sendRequest(requestMessage, onResponse, onError);
    } catch (e) {
      onError(e);
    }

    return requestOpenPromise.promise;
  }

  async translateSpeech(
    input: string | Buffer | Uint8Array | RawAudio,
    options: SpeechTranslationOptions
  ): Promise<SpeechTranslationResult> {
    const requestOpenPromise = new OpenPromise<SpeechTranslationResult>();

    const requestMessage: SpeechTranslationRequestMessage = {
      messageType: "SpeechTranslationRequest",
      input,
      options,
    };

    function onResponse(responseMessage: SpeechTranslationResponseMessage) {
      if (responseMessage.messageType == "SpeechTranslationResponse") {
        requestOpenPromise.resolve(responseMessage);
      }
    }

    function onError(e: any) {
      requestOpenPromise.reject(e);
    }

    try {
      this.sendRequest(requestMessage, onResponse, onError);
    } catch (e) {
      onError(e);
    }

    return requestOpenPromise.promise;
  }

  async detectSpeechLanguage(
    input: AudioSourceParam,
    options: SpeechLanguageDetectionOptions
  ): Promise<SpeechLanguageDetectionResult> {
    const requestOpenPromise = new OpenPromise<SpeechLanguageDetectionResult>();

    const requestMessage: SpeechLanguageDetectionRequestMessage = {
      messageType: "SpeechLanguageDetectionRequest",
      input,
      options,
    };

    function onResponse(
      responseMessage: SpeechLanguageDetectionResponseMessage
    ) {
      if (responseMessage.messageType == "SpeechLanguageDetectionResponse") {
        requestOpenPromise.resolve(responseMessage);
      }
    }

    function onError(e: any) {
      requestOpenPromise.reject(e);
    }

    try {
      this.sendRequest(requestMessage, onResponse, onError);
    } catch (e) {
      onError(e);
    }

    return requestOpenPromise.promise;
  }

  async detectTextLanguage(
    input: string,
    options: TextLanguageDetectionOptions
  ): Promise<TextLanguageDetectionResult> {
    const requestOpenPromise = new OpenPromise<TextLanguageDetectionResult>();

    const requestMessage: TextLanguageDetectionRequestMessage = {
      messageType: "TextLanguageDetectionRequest",
      input,
      options,
    };

    function onResponse(responseMessage: TextLanguageDetectionResponseMessage) {
      if (responseMessage.messageType == "TextLanguageDetectionResponse") {
        requestOpenPromise.resolve(responseMessage);
      }
    }

    function onError(e: any) {
      requestOpenPromise.reject(e);
    }

    try {
      this.sendRequest(requestMessage, onResponse, onError);
    } catch (e) {
      onError(e);
    }

    return requestOpenPromise.promise;
  }

  // Tags the outgoing request with a unique ID and registers a listener
  // for response messages carrying the same ID.
  // Note: listeners are currently never removed, so a long-lived client
  // should delete the map entry once a final response or error arrives.
  sendRequest(
    request: any,
    onResponse: (message: any) => void,
    onErrorResponse: (error: any) => void
  ) {
    const requestId = getRandomHexString();

    request = {
      requestId,
      ...request,
    };

    this.sendMessage(request);

    function onResponseMessage(message: any) {
      if (message.messageType == "Error") {
        onErrorResponse(message.error);
      } else {
        onResponse(message);
      }
    }

    this.responseListeners.set(requestId, onResponseMessage);
  }

  // Routes a decoded incoming message to the listener registered for its
  // request ID, if any.
  onMessage(incomingMessage: any) {
    const requestId = incomingMessage.requestId;

    if (!requestId) {
      log("Received a WebSocket message without a request ID");
      return;
    }

    const listener = this.responseListeners.get(requestId);

    if (listener) {
      listener(incomingMessage);
    }
  }
}
```
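For anyone who wants to try it, here's a minimal usage sketch. It assumes the WebSocket endpoint is served on the same default port as the `echogarden serve` HTTP page (45054, mentioned below), and the option values are purely illustrative:

```ts
// Minimal usage sketch: connect to a locally running Echogarden server
// and synthesize some text. The port and option values are assumptions.
const socket = new WebSocket("ws://localhost:45054");

socket.addEventListener("open", async () => {
  const client = new BrowserClient(socket);

  // Option values are illustrative; use `requestVoiceList` to discover
  // which engines and voices the server actually offers.
  const result = await client.synthesize("Hello from the browser!", {
    engine: "espeak",
    language: "en",
  });

  console.log("Synthesis finished:", result);
});
```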
The processing is currently designed to run on Node.js only. I also mentioned this on a recent issue where someone tried to bundle the package in a similar way (it can't be bundled, since it uses many Node.js-only APIs and libraries, and its dependency tree is very complex).
The web-based UI described in this issue is designed to communicate with a Node.js server that does all the processing. I did open a separate issue about porting some features to the browser, but that isn't something I foresee happening any time soon.
Developing a web-based frontend UI would be great, but it would also be a large, complex task, involving many decisions about frameworks and so on. There are also many different options, multiple potential outputs for each operation, and some operations that require specialized UX. In general, I wouldn't compromise on mediocre UX, so developing something high-quality would probably take up to several months of work. I can't commit to that at this time.
I understand, based on the feedback on this tracker, that forced alignment has become a major point of interest for people using this toolset, possibly because it's hard to find good alternatives for that particular task. I'm trying to prioritize it, and I've made a lot of improvements, especially recently with the VAD-based cropping and the work on the Whisper-based guided decoding approach (which can outperform both DTW and DTW-RA in many cases).
Performing forced alignment in the browser is "cool", and I did implement it several years ago in a different project, but it turned out that the use cases for it are relatively limited, and it suffers from various limitations imposed by the browser (I also wrote about this in #48). Porting it to the browser would be a large amount of work, and the benefits/returns aren't that clear to me.
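For reference, the server-side equivalent of the browser client's `align` method is only a few lines of Node.js. A minimal sketch, assuming the top-level `align` export of the `echogarden` package (the same operation the `AlignmentRequest` message above maps to); the input path, transcript, and options are illustrative:

```ts
// Node.js-only sketch: forced alignment through the echogarden API.
// The file path, transcript, and options are placeholders.
import * as Echogarden from "echogarden";

const result = await Echogarden.align(
  "recording.wav",
  "The transcript of the recording goes here.",
  { language: "en" }
);

// The result's timeline holds the aligned entries with time offsets.
console.log(JSON.stringify(result.timeline, null, 2));
```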
True. Makes sense. I was speed-running forced alignment for something I'm building and just realised that there's a lot happening underneath. Better to run it on a server. Thanks.
Currently, running the server (`echogarden serve`) and opening the localhost HTTP page (http://localhost:45054) shows a basic placeholder message ("This is the Echogarden HTTP server!"). Gradually, start developing a graphical user interface to replace it.
Since a lot of the functionality is already available via the command line, there's no need to rush to support all features immediately. Try to concentrate on the features that benefit from a graphical UI the most.
For example, the ability to try different TTS engines and voices is much easier and faster with a UI than with a CLI (a sketch of what that could look like is shown below).
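As a rough sketch, the UI could drive a voice picker over the same WebSocket protocol, reusing the `BrowserClient` posted above. The `voiceList` field and the shape of its entries are assumptions about `RequestVoiceListResult`, and the engine name is illustrative:

```ts
// Hypothetical sketch: populate a voice picker from the server's voice list,
// using the BrowserClient posted earlier in this thread.
async function listVoices(client: BrowserClient) {
  // The engine name is illustrative; omitting it may list voices
  // for all installed engines.
  const result = await client.requestVoiceList({ engine: "vits" });

  // Field names here are assumed from echogarden's voice list result.
  for (const voice of result.voiceList) {
    console.log(`${voice.name} (${voice.languages.join(", ")})`);
  }
}
```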
Task list

- `plainText.*`, `subtitles.*`, etc.

(TODO: extend with more tasks..)