Open HappyMac3920 opened 10 months ago
Microsoft Clipchamp also has a wider range of Azure TTS voices (even multilingual AI ones!) which work even with a free subscription, but they require an authorization token and an account to use. :(
Of all the requests I get to add voices this looks like the most reasonable and realistic one to look into next. I appreciate the link to rany2's edge-tts project as I suspect we can borrow some stuff from that, namely the client token it seems to require and the URLs it sends requests to.
As for Clipchamp, I'm unfamiliar with this but if it's like you say and requires an account and auth token that could potentially make things more difficult.
Can't promise a timeframe for when this gets done but I will see what I can do.
Just as a side note, there are other implementations of the edge-tts/readaloud interface in languages (like JS) that may suit you better in case you don't like Python. In any case, be aware that this cannot be implemented with simple HTTP requests alone like for all the other sites; you will need to implement a streaming interface using websockets. It can be (and has been) done, but it is a bit more tricky. :)
Well, Clipchamp does require an auth token. My guess is that the token comes from an account, but I wasn't able to find out where it is fetched.
Also, in the voice list I found 4 multilingual voices (fr-FR-VivienneMultilingualNeural, fr-FR-RemyMultilingualNeural, de-DE-FlorianMultilingualNeural and de-DE-SeraphinaMultilingualNeural) that do work, with one caveat. English text on de-DE-FlorianMultilingualNeural synthesized correctly in English. Hungarian text through edge-tts was not spoken properly, while Clipchamp read the text correctly in Hungarian, so I first thought it must be some sort of limitation imposed by the Edge TTS servers. Turns out I was wrong: Hungarian text does work on the multilingual voices, but the language is not always detected properly.
I assume you are referring to https://github.com/Migushthe2nd/MsEdgeTTS but I am not aware of other similar projects that use JavaScript.
There are some JS-based Greasemonkey/Tampermonkey browser plugins, if I remember correctly. Several Chinese sites also have sources for the communication with the MS servers using JS. The best approach is probably to search for the websocket URL endpoint as seen in the Python project (or the token) to find similar projects. I played with several of these a month ago, but abandoned them for easier-to-use TTS sites.
Here are some links I used for information in my integration:
https://github.com/LokerL/tts-vue/blob/main/electron/utils/edge-api.ts
https://www.52pojie.cn/thread-1711887-1-1.html
https://gist.github.com/wilinz/13f4bc343754f01c06a90aa9c12e449a
https://greasyfork.org/en/scripts/471039-gpt%E8%AF%AD%E9%9F%B3%E5%8A%A9%E6%89%8B/code
Copilot's voice is another Microsoft TTS voice that (I think) cannot be implemented on this website, because it sends a message ID to the server rather than plain text. It is multilingual, though.
I am writing this here because it is a bit related to the MS Edge TTS API.
I dug deeper into the Clipchamp TTS API, and here is what I found:
The API first requests a "token" from https://app.clipchamp.com/v2/azure-cognitive/auth-token with my account's auth token required in the headers. If successful, it returns a JSON response containing a JWT client token needed for TTS and a region, in my case "eastus", although I live in Hungary (there might be other regions too).
So we have two tokens now, an account token, and a TTS token.
The voice list is in https://eastus.tts.speech.microsoft.com/cognitiveservices/voices/list but that also requires the account auth token.
A voice request is made as a websocket request, kind of similar to the MS Edge one. The URL is wss://eastus.tts.speech.microsoft.com/cognitiveservices/websocket/v1?Authorization=ttstokengoeshere&X-ConnectionId=connectionidgoeshere
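As a rough illustration of the URL shape above, here is a minimal Python sketch that assembles the websocket address from a region and a TTS token. The function name `build_tts_ws_url` is hypothetical, and using a dash-less UUID as the connection ID is an assumption, not something confirmed for Clipchamp.

```python
import uuid
from typing import Optional
from urllib.parse import urlencode


def build_tts_ws_url(region: str, tts_token: str,
                     connection_id: Optional[str] = None) -> str:
    """Assemble the websocket URL in the shape described in the post above."""
    if connection_id is None:
        # Assumption: a random dash-less UUID is acceptable as X-ConnectionId.
        connection_id = uuid.uuid4().hex
    # urlencode percent-escapes anything that is not URL-safe in the values.
    query = urlencode({"Authorization": tts_token,
                       "X-ConnectionId": connection_id})
    return (f"wss://{region}.tts.speech.microsoft.com"
            f"/cognitiveservices/websocket/v1?{query}")


# Placeholder values only; a real token would come from the auth-token endpoint.
print(build_tts_ws_url("eastus", "ttstokengoeshere", "connectionidgoeshere"))
```

This only builds the address; opening the connection and streaming audio would still need a websocket client.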
SSML is supported in requests, and the rate and the pitch can be modified.
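For reference, a request body with a modified rate and pitch might be built roughly like this. It is a sketch using the standard SSML `speak`/`voice`/`prosody` elements; the helper name `build_ssml` and the default values are assumptions, and the exact markup Clipchamp sends was not captured in this thread.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, voice: str, rate: str = "+0%",
               pitch: str = "+0Hz", lang: str = "en-US") -> str:
    """Wrap plain text in SSML with a prosody element for rate and pitch."""
    return (
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(lang)}>"
        f"<voice name={quoteattr(voice)}>"
        f"<prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}"  # escape &, < and > so the text cannot break the XML
        "</prosody></voice></speak>"
    )


print(build_ssml("Hello & welcome", "de-DE-FlorianMultilingualNeural",
                 rate="+10%", pitch="-2Hz", lang="de-DE"))
```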
Also, I did see similarities in the request form (metadata and audio output):
Clipchamp requests:
{"synthesis":{"audio":{"metadataOptions":{"bookmarkEnabled":false,"punctuationBoundaryEnabled":"false","sentenceBoundaryEnabled":"false","sessionEndEnabled":true,"visemeEnabled":false,"wordBoundaryEnabled":"false"},"outputFormat":"audio-24khz-48kbitrate-mono-mp3"},"language":{"autoDetection":false}}}
MS Edge requests:
{"context":{"synthesis":{"audio":{"metadataoptions":{"sentenceBoundaryEnabled":false,"wordBoundaryEnabled":true},"outputFormat":"audio-24khz-48kbitrate-mono-mp3"}}}}
Notice that the audio output format it requests is the same on both services.
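The comparison can be checked mechanically. The sketch below parses the two payloads exactly as quoted above and compares them; the helper name `audio_settings` is hypothetical. Note two small differences it has to work around: MS Edge wraps everything in an extra "context" object, and the metadata key is spelled "metadataOptions" in one payload and "metadataoptions" in the other.

```python
import json

CLIPCHAMP_CONFIG = (
    '{"synthesis":{"audio":{"metadataOptions":{"bookmarkEnabled":false,'
    '"punctuationBoundaryEnabled":"false","sentenceBoundaryEnabled":"false",'
    '"sessionEndEnabled":true,"visemeEnabled":false,"wordBoundaryEnabled":"false"},'
    '"outputFormat":"audio-24khz-48kbitrate-mono-mp3"},"language":{"autoDetection":false}}}'
)
EDGE_CONFIG = (
    '{"context":{"synthesis":{"audio":{"metadataoptions":'
    '{"sentenceBoundaryEnabled":false,"wordBoundaryEnabled":true},'
    '"outputFormat":"audio-24khz-48kbitrate-mono-mp3"}}}}'
)


def audio_settings(raw: str):
    """Return (outputFormat, metadata options) from either payload shape."""
    data = json.loads(raw)
    # Use the "context" wrapper when present (Edge), else the root (Clipchamp).
    audio = data.get("context", data)["synthesis"]["audio"]
    # Match the metadata key case-insensitively, since the spelling differs.
    metadata = next(v for k, v in audio.items() if k.lower() == "metadataoptions")
    return audio["outputFormat"], metadata


clip_format, clip_meta = audio_settings(CLIPCHAMP_CONFIG)
edge_format, edge_meta = audio_settings(EDGE_CONFIG)
print(clip_format == edge_format)                    # True
print(sorted(set(clip_meta) & set(edge_meta)))       # options present in both
```

Clipchamp also sends some of its boolean options as the string "false" rather than a JSON boolean, which is worth keeping in mind when reusing the Edge payload as a template.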
I am not familiar with how websockets work, but I do know quite a few things about JWTs. Notably, Apple's server that hosts software updates for their products responds with JWTs that contain the update URLs. I have worked with those quite a few times, especially decoding the base64 payload.
So overall, it is very similar to the MS Edge API, but because it requires authorization, I think the Clipchamp TTS API would be difficult to implement. As I mentioned in my second post, Clipchamp TTS is free, so it does not require a subscription at all, only a Clipchamp account (which can come from a Microsoft Account, a Google Account, or an e-mail address).
Microsoft Edge has a feature called Immersive Reader that can read web pages to the user. Before you ask: yes, it has the same voices as Bing Translator, except that the Edge API has even more voices (such as en-US-AndrewNeural and en-US-AvaNeural, although it is important to note that the API might not have every voice, such as fi-FI-SelmaNeural), and the audio quality is better than with the Bing Translator API. This has already been reverse-engineered in https://github.com/rany2/edge-tts so if you plan on implementing it on the website, this can give you a head start.
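Decoding a JWT payload takes only a few lines. The following is a generic Python sketch (the function name `decode_jwt_payload` is hypothetical) that reads the middle segment of a token without verifying its signature. Which claims the Clipchamp TTS token actually carries is not known from this thread.

```python
import base64
import json


def decode_jwt_payload(token: str) -> dict:
    """Decode the payload (middle) segment of a JWT. No signature check."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with header.payload.signature")
    payload_b64 = parts[1]
    # JWTs strip base64 padding; restore it to a multiple of 4 characters.
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(padded))
```

This is only useful for inspection, for example to look at an expiry time; it says nothing about whether the token is valid.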