Add audio files for word pronunciation (TTS)

jzohrab commented 9 months ago

Audio files could be stored in user's data folder, and files could be found by md5 of term, e.g.
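A minimal sketch of the md5 lookup idea, assuming one file per term under an `audio/<language>/` subfolder of the data folder. The folder layout, the function names, and the lowercase/NFC normalization are all assumptions for illustration, not anything Lute currently does.

```python
import hashlib
import unicodedata
from pathlib import Path


def audio_path_for_term(data_dir: Path, language: str, term: str, ext: str = "mp3") -> Path:
    """Return the path where a term's audio file would live (hypothetical layout)."""
    # Assumption: trim, lowercase, and NFC-normalize so that visually
    # identical terms always hash to the same file name.
    normalized = unicodedata.normalize("NFC", term.strip().lower())
    digest = hashlib.md5(normalized.encode("utf-8")).hexdigest()
    return data_dir / "audio" / language / f"{digest}.{ext}"


def find_audio(data_dir: Path, language: str, term: str):
    """Return the audio path if a file exists for the term, else None."""
    path = audio_path_for_term(data_dir, language, term)
    return path if path.is_file() else None
```

Hashing avoids problems with characters that are awkward in file names, at the cost of file names that are not human-readable.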

Big effort required:

web service calls to some uri to get a file
support different endpoints? Polly, Azure, Forvo, etc.
API tokens in settings for the different services
selecting the service and voice you want to call
how to store the file, naming etc
allow only one sample, or multiple, for any given word?
include audio in any anki export
play on mouse over setting? or just mouse over the speaker?

Lots of things to do here.

Zireael07 commented 9 months ago

An alternative solution could be to read X-SAMPA and synthesize the sound (I know of at least two tools that can do it, either in Python or JS)

jzohrab commented 4 months ago

From @jubalh in #412:

When I select a word and press t it nicely opens DeepL to translate the whole sentence for me. Similarly I would like to press p (for play) to listen to the whole sentence.

Anki has a TTS addon, and I assume we could use the same services/methods it uses; it seems to work quite reliably. Their list of supported services is here.

For now I have just added Yandex as a dictionary that opens in a popup. I then click the sound button there. But it's a few steps too many to be comfortable to use.

jzohrab commented 4 months ago

Thanks @jubalh for the AwesomeTTS note (edit: HyperTTS appears to be the latest and greatest thing -- https://github.com/Vocab-Apps/anki-hyper-tts). It does use a lot of services. ~~AwesomeTTS~~ HyperTTS actually wrote its own layers between itself and all services to normalize the API, which is the right thing to do. I opened issue 175 in their repo to ask about separating things out, because integrating this within Lute in the way that they did would be a big effort -- if we could re-use their work, it would be a nice timesaver.
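A rough sketch of what a normalizing layer between the app and the services can look like. This is not HyperTTS's actual code: the `Voice`, `TTSService`, `FakeService`, and `voices_for_language` names are hypothetical, and the fake backend makes no network calls.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class Voice:
    service: str   # which backend provides the voice
    name: str      # backend-specific voice identifier
    language: str  # language tag, e.g. "fr-FR"


class TTSService(ABC):
    """Common interface that every backend adapter would implement."""

    @abstractmethod
    def voices(self) -> list[Voice]:
        """List the voices this backend offers."""

    @abstractmethod
    def synthesize(self, text: str, voice: Voice) -> bytes:
        """Return audio bytes for the text."""


class FakeService(TTSService):
    """Stand-in backend for illustration; a real one would call a web API."""

    def voices(self) -> list[Voice]:
        return [Voice("fake", "fake-fr", "fr-FR"), Voice("fake", "fake-vi", "vi-VN")]

    def synthesize(self, text: str, voice: Voice) -> bytes:
        # Placeholder bytes, not real audio.
        return f"{voice.name}:{text}".encode("utf-8")


def voices_for_language(services: list[TTSService], language: str) -> list[Voice]:
    """Collect matching voices across all configured backends."""
    return [v for s in services for v in s.voices() if v.language == language]
```

With this shape, the rest of the app only talks to `TTSService`, so adding a backend means writing one adapter class.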

@Zireael07 - thanks for the note, sorry for no response. I'd not heard of X-SAMPA. I'm not sure how X-SAMPA (or IPA) would be added to Lute offhand, as for most languages it can't be easily accessed, AFAICT.

A final possibility, which isn't great but isn't terrible, is to allow for side-loading of audio files, like an out-of-band process that people can run to download and import audio files. (That process would imply, again, some kind of program to interact with various service APIs, and would eventually morph into something pretty much the same as HyperTTS!) The audio files would get mapped to terms somehow. That could do in a pinch, for people who desperately need audio support for some words. I certainly could use it for some Vietnamese words.
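One possible way to map side-loaded files to terms, sketched under the assumption that each file is named after its term (for example `xin chào.mp3`). The naming convention, the extension list, and the `map_audio_to_terms` function are hypothetical.

```python
import unicodedata
from pathlib import Path

# Assumption: which file types count as audio.
AUDIO_EXTENSIONS = {".mp3", ".ogg", ".wav"}


def _key(text: str) -> str:
    # Some file systems store names in decomposed Unicode form, so
    # normalize both sides before comparing.
    return unicodedata.normalize("NFC", text).strip().lower()


def map_audio_to_terms(folder: Path, known_terms: set[str]):
    """Match audio files in folder to known terms by file name.

    Returns (matched, unmatched): a dict of term -> file path, and a
    list of audio files whose names match no known term.
    """
    lookup = {_key(term): term for term in known_terms}
    matched = {}
    unmatched = []
    for path in sorted(folder.iterdir()):
        if path.suffix.lower() not in AUDIO_EXTENSIONS:
            continue  # ignore non-audio files
        term = lookup.get(_key(path.stem))
        if term is None:
            unmatched.append(path)
        else:
            matched[term] = path
    return matched, unmatched
```

Reporting the unmatched files gives the user a chance to rename them rather than having them silently ignored.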

edit: changed awesometts to hypertts

luc-vocab commented 4 months ago

I've replied with some suggestions over email. Text to speech is not very complicated, but in the case of HyperTTS, it does get complicated because users have very specialized needs (need to be able to change parameters such as speed, want to support mixing of voices, etc). You will probably have a simple approach in Lute and offer fewer voices and services (otherwise it could get overwhelming for the user). If you want to just support one service, Azure has the best language coverage.

Here's a radical idea: why not just use something like Yomichan, which can pronounce words highlighted in the browser, and focus on non-TTS parts? If the TTS workflow is always "select, then pronounce", then Yomichan will do the job; tons of people use it for Japanese.

If you want to have audio files as part of the user's "collection" then of course you'll have to generate them yourself.

jzohrab commented 4 months ago

Hi @luc-vocab , thanks for the above and the email.

Lute is for a bunch of languages: Japanese, French, German, Irish, ... it's a pretty long list and is getting longer. I do want the audio files to be downloaded locally. I've used some services to read my pages as well, and they work pretty well sometimes. But sometimes having a single high-quality audio recording for a short word makes a big difference.

Copying some items from your email here:

So I would say if you want only TTS, then maybe start with forking the HyperTTS code. It'll give you an easy way to filter voices by language and generate audio. If you go that route, your first task will be to get this unit test running: https://github.com/Vocab-Apps/anki-hyper-tts/blob/main/test_tts_services.py . It generates audio files, and then puts them through the Azure speech-to-text API to verify the audio produced is correct. You could just start with one service, and add others as needed. Alternatively, if a single API, say Azure, provides all the voices and languages you need, you could directly call the Azure API. It's not very complicated. I have voice samples in all languages for all services, which can help you make a decision: https://www.vocab.ai/languages . Is your software package hosted, or is it "bring your own API key"? If you offer the service to end users, you have to be careful because users could run up huge bills. You'll have to do detailed usage tracking. And it gets complicated if you want to support multiple services, because the costs are different. ElevenLabs is 15 times as expensive as Azure.
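For a sense of what "directly call the Azure API" might involve, here is a sketch that only builds the HTTP request and sends nothing. It assumes the Azure Speech REST text-to-speech endpoint and headers as commonly documented; the endpoint shape, header names, output format, and voice names should be checked against the current Azure docs. The `build_azure_tts_request` function and the `User-Agent` value are made up for illustration.

```python
import urllib.request
from xml.sax.saxutils import escape, quoteattr


def build_azure_tts_request(text, voice, language, region, api_key):
    """Build (but do not send) a text-to-speech request.

    Assumption: endpoint and headers follow the Azure Speech REST API
    as understood at the time of writing; verify before relying on it.
    """
    # SSML body; the text and attribute values are XML-escaped.
    ssml = (
        "<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' "
        f"xml:lang={quoteattr(language)}>"
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )
    return urllib.request.Request(
        url=f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1",
        data=ssml.encode("utf-8"),
        method="POST",
        headers={
            "Ocp-Apim-Subscription-Key": api_key,  # the user's own key
            "Content-Type": "application/ssml+xml",
            "X-Microsoft-OutputFormat": "audio-16khz-32kbitrate-mono-mp3",
            "User-Agent": "lute-tts-sketch",  # hypothetical client name
        },
    )
```

If the assumptions hold, passing the result to `urllib.request.urlopen(...)` and reading the response should yield the audio bytes; that part is untested here because it needs a real key.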

Yes, it's BYOAPIKey, every user runs locally. I could try working with just one free service and see how it goes.

As an initial implementation for this, I'm pretty sure the right thing to do is an "out-of-band" process where people put their word lists together and run some kind of command-line thing to generate the files, rather than building it right into the Lute UI. Reasons: there are way too many little settings etc, and getting all this done in the UI would be a lot of extra work. Since this is free software, something "good enough" is just fine.
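A sketch of what such a command-line step could look like, assuming a plain-text word list with one term per line and output files named by the md5 of the term, as suggested at the top of this thread. All function names here are hypothetical, and the `synthesize` callable stands in for whichever service ends up being used.

```python
import argparse
import hashlib
from pathlib import Path


def read_word_list(path: Path) -> list[str]:
    """Read one term per line, skipping blank lines and # comments."""
    lines = path.read_text(encoding="utf-8").splitlines()
    return [ln.strip() for ln in lines if ln.strip() and not ln.strip().startswith("#")]


def generate_missing(terms, out_dir: Path, synthesize) -> list[Path]:
    """Write audio only for terms that have no file yet."""
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for term in terms:
        name = hashlib.md5(term.encode("utf-8")).hexdigest() + ".mp3"
        target = out_dir / name
        if target.exists():
            continue  # already generated, so no repeat API charge
        target.write_bytes(synthesize(term))
        written.append(target)
    return written


def _no_backend(term: str) -> bytes:
    # Placeholder: a real tool would call a TTS service here.
    raise NotImplementedError("no TTS backend configured")


def main(argv=None, synthesize=_no_backend) -> int:
    """Parse arguments, generate missing files, return how many were written."""
    parser = argparse.ArgumentParser(description="Generate term audio files.")
    parser.add_argument("word_list", type=Path, help="text file, one term per line")
    parser.add_argument("out_dir", type=Path, help="folder for generated audio")
    args = parser.parse_args(argv)
    written = generate_missing(read_word_list(args.word_list), args.out_dir, synthesize)
    print(f"wrote {len(written)} file(s)")
    return len(written)
```

Skipping files that already exist makes the command safe to re-run as the word list grows, which matters when each call to a paid service costs money.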

Zireael07 commented 4 months ago

@jzohrab I didn't mean that Lute would automagically find or handle IPA or X-SAMPA. I meant that it could have an optional pronunciation field that a user would fill in, and a 'play this' button that could synthesize the sound from it.

Yomichan approach (select instead of a button) is also a good idea

Trying to find an API that could handle any language a user might be learning is imo an exercise in futility

luc-vocab commented 4 months ago

Azure supports everything. They have the broadest language coverage. If you decide to pick only one service, pick Azure.

LuteOrg / lute-v3

Add audio files for word pronunciation (TTS) #36