Use a TTS voice across regions with the same language

marisademeglio commented 8 months ago

Create the TTS config file with multiple voice entries for the same voice name, iterating through all available language+region combinations (in the voices list) that include the voice's language.

This way, a preferred en-CA voice could get used for an en-IN document even if no en-IN voice is preferred.

See https://daisy-dev.slack.com/archives/C064GB8U9/p1700499741742109

marisademeglio commented 8 months ago

This will also require an adjustment when ingesting existing settings files as they all have prio = 1 but we want the preferred voices to have prio=2 and the derived entries for that same voice to have prio=1
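To make the proposal concrete, here is a minimal sketch of the expansion, assuming the voices list can be reduced to a plain list of locale strings. The function name `expand_voice` and the voice name `en-CA-Voice` are hypothetical; only the prio=2 / prio=1 split comes from the comment above.

```python
def expand_voice(voice_name, voice_locale, available_locales):
    """Return (locale, voice name, priority) rows for one preferred voice.

    The voice's own locale gets priority 2; every other available locale
    with the same language subtag gets a derived entry with priority 1.
    """
    language = voice_locale.split("-")[0].lower()
    rows = [(voice_locale, voice_name, 2)]  # the user's explicit choice
    for locale in sorted(set(available_locales)):
        if locale == voice_locale:
            continue
        if locale.split("-")[0].lower() == language:
            rows.append((locale, voice_name, 1))  # derived entry
    return rows


# Hypothetical voice and locale list, for illustration only.
rows = expand_voice("en-CA-Voice", "en-CA", ["en-CA", "en-IN", "en-US", "fr-CA"])
for row in rows:
    print(row)
```

With these inputs the en-CA voice also gets entries for en-IN and en-US, but not for fr-CA.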

bertfrees commented 2 months ago

Now that locales in the voice config XML are interpreted as language ranges, it should no longer be necessary to iterate through all available language+region combos. A single entry with just the language subtag (e.g. en) has the same effect. I might still want to have two priority levels though.
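As an illustration of what "language range" means here, the sketch below implements the basic filtering rule from RFC 4647: a range matches a tag if they are equal, or if the range is a prefix of the tag followed by a hyphen. Whether Pipeline applies exactly this rule is an assumption; the function name `matches_range` is made up for the example.

```python
def matches_range(language_range, locale):
    """RFC 4647 basic filtering, case-insensitive.

    "en" matches "en" and "en-IN", but not "eng" and not "fr-CA".
    The special range "*" matches every tag.
    """
    rng = language_range.lower()
    tag = locale.lower()
    return rng == "*" or tag == rng or tag.startswith(rng + "-")


print(matches_range("en", "en-IN"))     # True
print(matches_range("en-IN", "en-US"))  # False
```

Note that the match only goes one way: a range of en-IN does not match a plain en tag.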

marisademeglio commented 2 months ago

Ok can we close this issue? Prioritization should be covered by #169 .

bertfrees commented 2 months ago

@marisa Here is an example of a voice configuration XML to demonstrate what I said about locales being language ranges now:

<config>
  <voice lang="en-IN" engine="google" name="en-IN-Standard-A" gender="female-adult" priority="1"/>
  <voice lang="en" engine="google" name="en-IN-Standard-A" gender="female-adult" priority="1"/>
  <voice lang="en-US" engine="google" name="en-US-Wavenet-J" gender="male-adult" priority="1"/>
  <voice lang="en" engine="google" name="en-US-Wavenet-J" gender="male-adult" priority="1"/>
</config>

Compared to what the current config looks like when you select the "en-IN-Standard-A" and "en-US-Wavenet-J" voices, the new config above has two additional voice mappings, for lang="en". This is needed because a voice for a specific region is no longer automatically applied to locales without that region subtag. Since en is equivalent to en-*, one of these new mappings will be used for "en" dialects other than "en-IN" and "en-US". Note that it is still unpredictable which voice will be chosen for e.g. "en-GB", unless a gender has been specified in CSS. This is where the priority attribute would help to make one of the voices more preferred (https://github.com/daisy/pipeline-ui/issues/169).
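The ambiguity for "en-GB" can be seen by listing which mappings in the config above are candidates for a requested locale. This is only a sketch: the `candidates` helper is hypothetical, it assumes prefix-style language range matching, and it does not model how Pipeline actually ranks the candidates.

```python
import xml.etree.ElementTree as ET

# The voice configuration from the comment above, copied verbatim.
CONFIG = """<config>
  <voice lang="en-IN" engine="google" name="en-IN-Standard-A" gender="female-adult" priority="1"/>
  <voice lang="en" engine="google" name="en-IN-Standard-A" gender="female-adult" priority="1"/>
  <voice lang="en-US" engine="google" name="en-US-Wavenet-J" gender="male-adult" priority="1"/>
  <voice lang="en" engine="google" name="en-US-Wavenet-J" gender="male-adult" priority="1"/>
</config>"""


def candidates(config_xml, locale, gender=None):
    """List (name, priority) for every mapping whose lang range covers locale.

    If a gender is given, only mappings with that gender (or "*") are kept.
    """
    found = []
    tag = locale.lower()
    for voice in ET.fromstring(config_xml).findall("voice"):
        rng = voice.get("lang").lower()
        if not (tag == rng or tag.startswith(rng + "-")):
            continue
        if gender is not None and voice.get("gender") not in (gender, "*"):
            continue
        found.append((voice.get("name"), int(voice.get("priority"))))
    return found


print(candidates(CONFIG, "en-GB"))                # two candidates, same priority
print(candidates(CONFIG, "en-GB", "male-adult"))  # one candidate
```

For "en-GB" both lang="en" mappings qualify with equal priority, so nothing in the config breaks the tie. Requesting a gender, or raising one mapping's priority, would.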

MDipendra commented 2 months ago

Expected behaviour of TTS in Pipeline app:

Case 1: In Pipeline app, we select Indian English voice. Document can have any dialect of English as Lang attribute. Example, US English or a mix of US English and Australian English. Then: Recording is done in chosen Indian English voice.

Case 2: In Pipeline app, we select Indian English voice. Document has text in any dialect of English and in Hindi. Then: English text is recorded in the chosen Indian English voice and Hindi text is recorded in one of the Hindi voices.

Case 3: In Pipeline app, we select Indian English and Australian English voices. Document has text in Indian English and US English. Then: Since there is a match for Indian English and no match for US English, the entire document is recorded in the Indian English voice and the Australian English voice is ignored.

Case 4: In Pipeline app, we select Indian English, Nigerian English and US English voices. Document has text in Indian English, Nigerian English and US English Then: Text is recorded in their respective dialects.
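The four cases can be written down as a small rule set. This is a sketch of the expected behaviour only, not of how Pipeline currently selects voices; the function names are hypothetical, and each selected voice is represented just by its locale.

```python
def primary(locale):
    """Return the language subtag, e.g. "en" for "en-IN"."""
    return locale.split("-")[0].lower()


def assign_voices(selected_locales, document_locales):
    """Map each document locale to the locale of the selected voice to use.

    None means no selected voice applies and the engine chooses on its own.
    """
    selected = list(selected_locales)
    # Selected voices that match some part of the document exactly.
    exact = [loc for loc in document_locales if loc in selected]
    result = {}
    for loc in document_locales:
        if loc in selected:
            result[loc] = loc  # Case 4: exact dialect match
            continue
        same_lang_in_doc = [s for s in exact if primary(s) == primary(loc)]
        same_lang = [s for s in selected if primary(s) == primary(loc)]
        if same_lang_in_doc:
            result[loc] = same_lang_in_doc[0]  # Case 3: reuse a voice already in use
        elif same_lang:
            result[loc] = same_lang[0]  # Cases 1 and 2: same language, other dialect
        else:
            result[loc] = None  # Case 2: Hindi text, no Hindi voice selected
    return result


print(assign_voices(["en-AU", "en-IN"], ["en-IN", "en-US"]))
```

In the Case 3 call above both document locales map to en-IN, and the en-AU voice is never used.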

marisademeglio commented 2 months ago

Ok thanks for the info both of you; I am thinking about how to best present all this information.

A user having to decide "I want to prioritize these voices, and out of the 3 English ones, I want this en-IN one as the one to use across other types of English" is a rather complicated question.

I was playing with a table of voices to see how many there are for each language, you can see it here: https://664e9474fae363742287c120--pipeline-voices-table.netlify.app/

Across 3 engines, there are 200+ English voices, not to mention 80 languages (not counting regions, just languages).

We probably need some additional settings dialog screens to help make this more manageable. A side effect should be that screen-readers would get less overwhelmed by a giant list of voices.

MDipendra commented 2 months ago

A good example of this is how we choose voices in NVDA. Instead of a table of all voices, we have a set of 3 combo-boxes:

TTS, language (including the dialect), and the list of voices.

The choice in the first combo-box populates the second combo-box. Similarly, the choice made in the second combo-box populates the third combo-box.

The updating of the second combo-box should wait till the choice in first combo-box is finalized. In other words, When we are in first combo-box, arrow keys should expand the list of available choices and not start updating the dependent combo-box on every down-up-arrow key press.
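The dependency between the three combo-boxes amounts to two filter functions over the voice list. Below is a minimal sketch; the function names and the dictionary layout are assumptions, and the sample data just reuses voice names that appear elsewhere in this thread.

```python
# Sample voice list for illustration; the real list comes from the engines.
VOICES = [
    {"engine": "azure", "lang": "en-IN", "name": "Ananya"},
    {"engine": "azure", "lang": "en-IN", "name": "Aarav"},
    {"engine": "azure", "lang": "en-US", "name": "Ava"},
    {"engine": "google", "lang": "en-IN", "name": "en-IN-Standard-A"},
    {"engine": "google", "lang": "en-US", "name": "en-US-Wavenet-J"},
]


def language_options(voices, engine):
    """Options for the second combo-box, given the first one's choice."""
    return sorted({v["lang"] for v in voices if v["engine"] == engine})


def voice_options(voices, engine, lang):
    """Options for the third combo-box, given the first two choices."""
    return sorted(
        v["name"] for v in voices if v["engine"] == engine and v["lang"] == lang
    )


print(language_options(VOICES, "azure"))        # ['en-IN', 'en-US']
print(voice_options(VOICES, "azure", "en-IN"))  # ['Aarav', 'Ananya']
```

To get the behaviour described above, the UI would call `language_options` only once the first combo-box choice is committed, not on every arrow key press.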

marisademeglio commented 1 month ago

Ok that's an idea to make selecting the voices easier. And then we need another way to look at all the selected voices in a language (not counting region) to choose one to use as the fallback for that language.

We could also start out by having the user identify which languages they are interested in, and then add voices to each language (again language not counting region).

MDipendra commented 1 month ago

Sure, that approach would also be good to select languages first.

Thanks

Dipendra

marisademeglio commented 1 month ago

@MDipendra @prashantverma2014 @bertfrees Could you look at this idea and let me know if it could work for the TTS settings dialog? You can pick preferred voices via drop down filters like what Dipendra described, and then you can pick one voice per language "group" (e.g. language regardless of region) to be the default.

Suggestions welcome!

Again, the voice list is hard coded, it's just a mock-up.

https://665145ce30bad9a628e61770--pipeline-voices-table.netlify.app/voices2.html

prashantverma2014 commented 1 month ago

Dear Marisa,

I like this design.

A few suggestions and questions:

The “Code” drop-down could be renamed “Dialect”, and its list could show the full language name in addition to the language code, for example “en-IN English (India)”.
In the Voices list it would be better to write all names in English. For example, Hindi voice names are currently written in the Hindi script, and the user may not have configured the screen reader to speak different languages.
What happens after I select one language and voice? I assume that I should be able to select another language and voice, and that it will be listed in the table below. Currently this does not happen. I think you could add buttons like “Add a TTS voice” and “Reset/Delete” to this screen so that users can set up more than one language, each with a preferred voice.

Thanks,

Prashant

marisademeglio commented 1 month ago

thanks @prashantverma2014 for having a look!

I will implement your suggestion for language name display.

As for selecting a voice, if you pick one from the drop down, it should appear with a checkbox that says "select as a preferred voice"

then in the table of preferred voices you can pick one for each language group to be the default, eg one english voice for dialects that have no specific setting.

but if this isn't working then maybe it's a browser issue. The actual UI should behave ok in this respect.

as for voice names, we don't control that as far as I am aware. That info comes from the TTS engine directly.

bertfrees commented 1 month ago

@marisademeglio I like the interface. I'm just not sure whether marking one voice as default for a language is going to be enough. What will it mean to be default? There is region, but there is also gender and age. Note that Pipeline attaches more importance to gender and age than to region/accent when selecting voices.

So I don't know whether or not it may be useful to have multiple "defaults" per language. We could e.g. allow one default per language/gender/age combination? I think that might make sense.

bertfrees commented 1 month ago

Case 3: In Pipeline app, we select Indian English and Australian English voices. Document has text in Indian English and US English. Then: Since there is a match for Indian English and no match for US English, the entire document is recorded in the Indian English voice and the Australian English voice is ignored.

This is something we can do automatically. Selecting voices for all sentences is already done before the sentences are narrated, so this might be feasible.

Being able to select a default voice for a language will still be needed though, for other use cases. But both features are compatible AFAICS, that shouldn't be a problem.

Nevertheless it seems too big of a change for this short development sprint, and it would need extensive testing. So I think this is something for a following release.

About what I said before:

There is region, but there is also gender and age. Note that Pipeline attaches more importance to gender and age than to region/accent when selecting voices.

Perhaps for now it is sufficient if we include gender/age in the interface. That might make it clear for users that "default" does not mean "when there is no exact match for a given dialect", but rather "when there is no exact match for a given age/gender/dialect".

marisademeglio commented 1 month ago

That sounds like it could work - one thing though, does the endpoint return age info? I don't remember seeing it.

bertfrees commented 1 month ago

Age and gender are actually combined in a single attribute "gender" in the web service, sorry for the confusion. The attribute can be * (neutral) / male-adult / male-child / male-elderly / female-child / female-adult / female-elderly.

I don't know how it is best presented in the UI. When age is specified in CSS, it is specified in combination with gender, but not in a single keyword: https://www.w3.org/TR/css-speech-1/#typedef-generic-voice. (Note that this is not the exact CSS syntax that is currently supported by Pipeline, but I want to become compatible with this syntax.)
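If the UI wants to show gender and age separately, the combined attribute can be split on the client side. A minimal sketch, assuming the attribute only ever takes the seven values listed above; the name `split_gender` is hypothetical.

```python
# The values listed above for the web service's "gender" attribute.
VALID_GENDERS = {
    "*",
    "male-adult", "male-child", "male-elderly",
    "female-child", "female-adult", "female-elderly",
}


def split_gender(value):
    """Split the combined attribute into (gender, age).

    "*" (neutral) yields (None, None); unknown values raise ValueError.
    """
    if value not in VALID_GENDERS:
        raise ValueError(f"unexpected gender value: {value!r}")
    if value == "*":
        return (None, None)
    gender, age = value.split("-", 1)
    return (gender, age)


print(split_gender("female-adult"))  # ('female', 'adult')
print(split_gender("*"))             # (None, None)
```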

marisademeglio commented 1 month ago

Does this make sense?

3 English voices are "preferred": Ava, Ananya, and Aarav.

2 of them are "default" for English: Ananya and Ava

1 of them is "high" priority, as indicated by the user: Ananya

and the configuration looks like this:

<config>
  <voice engine="azure" name="Ananya" lang="en-IN" gender="female-adult" priority="2"/>
  <voice engine="azure" name="Ananya" lang="en" gender="female-adult" priority="4"/>
  <voice engine="azure" name="Ava" lang="en-US" gender="female-adult" priority="1"/>
  <voice engine="azure" name="Ava" lang="en" gender="female-adult" priority="3"/>
  <voice engine="azure" name="Aarav" lang="en-IN" gender="male-adult" priority="1"/>
</config>

I don't know what scenario the user is facing when they indicate normal/high priority as well as default=yes or default=no.

But this represents what I've heard we need in the TTS settings, from this issue and also #169.

Below is a screenshot of what the dialog currently looks like. A brief description is:

Top of dialog: Series of drop down boxes for finding and adding voices to the preferred voices table

Bottom of dialog: Preferred voices table, with voice info for each and options to make a voice the "default", to set its priority (high/normal) and to remove it from the list.

bertfrees commented 1 month ago

The way I had understood we were going to do it, is that we were going to allow all preferred voices to be used across regions with the same language, and that "default" just was a different word for "higher priority". I didn't expect two settings.

So, in my understanding, in the example with the three preferred English voices, setting the female Indian English voice Ananya as default (high priority) would result in:

<config>
  <voice engine="azure" name="Ananya" lang="en-IN" gender="female-adult" priority="2"/>
  <voice engine="azure" name="Ananya" lang="en" gender="female-adult" priority="2"/>
  <voice engine="azure" name="Ava" lang="en-US" gender="female-adult" priority="1"/>
  <voice engine="azure" name="Ava" lang="en" gender="female-adult" priority="1"/>
  <voice engine="azure" name="Aarav" lang="en-IN" gender="male-adult" priority="1"/>
  <voice engine="azure" name="Aarav" lang="en" gender="male-adult" priority="1"/>
</config>

This means:

Aarav is used when a male English voice is requested.
Ava is used when a female American English voice is requested.
Ananya is used when any other (female) English voice (e.g. British English) is requested.

What scenarios can we think of that require controlling which preferred voices can not be used across regions, or that require more than two levels of preference?

marisademeglio commented 1 month ago

Yes I have been confused by separating default vs priority and initially I had implemented what you described, e.g. "default" = higher priority in the config file. But then I went back and read through all the comments here and I was afraid I'd missed something because you mentioned you might still want 2 priority levels.

But if that's not required in addition to default-ness then it gets way simpler which is great!

bertfrees commented 1 month ago

Some suggestions:

In the voices dialog, you capitalize the engine name. Engines also have a "display name". I will make that available in the web API.
The gender/age options could maybe be presented better also. Instead of "Female-adult" etc., maybe something like "Female adult" (or just "Female" or "Woman"), "Female elderly" (or "Old woman"), "Female child" (or "Girl"), "Neutral" (or "Gender neutral"), ...

In fact "Unknown" would be more accurate than "Neutral". Voices marked "neutral" (or "*", same thing) match any requested gender, so it's effectively like a wildcard. But in practice this category is used primarily for voices for which we can't automatically determine the gender, notably the macOS voices. We should probably make two separate categories.

(Actually it might be useful to be able to select the gender for which the macOS voice is to be used when it is selected as a preferred voice.)

Perhaps another possibility would be to have two options, one for gender and one for age, although that might become a bit much.
The list of preferred voices basically represents the voice config XML. In that regard, I thought it might be a good idea to present it in such a way that it makes it more obvious how the voice selection will work. This is currently left up to the user to guess, and I think it's pretty intuitive so should be fine, but still... The user might not realize at first that an American English preferred voice will be used for any English content (unless another preferred voice is selected as default). So perhaps each preferred voice could have a column or ℹ button that tells you in plain English when the voice will be used, similar to how I did it in my comment above.

One final thing: I noticed that while searching for a voice I'm kind of missing seeing the filtered voice list/table instantly, like we had previously. E.g. you used to be able to see all the English voices with corresponding accent and gender, without having to decide on an accent and gender first. That is not possible anymore. But I guess that is the sacrifice for having a simpler and more accessible interface.

bertfrees commented 1 month ago

I noticed a small hitch: when you first select an engine to filter by, then a language, the gender/age and voice options are not updated.

marisademeglio commented 1 month ago

Oh I can't reproduce that at all, I just tried a few different combos.

initially showing all voices:

select an engine and get fewer voices:

bertfrees commented 1 month ago

And does the "Clear default for English" work for you?

marisademeglio commented 1 month ago

And does the "Clear default for English" work for you?

Yeah, no issues. It doesn't work for you? Maybe delete your settings file and restart the app? But first post your settings file here and let me see what's going on with it. Minus any API keys ofc.

bertfrees commented 1 month ago

Ah it seems to have an effect on settings.json and ttsConfig.xml, the UI is just not updated. I need to close and reopen the settings window.

bertfrees commented 1 month ago

Unfortunately there is still an issue with the voice selection. It's due to my wrong advice 😞.

I think it needs to be done as follows (but let me first double-check it):

* Preferred: priority 1
* Default: second voice with lang = primary language + priority 2

The way we are doing it now results in the default voice being used even when there is a more suitable preferred voice. My bad 😞

marisademeglio commented 1 month ago

Unfortunately there is still an issue with the voice selection. It's due to my wrong advice 😞.

I think it needs to be done as follows (but let me first double-check it):

* Preferred: priority 1
* Default: second voice with lang = primary language + priority 2

The way we are doing it now results in the default voice being used even when there is a more suitable preferred voice. My bad 😞

Ok no problem to change it, so the ttsConfig XML file would look like this then?

<config>
  <voice engine="azure" name="Ananya" lang="en-IN" gender="female-adult" priority="1"/>
  <voice engine="azure" name="Ananya" lang="en" gender="female-adult" priority="2"/>
  <voice engine="azure" name="Ava" lang="en-US" gender="female-adult" priority="1"/>
  <voice engine="azure" name="Ava" lang="en" gender="female-adult" priority="1"/>
  <voice engine="azure" name="Aarav" lang="en-IN" gender="male-adult" priority="1"/>
  <voice engine="azure" name="Aarav" lang="en" gender="male-adult" priority="1"/>
</config>

Where Ananya is default and Ava, Ananya, and Aarav are preferred?

bertfrees commented 1 month ago

No, like this:

<config>
    <voice engine="azure" name="Ananya" lang="en-IN" gender="female-adult" priority="2"/>
    <voice engine="azure" name="Ananya" lang="en" gender="female-adult" priority="2"/>
    <voice engine="azure" name="Ava" lang="en-US" gender="female-adult" priority="1"/>
    <voice engine="azure" name="Aarav" lang="en-IN" gender="male-adult" priority="1"/>
</config>

But I need to double-check it, don't want to be wrong this time.
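For reference, the config above could be generated from the list of preferred voices roughly as follows. This is a sketch that reproduces this one example, pending the double-check: the function `build_config` and the dictionary layout are assumptions, and the rule it encodes (the default voice gets priority 2 on both its own locale and the bare language, all other preferred voices get a single priority 1 entry) is read off the example rather than confirmed.

```python
import xml.etree.ElementTree as ET


def build_config(preferred, default_name):
    """Build a <config> element from preferred voices and one default voice."""
    config = ET.Element("config")
    for voice in preferred:
        is_default = voice["name"] == default_name
        langs = [voice["lang"]]
        language = voice["lang"].split("-")[0]
        if is_default and language != voice["lang"]:
            langs.append(language)  # extra entry for the whole language
        for lang in langs:
            ET.SubElement(
                config,
                "voice",
                engine=voice["engine"],
                name=voice["name"],
                lang=lang,
                gender=voice["gender"],
                priority="2" if is_default else "1",
            )
    return config


preferred = [
    {"engine": "azure", "name": "Ananya", "lang": "en-IN", "gender": "female-adult"},
    {"engine": "azure", "name": "Ava", "lang": "en-US", "gender": "female-adult"},
    {"engine": "azure", "name": "Aarav", "lang": "en-IN", "gender": "male-adult"},
]
config = build_config(preferred, default_name="Ananya")
print(ET.tostring(config, encoding="unicode"))
```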

bertfrees commented 1 month ago

For a next release: it might be useful to be able to select one default voice for each language/gender combination.

MDipendra commented 1 month ago

I confirm that we were able to test out 2 cases and run the TTS conversion successfully.

In the first case, the document was in US English and I set an Indian English voice as my preferred voice for English. The resulting audio was recorded in the Indian English voice.

In the second case, the document was in US English and Hindi. We chose an Indian English voice as the preferred voice for English and chose no other voice. The English text was recorded in the Indian English voice, and a Hindi voice was picked automatically for the Hindi text.

In a third case, we also added a Hindi voice as the preferred voice for Hindi, and that voice was used in the recording.

However, we faced an issue with Save As DAISY: the Hindi text in the document was not marked up correctly, and we had to mark it up manually inside the DTBook XML document to get the desired results. Without the correct markup, the Hindi text was skipped and not read out.

daisy / pipeline-ui

Use a TTS voice across regions with the same language #170