2noise / ChatTTS

A generative speech model for daily dialogue.
https://2noise.com
GNU Affero General Public License v3.0

what are the emotional tokens that are supported? #769

Open robbiemu opened 1 month ago

robbiemu commented 1 month ago

The documentation, in advanced usage, seems to indicate that we can add emotional tokens to guide the voice. Is there more documentation on this? What is the set of emotional tokens supported (often there are at least 7)?

fumiama commented 1 month ago

Right now only [laugh] is available. We may release a new model with more emotion tokens later.

robbiemu commented 1 month ago

Ooh, really? There used to be other tokens, apparently: https://www.youtube.com/watch?v=MpVNZA6__3o (timestamp 2:54, and see 5:00 for an example that sounds pretty convincing in playback).

fumiama commented 1 month ago

The other tags are just speed, oral, etc. Of course they are emotional tags too, but they are commonly available in other TTS systems, so I didn't mention them. What I mean is that other emotional tags like anger, sigh, etc. are not available.

robbiemu commented 1 month ago

Yes, at least some also support a strength modifier. Is there documentation about the forms and configurations of these that you could link to, if they are common? No need to rewrite common work, but it would be great for us to know what the ground rules are.

fumiama commented 1 month ago

You can refer to the tokenizer folder in the HuggingFace repo of ChatTTS. There are some JSON files describing all the tokens we use, including the special tokens. However, some of these tokens may not work because of the lack of dataset coverage. All of the working special tokens have been given in the example code in the README.
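
For anyone else reading along, here is a minimal sketch of how those JSON files could be inspected programmatically. The repo id 2Noise/ChatTTS and the assumption that special tokens live under an "added_tokens" list (as in the entries quoted later in this thread) are mine, not from this reply; adjust to the actual file layout.

    # Sketch: list the special tokens described in the ChatTTS tokenizer JSON files.
    # The repo id "2Noise/ChatTTS" and the file layout are assumptions; adjust as needed.
    import json
    from pathlib import Path
    from huggingface_hub import snapshot_download

    # Fetch only the JSON files from the model repo (the tokenizer files are among them).
    local_dir = snapshot_download(
        repo_id="2Noise/ChatTTS",
        allow_patterns=["*.json", "**/*.json"],
    )

    special_tokens = set()
    for json_file in Path(local_dir).rglob("*.json"):
        try:
            data = json.loads(json_file.read_text(encoding="utf-8"))
        except ValueError:
            continue
        if not isinstance(data, dict):
            continue
        # tokenizer.json-style files keep added tokens (like the entries quoted
        # later in this thread) under "added_tokens", with "special": true.
        for entry in data.get("added_tokens", []):
            if entry.get("special"):
                special_tokens.add(entry["content"])

    print(sorted(special_tokens))  # should include [laugh], [uv_break], ...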

robbiemu commented 1 month ago

Thanks, that is an exhaustive list (it's not in this repo? Pardon the naivety: what is the difference between the HF model and the source here?).

I do still think there's an issue, but at this point I feel like maybe this repo isn't where it belongs; or if it is, I could help. Looking at these tags:

    {
      "id": 21137,
      "content": "[uv_break]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },
    {
      "id": 21142,
      "content": "[laugh]",
      "single_word": false,
      "lstrip": false,
      "rstrip": false,
      "normalized": false,
      "special": true
    },

These were two I saw in the video. Neither has a description, but in this case I can work out what they are. This JSON had to come from somewhere; is there a way to find out more about the tags so this could be documented? (I just noticed the strength modifiers for laugh are separate entries, so I have one less question :) )

fumiama commented 1 month ago

The tokenizer is part of the ChatTTS model and it was defined before training. As for the modifiers, they are simply separate series of special tokens: the name, then _, then a number. All the tokens we use are in that folder, and you can ignore most of them because they are just normal tokens.
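
To make the name_number format concrete, here is a minimal sketch of how those tokens tend to be used, based on the example usage in the README: the numbered tokens ([oral_2], [laugh_0], [break_6], etc.) go into the refine-text prompt, while tokens such as [uv_break] and [laugh] can be written directly into the input text. The API names (Chat.load, RefineTextParams, infer) are taken from the README example and may differ between versions.

    # Sketch based on the README example usage; names/signatures may vary by version.
    import ChatTTS

    chat = ChatTTS.Chat()
    chat.load(compile=False)

    # "name_number" special tokens act as a prompt for the text-refinement stage.
    params_refine_text = ChatTTS.Chat.RefineTextParams(
        prompt="[oral_2][laugh_0][break_6]",
    )

    # Tokens such as [uv_break] and [laugh] can also be embedded directly in the text.
    texts = ["So we tried [uv_break] a few variations [laugh] and compared the results."]

    wavs = chat.infer(texts, params_refine_text=params_refine_text)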

robbiemu commented 1 month ago

It seems like the numbers just refer to different samples, not really strength. And multiple params affect each other in some way: I noticed I was able to get 3 different laughs, at two different positions in my sample, by changing the numbers of oral and break, using those three. It's not clear to me what the calculus is, after playing with it for a bit.

Also, if I may ask, during generation, I notice relatively small maximums:

text:   6%|▋         | 24/384(max) [00:00, 47.12it/s]
code:   9%|▉         | 187/2048(max) [00:02, 66.77it/s]

This is just a maximum input token length, right? Is it all being done locally? (Is there a tool to get the number of tokens for my text?)

fumiama commented 4 weeks ago

It's not clear to me what the calculus is, after playing with it for a bit.

They're just some special prompts to the GPT.

this is just a maximum input token length, right?

No, they're the max infer iter.

it is all being done locally?

Of course.

is there a tool to get the number of tokens for my text?

Basically 1 word per token. You can refer to the code for more details.
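
As a side note for anyone hitting those limits: the 384 and 2048 shown in the progress bars appear to match per-stage decoding limits (text refinement and code generation), not an input-length cap. The sketch below is hedged; the max_new_token field names are an assumption of mine and should be checked against the source.

    # Sketch: the progress-bar maxima (384 / 2048) look like per-stage decoding limits.
    # The max_new_token field names are assumptions; confirm them in the ChatTTS source.
    import ChatTTS

    chat = ChatTTS.Chat()
    chat.load(compile=False)

    params_refine_text = ChatTTS.Chat.RefineTextParams(max_new_token=384)   # "text" bar
    params_infer_code = ChatTTS.Chat.InferCodeParams(max_new_token=2048)    # "code" bar

    wavs = chat.infer(
        ["Everything here runs on the local machine."],
        params_refine_text=params_refine_text,
        params_infer_code=params_infer_code,
    )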

robbiemu commented 4 weeks ago

Thank you for all the help. Just one more bit to ask about:

this is just a maximum input token length, right?

No, they're the max infer iter. ...

is there a tool to get the number of tokens for my text?

Basically 1 word per token. You can refer to the code for more details.

The latter question here relates to the former: I was saying tokens, but apparently the correct term is "infer iter".

When you say iter, do you mean iteration, as in raw word count (splitting the line on spaces and maybe hyphens, etc.)?

fumiama commented 4 weeks ago

The iteration means the number of inferring steps of the GPT. If you have tried ChatGPT, you can see that it replies word by word. Basically, when the GPT finishes one iteration, one word (or token, strictly speaking) comes out.

robbiemu commented 4 weeks ago

My question is meant to find out how we can avoid sending too long a string to the model. I am looking for a way to intelligently chunk it, so I need to know how to measure a string and find out if it will fit.

fumiama commented 4 weeks ago

Just use the word count.
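
A minimal sketch of that advice: split the input by word count before calling infer, so each chunk stays comfortably below the limits shown in the progress bars. The 100-word budget below is an arbitrary illustrative value, not a documented limit.

    # Sketch: naive word-count chunker, following the advice above.
    # The 100-word budget is an arbitrary example, not a documented limit.
    def chunk_by_words(text: str, max_words: int = 100) -> list[str]:
        words = text.split()
        return [
            " ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)
        ]

    long_text = "..."  # the full input text goes here
    chunks = chunk_by_words(long_text)
    # Each chunk can then be passed to chat.infer() separately and the resulting
    # wav segments concatenated.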