robbiemu opened this issue 1 month ago

The documentation, in Advanced Usage, seems to indicate we can add emotional tokens to guide the voice. Is there more documentation on this? What is the set of emotional tokens supported (often there are at least 7)?
Now there's only [laugh] available. Maybe we will release a new model with more emotional tokens later.
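For reference, the working tags are placed directly in the input text. A minimal sketch based on the README's basic usage (the exact loading call differs between versions of this project, so treat the names here as assumptions):

```python
import ChatTTS

# Sketch only: the loading call (load_models vs. load) varies by version.
chat = ChatTTS.Chat()
chat.load_models()

# [laugh] is inserted inline where the laugh should occur;
# [uv_break] inserts an unvoiced pause.
text = "What a surprise! [laugh] I did not expect that at all. [uv_break] Anyway, moving on."
wavs = chat.infer([text])
```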
Ooh, really? There used to be other tokens, apparently: https://www.youtube.com/watch?v=MpVNZA6__3o (timestamp 2:54, and see 5:00 for an example that seems pretty convincing in playback).
The other tags are just speed, oral, etc. Of course they are emotional tags too, but they are often available in other TTS systems, so I didn't mention them. What I mean is that the other emotional tags like anger, sigh, etc. are not available.
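Those oral/break-style tags go into the refine-text prompt rather than into the text itself. A sketch continuing the one above, assuming the dict-style `params_refine_text` from the README's advanced usage (newer versions may wrap this in a params class):

```python
# Sentence-level control: oral/laugh/break strengths are passed as a
# refine-text prompt (README advanced usage), not written into the text.
params_refine_text = {
    'prompt': '[oral_2][laugh_0][break_6]'
}
wavs = chat.infer([text], params_refine_text=params_refine_text)
```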
Yes, and some at least also support a strength modifier. Is there documentation about the forms and configurations of these that you could link to, if this is common? No need to rewrite common work, but it would be great for us to know what the ground rules are.
You can refer to the tokenizer folder in the Hugging Face repo of ChatTTS. There are some JSON files describing all the tokens we use, including the special tokens. But some of these tokens may not work because of the lack of training data. All the working special tokens are given in the example code in the README.
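A sketch of how that folder could be inspected programmatically, assuming the repo id is `2Noise/ChatTTS` and that it ships a standard `tokenizer/tokenizer.json` (adjust the path to whatever the repo actually contains):

```python
import json
from huggingface_hub import hf_hub_download

# Assumption: the HF repo ships a standard tokenizer.json; adjust
# repo_id/filename to match the actual layout of the tokenizer folder.
path = hf_hub_download(repo_id="2Noise/ChatTTS",
                       filename="tokenizer/tokenizer.json")

with open(path, encoding="utf-8") as f:
    tok = json.load(f)

# Special tokens such as [laugh] or [uv_break] appear under "added_tokens".
for entry in tok.get("added_tokens", []):
    if entry.get("special") and entry["content"].startswith("["):
        print(entry["id"], entry["content"])
```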
Thanks, that is an exhaustive list. (It's not in this repo? Pardon the naivety; what is the difference between the HF model and the source here?)
I do still think there's an issue, but at this point I feel like maybe this repo isn't where this belongs; or if it is, I could help. Looking at these tags:
```json
{
  "id": 21137,
  "content": "[uv_break]",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
},
{
  "id": 21142,
  "content": "[laugh]",
  "single_word": false,
  "lstrip": false,
  "rstrip": false,
  "normalized": false,
  "special": true
},
```
These were two I saw in the video. Neither has a description, but in this case I can work out what they are. This JSON had to come from somewhere; is there a way to find out more about the tags so this could be documented? (I just noticed the strength modifiers for laugh are separate entries -- so I have one less question :) )
The tokenizer is part of the ChatTTS model, and it was defined before training. As for the modifiers, they are simply separate series of special tokens: the name, then an underscore, then the number.
All the tokens we use are in that folder, and you can ignore most of them because they are just normal tokens.
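So, mechanically, the modifier families can be recovered from the token list with a simple pattern match. A sketch (the token list below is illustrative; in practice it would come from the "added_tokens" entries above):

```python
import re
from collections import defaultdict

# Illustrative token list; in practice, read it from tokenizer.json.
tokens = ["[laugh]", "[laugh_0]", "[laugh_1]", "[laugh_2]",
          "[oral_2]", "[break_6]", "[uv_break]"]

# Group tokens of the form [name_N] into families: name -> [N, ...].
families = defaultdict(list)
for t in tokens:
    m = re.fullmatch(r"\[([a-z_]+)_(\d+)\]", t)
    if m:
        families[m.group(1)].append(int(m.group(2)))

for name, levels in sorted(families.items()):
    print(name, sorted(levels))
# -> break [6], laugh [0, 1, 2], oral [2]
```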
It seems like the numbers just refer to different samples, not really strength. And multiple params affect each other in some way -- I noticed I was able to get 3 different laughs, at two different positions in my sample, by changing the numbers of oral and break, using those three. It's not clear to me what the calculus is, after playing with it for a bit.
Also, if I may ask, during generation, I notice relatively small maximums:
```
text:  6%|▋         | 24/384(max)   [00:00, 47.12it/s]
code:  9%|▉         | 187/2048(max) [00:02, 66.77it/s]
```
This is just a maximum input token length, right? Is it all being done locally? (Is there a tool to get the number of tokens for my text?)
> It's not clear to me what the calculus is, after playing with it for a bit.

They're just some special prompts to the GPT.

> this is just a maximum input token length, right?

No, they're the max infer iter.

> it is all being done locally?

Of course.

> is there a tool to get the number of tokens for my text?

Basically 1 word per token. You can refer to the code for more details.
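If an exact count is ever needed rather than the one-word-per-token rule of thumb, the tokenizer file itself can be loaded. A sketch, assuming the `tokenizers` library and the `tokenizer.json` discussed above:

```python
from tokenizers import Tokenizer

# Assumption: tokenizer.json was downloaded from the HF repo as above.
tok = Tokenizer.from_file("tokenizer.json")

text = "chat T T S is a text to speech model designed for dialogue applications."
print(len(tok.encode(text).ids))  # approximate count of input tokens
```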
Thank you for all the help. Just one bit to follow up on:
> this is just a maximum input token length, right?
>
> No, they're the max infer iter. ...
>
> is there a tool to get the number of tokens for my text?
>
> Basically 1 word per token. You can refer to the code for more details.
The latter question here relates to the former: I was saying tokens, but apparently the correct term is "infer iter".
When you say iter, do you mean iteration... like raw word count (splitting the line on spaces, and maybe hyphens, etc.)?
An iteration is one inference step of the GPT. If you have tried ChatGPT, you can see that it replies to you word by word. Basically, each time the GPT finishes one iteration, one word (or one token, strictly speaking) comes out.
My question is meant to find out how we can avoid sending too long a string to the model. I am looking for a way to intelligently chunk it, so I need to know how to measure a string and find out whether it will fit.
Just use the word count.
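A minimal chunker along those lines, greedily packing sentences under a word budget (300 here is an arbitrary margin under the 384 text maximum seen above, given the one-word-per-token rule of thumb):

```python
import re

def chunk_text(text, max_words=300):
    """Greedily pack sentences into chunks that stay under max_words.
    A single sentence longer than max_words becomes its own oversized chunk."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for s in sentences:
        n = len(s.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(s)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks

# Each chunk can then be passed to chat.infer() separately.
```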