DigitalPhonetics / IMS-Toucan

Controllable and fast Text-to-Speech for over 7000 languages!
Apache License 2.0

Magic number 66 #14

Closed michael-conrad closed 2 years ago

michael-conrad commented 2 years ago

I was testing adding tones and lengths as features and discovered that there is a hard-coded check for '66' features. The number appears to be hard-coded without explanation in multiple locations.

Would it be safe to assume that any '66' is for the feature count checking only?

Flux9665 commented 2 years ago

Correct, there are 42 features in our own lookup and then 24 more that we get from a resource called Panphon, which also encodes phonemes as vectors. For some symbols, this other resource behaved strangely, so I added those checks to figure out why the Panphon vectors sometimes have different numbers of dimensions. When I rework the feature vectors for tone and lengthening, I'm probably going to remove the Panphon vectors completely for simplicity.
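For readers following along, the arithmetic behind the 66 (42 + 24) and the kind of dimension check described above could be sketched as follows. This is a minimal illustration, not the project's code: the constant names and the `check_feature_vector` helper are hypothetical, and only the counts come from the comment.

```python
# Hypothetical names; only the counts (42 and 24) come from the thread.
N_OWN_FEATURES = 42      # features from the project's own lookup
N_PANPHON_FEATURES = 24  # additional features taken from Panphon
N_FEATURES = N_OWN_FEATURES + N_PANPHON_FEATURES  # the "magic number" 66


def check_feature_vector(vector, expected_dim=N_FEATURES):
    """Return the vector unchanged, or raise if its length is unexpected."""
    if len(vector) != expected_dim:
        raise ValueError(
            f"expected {expected_dim} features, got {len(vector)}"
        )
    return vector


# A vector of the right length passes through unchanged.
ok = check_feature_vector([0] * N_FEATURES)
```

Naming the two parts separately makes it easier to see what changes if the Panphon portion is removed later.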

michael-conrad commented 2 years ago

I've started using a constant for these, and pass that constant to the models as the matching dim parameter.

Is that correct?

Flux9665 commented 2 years ago

You mean the number of dimensions in a feature vector is constant? If so, then yes. If you use different feature vectors with a different number of dimensions, then you can just replace all of those with the new expected number of dimensions.
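One way to avoid touching every occurrence when the vectors change is to derive the dimension from the lookup itself and pass it on. The sketch below assumes a simple symbol-to-vector table; `FEATURE_TABLE`, `infer_feature_dim`, and `ToyEncoder` are hypothetical stand-ins, not names from the repository.

```python
# Hypothetical lookup table: symbol -> feature vector (toy values).
FEATURE_TABLE = {
    "a": [1, 0, 0, 1],
    "b": [0, 1, 0, 0],
    "c": [0, 0, 1, 1],
}


def infer_feature_dim(table):
    """Return the shared vector length, or raise if the lengths disagree."""
    dims = {len(vector) for vector in table.values()}
    if len(dims) != 1:
        raise ValueError(f"inconsistent feature vector lengths: {sorted(dims)}")
    return dims.pop()


FEATURE_DIM = infer_feature_dim(FEATURE_TABLE)


class ToyEncoder:
    """Stand-in for a model that takes the input dimension as a parameter."""

    def __init__(self, input_dim):
        self.input_dim = input_dim

    def accepts(self, vector):
        # A real model would fail on a shape mismatch; here we only compare.
        return len(vector) == self.input_dim


# The model is built from the derived constant, not a literal number.
encoder = ToyEncoder(input_dim=FEATURE_DIM)
```

With this layout, swapping in vectors of a different length changes `FEATURE_DIM` in one place.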

michael-conrad commented 2 years ago

At some point I want to test converting the transcript text directly to byte values. Would simply setting up the feature entries as {"symbol_type": "byte", "b0": 0/1, "b1": 0/1, ...}, as a direct conversion from bytes, work?
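The byte-to-feature mapping proposed here might look roughly like the sketch below. It assumes UTF-8 encoding and that `b0` is the least significant bit; neither is specified in the thread, and the function names are hypothetical.

```python
def byte_to_features(value):
    """Map one byte (0-255) to a feature entry with one key per bit."""
    if not 0 <= value <= 255:
        raise ValueError(f"not a byte value: {value}")
    features = {"symbol_type": "byte"}
    for i in range(8):
        # Assumption: b0 is the least significant bit.
        features[f"b{i}"] = (value >> i) & 1
    return features


def text_to_byte_features(text, encoding="utf-8"):
    """Encode the text (UTF-8 assumed) and build one entry per byte."""
    return [byte_to_features(b) for b in text.encode(encoding)]


entries = text_to_byte_features("é")  # "é" is two bytes in UTF-8
```

Note that a non-ASCII character yields several entries, so the sequence length is the byte count rather than the character count.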

Flux9665 commented 2 years ago

I don't see a reason why it wouldn't work, although I think there could be an easier way: skip the articulatory vector pipeline entirely and convert the sequence of bytes directly to a sequence of LongTensors.
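A minimal sketch of that simpler route, assuming UTF-8 encoding and PyTorch for the tensor step, could look like this. The helper name is hypothetical, and the import is guarded only so the snippet also runs where PyTorch is not installed.

```python
def text_to_byte_ids(text, encoding="utf-8"):
    """Return the text as a list of integer byte values in the range 0-255."""
    return list(text.encode(encoding))


ids = text_to_byte_ids("hi")  # [104, 105]

try:
    import torch

    # One integer ID per byte, stored as a 1-D tensor of dtype torch.long.
    id_tensor = torch.tensor(ids, dtype=torch.long)
except ImportError:
    id_tensor = None  # PyTorch not available; the ID list is still usable
```

Integer IDs like these would typically feed an embedding layer with 256 entries (for example `torch.nn.Embedding(256, dim)`) in place of the articulatory feature vectors.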