TensorSpeech / TensorFlowTTS

:stuck_out_tongue_closed_eyes: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2 (supported including English, French, Korean, Chinese, German and Easy to adapt for other languages)
https://tensorspeech.github.io/TensorFlowTTS/
Apache License 2.0
3.8k stars 810 forks

Dataset EOS symbol issue #188

Closed mapledxf closed 4 years ago

mapledxf commented 4 years ago

@dathudeptrai Hey, I have a question.

I can see Korean script has been added. However, the following code is still using the LJSpeech's EOS symbol https://github.com/TensorSpeech/TensorFlowTTS/blob/7ae6e4e1f71a4363461dcecc018a3c4eff7ba409/examples/tacotron2/tacotron_dataset.py#L142

It might be wrong if some modules use a different EOS symbol in their scripts.

dathudeptrai commented 4 years ago

Yeah, I will sort out this confusing issue. KSS for Korean and Baker for Chinese also put eos at the end of their symbol lists, so at least right now there is no problem :D. I will re-design the processor :)))

tekinek commented 4 years ago

Yes, this code can be an issue. In my model, I created a custom processor for my language but didn't know that the code above should be changed accordingly. Luckily I kept the same symbol for eos and used fewer symbols than the LJSpeech processor, so the eos id stayed consistent between preprocessing and inference. The code above would actually cause an error if the number of custom symbols were larger than LJSpeech's (149).

@dathudeptrai we would love to see a clean separation of processors and models, so that one simple configuration works everywhere. Hopefully I can contribute to this project, though it is not that convenient at the moment. Thanks! Mozilla TTS can be a good reference.

dathudeptrai commented 4 years ago

@tekinek yeah, we already know about this issue; let's discuss the problem and redesign the processor class.

dathudeptrai commented 4 years ago

@mapledxf @tekinek we are trying to solve this problem by introducing a base_processor class. With the new abstract class, [eos, bos, unk, pad] will be attributes of the processor class. Please take a look at https://github.com/TensorSpeech/TensorFlowTTS/blob/base_processor/tensorflow_tts/processor/base_processor.py. Let's discuss.
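A minimal sketch of what such an abstract class could look like (illustrative only, not the actual base_processor.py API; the class and method names here are assumptions):

```python
import abc

class BaseProcessor(abc.ABC):
    """Sketch: special tokens become attributes of the processor
    instead of being hard-coded to LJSpeech's symbol list."""

    def __init__(self, symbols, eos="eos", pad="pad", bos=None, unk=None):
        self.symbols = list(symbols)
        # Append special tokens that are not already in the symbol set.
        for tok in (pad, bos, eos, unk):
            if tok is not None and tok not in self.symbols:
                self.symbols.append(tok)
        self.symbol_to_id = {s: i for i, s in enumerate(self.symbols)}
        self.eos_id = self.symbol_to_id[eos]
        self.pad_id = self.symbol_to_id[pad]

    @abc.abstractmethod
    def text_to_sequence(self, text):
        """Language-specific conversion from raw text to symbol ids."""

class DemoProcessor(BaseProcessor):
    def text_to_sequence(self, text):
        ids = [self.symbol_to_id[c] for c in text if c in self.symbol_to_id]
        return ids + [self.eos_id]  # eos id comes from this processor
```

With this shape, the dataset code can ask the processor for its eos id instead of assuming LJSpeech's.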

tekinek commented 4 years ago

@dathudeptrai It is good to have base_processor.py. My concern is about the overall structure of this project, which is still a bit confusing to me. Let me write down what has crossed my mind so far.

When we have a new dataset, we do preprocessing to prepare the training data. This process relies on a dataset-specific data loader and a text utility (let's call it the frontend) to normalize the transcript and convert it to numbers. Once the model is trained, all we need is the frontend that was used during preprocessing. The frontend is language-specific but not dataset-specific; it is only responsible for defining the symbol set of a language and producing unique ids for a given text. Optionally, it uses a cleaner to normalize text.
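The frontend responsibilities described above could be sketched like this (a toy illustration; the class, cleaner, and symbol set here are assumptions, not the project's real frontends):

```python
import re

def basic_cleaner(text):
    """Toy cleaner: lowercase and collapse whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()

class EnglishFrontend:
    """Illustrative frontend: owns the language's symbol set and the
    text -> id conversion; knows nothing about any dataset."""

    def __init__(self, cleaner=basic_cleaner):
        self.symbols = ["pad"] + list("abcdefghijklmnopqrstuvwxyz '") + ["eos"]
        self.symbol_to_id = {s: i for i, s in enumerate(self.symbols)}
        self.cleaner = cleaner

    def text_to_ids(self, text):
        cleaned = self.cleaner(text)
        return [self.symbol_to_id[c] for c in cleaned if c in self.symbol_to_id]
```

The same frontend instance would be shared between preprocessing and inference, which is exactly why it must not live inside dataset-specific code.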

Model definitions, such as those in tensorflow_tts/models, cannot rely on preprocessing and the frontend, but the reference (example) implementations of the models can. (So the vocab_size param of models/tacotron2 and models/fastspeech shouldn't be taken from the frontend. Putting it into a config could also complicate the problem. My suggestion is to fix it at an integer like 300, so that it is unlikely any language needs more (maybe Chinese?))
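The fixed-vocab idea amounts to over-allocating the embedding table so that any frontend whose symbol count fits under the ceiling stays compatible. A trivial sketch (the 300 ceiling is the suggestion above; the function is hypothetical, and zero rows stand in for a learned embedding):

```python
VOCAB_SIZE = 300  # the fixed ceiling suggested above

def build_embedding_table(num_symbols, dim=8, vocab_size=VOCAB_SIZE):
    """Allocate a fixed-size table; only the first num_symbols rows
    are actually used by any given frontend."""
    if num_symbols > vocab_size:
        raise ValueError(
            f"frontend defines {num_symbols} symbols, ceiling is {vocab_size}"
        )
    # Zero-initialised placeholder rows; a real model would learn these.
    return [[0.0] * dim for _ in range(vocab_size)]
```

An LJSpeech-sized frontend (149 symbols) fits comfortably; character-level Chinese is the case that might not.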

From the above perspective, the current project structure is inconsistent: tensorflow_tts is supposed to be the core logic that does not depend on preprocessing and the frontend, yet the preprocessing- and frontend-related functions are currently placed inside it.

Here is my recommendation for restructuring the project:

TensorflowTTS
|─tensorflow_tts                    # Core logic: model definitions, utils ...
|─preprocessing                    # Preprocessing logic
│    |─preprocess.py
│    |─configs                          # Dataset-specific configs for preprocessing
│         |─ljspeech.yaml
│         │─kss.yaml
│    |─dataset_loaders            # Dataset-specific loaders for preprocessing
│         |─base_loader                
│         |─ljspeech_loader.py
│         |─kss_loader.py
│
│─frontend                             # Language-specific frontend and cleaners for preprocessing and inference             
│   │─base_frontend.py
│   │─english_frontend.py
│   │─korean_frontend.py
│   │─cleaners                           
│        │─base_cleaners.py
|        │─korean_cleaners.py  
│          
│─bin
|    │─tacotron2
|    |    │─train.py
|    |    │─synthesize.py
|    |    │─extract_duration.py
|    |    │─configs
|    |         │─tacotron2.v1.yaml
|    │─melgan
|         │─train.py
|         │─synthesize.py
 ...
|─utils                                        # Project-wise global utils for io, audio processing, logging etc.
|─notebooks
|─tests
...

To support a new dataset xxx in a language with an existing frontend, one should do:

To support a new dataset xxx in a new language yyy, one should do:

To connect the frontend and the training/inference logic, we can specify a "frontend" option in the custom model config (e.g. frontend: "english_frontend" in /bin/tacotron2/tacotron2.v1.yaml)

Maybe we need to start a dev branch.

dathudeptrai commented 4 years ago

@machineko what do you think?

machineko commented 4 years ago

As we were discussing before, we need to refactor the repo for better developer usability, starting from processing and dataset creation :P I think the base prepro branch will be the first step :)

dathudeptrai commented 4 years ago

We should make an AutoProcessor class, like AutoConfig and TFAutoModel. When we preprocess the dataset, we should save a pretrained file for AutoProcessor so we can reuse it via AutoProcessor.from_pretrained(path). The symbols can be saved easily; the hard part is how to save text_to_sequence for later reuse. In some cases, users change the text_to_sequence function frequently, so to keep the code unchanged we need to make AutoProcessor reusable :)). @machineko

@tekinek I know your point, from the developer's point of view. But we should make the inference process work via pip install. The processor should be in the main tensorflow_tts so it can be reused. The dataset-specific and preprocessing code can be moved out of tensorflow_tts.

My mistake is that when I created the examples dir, I thought of it as just an example, not something general. So if a user trains his own language, he needs to create another example, and the code in master is just for reference. I want this because I want users to deeply understand the model, the data loader, etc., rather than just git pull and python train.py. I don't want the framework to become a cookbook. For example, if you want to train Tacotron for Korean, you should create examples/tacotron2_korean/... The code in examples/tacotron2/... is just a reference: you should change your symbols, change the logic of the data loader, and change the train.py file for your case :)). The eos symbol shouldn't be a problem in that case. Maybe I need to clarify this and tell users to make their own examples dir rather than reuse the examples code in master.
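The save/reload pattern described above could look roughly like this (an illustrative sketch only, not the real AutoProcessor implementation; it persists just the symbol table, which is the easy part noted above):

```python
import json

class AutoProcessorSketch:
    """Sketch: persist the symbol table so inference can rebuild
    the same text-to-id mapping that preprocessing used."""

    def __init__(self, symbol_to_id):
        self.symbol_to_id = dict(symbol_to_id)

    def save_pretrained(self, path):
        with open(path, "w") as f:
            json.dump({"symbol_to_id": self.symbol_to_id}, f)

    @classmethod
    def from_pretrained(cls, path):
        with open(path) as f:
            return cls(json.load(f)["symbol_to_id"])

    def text_to_sequence(self, text):
        # The symbols round-trip through the file; the *logic* of this
        # function is the part that is hard to persist, as noted above.
        return [self.symbol_to_id[c] for c in text if c in self.symbol_to_id]
```

Serialising the function body itself (for users who customise text_to_sequence) is the open problem this thread keeps circling back to.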

tekinek commented 4 years ago

@dathudeptrai The recently added AutoConfig & TFAutoModel and their placement in tensorflow_tts/inference are nice.

In some cases, users change the text_to_sequence function

why? for multiple languages?

For text_to_sequence reuse, saving the char-to-id map into the config along with the pretrained model may help.

when I created the examples dir I thought of it as just an example, not something general. So if a user trains his own language, he needs to create another example, and the code in master is just for reference. I want this because I want users to deeply understand the model, the data loader, etc., rather than just git pull and python train.py. I don't want the framework to become a cookbook

I think as this project becomes stable, many people hope that an easy pip & train & inference flow works for their languages without too many engineering issues (at least as a first step). What matters is documentation for different levels and use cases. "examples" or "bin" are both fine, since researchers can easily figure out how to start their own framework, but they too need a short path for preprocessing and the frontend so that they can focus more on the model itself.

dathudeptrai commented 4 years ago

why? for multiple languages?

For example in Chinese, there are many ways to represent text: character, phoneme + tone, pinyin, ... or in case you want to train multilingual or code-switching models, ...

For text_to_sequence reuse, saving the char-to-id map into the config along with the pretrained model may help.

Saving the symbols is OK, but how can we save the logic of this function? Still, saving only the symbols is OK for now :D.

I think as this project becomes stable, many people hope that an easy pip & train & inference flow works for their languages without too many engineering issues (at least as a first step). What matters is documentation for different levels and use cases. "examples" or "bin" are both fine, since researchers can easily figure out how to start their own framework, but they too need a short path for preprocessing and the frontend so that they can focus more on the model itself

Okay, we will make it more general. Let's begin with the processor first :(.
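As a toy illustration of why the ids depend on the chosen text representation (the two-entry pinyin table below is a made-up stand-in; a real frontend would use a full lexicon or G2P model):

```python
# Hypothetical two-entry pinyin table, for illustration only.
TOY_PINYIN = {"你": "ni3", "好": "hao3"}

def as_characters(text):
    """Character-level representation: one token per character."""
    return list(text)

def as_pinyin(text):
    """Pinyin-level representation: one token per syllable."""
    return [TOY_PINYIN.get(ch, ch) for ch in text]
```

The same sentence yields entirely different token streams under the two schemes, so a trained model is only usable with the exact frontend (and text_to_sequence logic) it was trained with.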

tekinek commented 4 years ago

@dathudeptrai I see. So your point is how to bind the text_to_sequence function to a pretrained model for inference when the frontend logic is outside tensorflow_tts (a pip restriction? please correct me if I am wrong).

tekinek commented 4 years ago

Besides, I have yet to work on Chinese TTS. If characters are used to represent text, then the model needs an uncertainly large vocab_size. This is also a consideration.

dathudeptrai commented 4 years ago

@dathudeptrai I see. So your point is how to bind the text_to_sequence function to a pretrained model for inference when the frontend logic is outside tensorflow_tts (a pip restriction? please correct me if I am wrong).

Yes. I want the code to look as below:

from tensorflow_tts.inference import AutoProcessor

processor = AutoProcessor.from_pretrained("PATH_SAVED_PROCESSOR")
ids = processor.text_to_sequence("....")

If text_to_sequence is unchanged, saving only the symbols is OK :D. But I think some people will try to train the model with many kinds of text (character, phoneme, phoneme + tone, IPA, pinyin, ...), so if we can ship a text_to_sequence along with the pretrained model, we can completely reproduce it in the inference stage. In conclusion, we need a pretrained processor, a pretrained config, and pretrained model weights to run inference in all cases without worrying about version mismatches :D.

tekinek commented 4 years ago

@dathudeptrai What about creating a separate repo for frontends, where people can find text processors for many languages and choose one for their model? Then your code could be something like this:

!pip install TensorflowTTS-frontend
!pip install TensorflowTTS

from tensorflow_tts.inference import AutoProcessor

processor = AutoProcessor.from_pretrained("PATH_SAVED_PROCESSOR", frontend=None)
ids = processor.text_to_sequence("....")

And for the user:

from tensorflow_tts.inference import AutoProcessor
from TensorflowTTS import  chinese_frontend_piyin

processor = AutoProcessor.from_pretrained("PATH_SAVED_PROCESSOR", frontend=chinese_frontend_piyin)
ids = processor.text_to_sequence("....")

ZDisket commented 4 years ago

@dathudeptrai

But we should make the inference process can do by pip install

With or without pip Tensorflow?

dathudeptrai commented 4 years ago

@ZDisket with pip install TensorFlowTTS.

@tekinek that is also OK; we will try to finish the processor problem this week :D. Then we will move all dataset-related code to the processing directory :D.

manmay-nakhashi commented 4 years ago

@dathudeptrai we can also try something like ids = processor.text_to_sequence("....", language=['en_us', 'chinese', 'ko', ... etc], scheme=[symbols, phoneme], scheme_type=[1, 2, 3, ...]). Under each language we can add different processors and symbol or phoneme schemes. We can also generate the symbol-to-id mapping script for a particular symbol set automatically from the given symbols; generating a phoneme script might be difficult because we need a g2p converter model.
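The (language, scheme) lookup plus auto-generated mapping suggested above could be sketched like this (function and registry names are illustrative, not proposed API):

```python
def make_symbol_to_id(language, scheme, registry):
    """Hypothetical registry lookup: pick the symbol set for a
    (language, scheme) pair and auto-generate the id mapping from it."""
    try:
        symbols = registry[(language, scheme)]
    except KeyError:
        raise ValueError(f"no symbol set registered for {language}/{scheme}")
    return {s: i for i, s in enumerate(symbols)}

# Toy registry; real symbol sets would come from per-language frontends.
REGISTRY = {
    ("en_us", "symbols"): ["pad", "a", "b", "eos"],
    ("ko", "symbols"): ["pad", "x", "y", "eos"],
}
```

The phoneme schemes would need a G2P step in front of this, which is the hard part noted above.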

tekinek commented 4 years ago

@manmay-nakhashi it is nice, I think.

@dathudeptrai By the way, I suggest the following "About" for this repo: TensorFlowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2 (Out of the box support for English and Korean; easily adaptable for others)

mapledxf commented 4 years ago

@dathudeptrai I think it is better to save the symbol table to a local file, and this file should be bound to the model, since when we train a model the symbol mapping might change, e.g. by adding different punctuation. Then in the inference stage, we can look up the ids from the file.

dathudeptrai commented 4 years ago

This problem is solved in #204. We no longer need to explicitly add the eos token at the end; it is automatically added by the text_to_sequence function. We also added an AutoProcessor for all processors, which can load a pretrained processor from a file. Please see our new notebook or colab to see how it works :D. Pretrained files for each processor are uploaded here: pretrained_processor