Rayhane-mamah / Tacotron-2

DeepMind's Tacotron-2 Tensorflow implementation
MIT License

I decided to abandon this framework for the time being #226

Closed · ErfolgreichCharismatisch closed this issue 5 years ago

ErfolgreichCharismatisch commented 5 years ago

Reasons

I am disappointed, honestly.

tugstugi commented 5 years ago

Could you share your dataset? Maybe something is wrong with it?

ErfolgreichCharismatisch commented 5 years ago

I cannot share it, because it is not open source. What I did was

  1. to cut high quality audio at silences longer than X ms,
  2. to then have Google speech recognition convert it to text snippets (Filename|transcribed text|copy of transcribed text),
  3. to then correct those text snippets manually.
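
For what it's worth, step 1 and the metadata format looked roughly like this (an untested sketch using pydub; the file names and thresholds are placeholders, not my actual script):

```python
# Sketch: split a long recording at silences and write LJSpeech-style metadata.
# Assumes pydub (with ffmpeg) is installed; paths and thresholds are placeholders.
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

os.makedirs("wavs", exist_ok=True)
audio = AudioSegment.from_wav("session.wav")
chunks = split_on_silence(
    audio,
    min_silence_len=700,              # the "X ms" of silence that triggers a cut
    silence_thresh=audio.dBFS - 16,   # 16 dB below average loudness counts as silence
    keep_silence=200,                 # keep a short pause at each edge
)

with open("metadata.csv", "w", encoding="utf-8") as meta:
    for i, chunk in enumerate(chunks):
        name = f"clip-{i:04d}"
        chunk.export(f"wavs/{name}.wav", format="wav")
        text = "TRANSCRIBE ME"        # filled in by speech recognition, then corrected by hand
        meta.write(f"{name}|{text}|{text}\n")   # Filename|transcribed text|copy of transcribed text
```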

It would help tremendously to have a good tutorial about how to create good input files, so that they are not a source of error.

tugstugi commented 5 years ago

How long is your dataset? I made myself a 5-hour Mongolian dataset and trained successfully. The only things I had to change were to lower fmin and update the vocabulary for Mongolian. I also resampled the audio files to 22050 Hz to keep them compatible with LJSpeech.
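
In code, the changes were roughly this (a sketch from memory; double-check the parameter names against hparams.py in your checkout, and the paths are placeholders):

```python
# Sketch: resample an existing corpus to 22050 Hz so it matches the LJSpeech settings.
import glob
import librosa
import soundfile as sf

for path in glob.glob("my_corpus/wavs/*.wav"):
    wav, _ = librosa.load(path, sr=22050)   # resample on load
    sf.write(path, wav, 22050)              # overwrite with the resampled audio

# In hparams.py (names as I remember them; verify against your copy):
#   sample_rate = 22050   # keep compatible with LJSpeech
#   fmin = 25             # lowered for a lower-pitched voice; pick per speaker
# and extend the symbol set used by the text processing (tacotron/utils/symbols.py,
# if I recall correctly) with the characters of your language.
```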

ErfolgreichCharismatisch commented 5 years ago

About 2.2 hours. Yet several people claimed to have had success with datasets even below 1 hour.

tugstugi commented 5 years ago

Could you share at least a few audio samples from your dataset?

ErfolgreichCharismatisch commented 5 years ago

What are you aiming at in those files?

tugstugi commented 5 years ago

Maybe look for obvious errors? OK I give up :)

ErfolgreichCharismatisch commented 5 years ago

Which errors are you talking about? Not only can I not share them for legal reasons, it also wouldn't help anyone else. It would help tremendously to have a good tutorial, for all kinds of languages and input sources, about how to create good input files so that they are not a source of error.

Rayhane-mamah commented 5 years ago

Hi, first off, thanks @tugstugi for your assistance, much appreciated :)

@ErfolgreichCharismatisch I am sorry you feel that way. Let me just correct a few misunderstandings here and there:

At the end of this long boring comment, I will simply give my quick notes that you may find helpful:

Thanks for trying our work, we hope to have some positive feedback from your end in the future!

ErfolgreichCharismatisch commented 5 years ago

Dear @Rayhane-mamah, put yourself in my shoes, or those of any other beginner. What kind of tutorial would help you get started with your own data? PS: Never use the phrase "I am sorry you feel that way".

Thien223 commented 5 years ago

@ErfolgreichCharismatisch Dear friend,

I think this is a good one to start with.

You are going too fast. Why didn't you use his data first, try to understand it, and then use yours?

Like you, I'm a newbie. I started by getting his code and making it run. At the beginning, it did not run for some reason (even though I had changed nothing). I searched for the problems (they were problems with my machine, such as not having a GPU, or some packages not being installed correctly...).

Once it ran, I tried changing things, a little at a time, to see the difference and understand what each part of the code is used for.

Now I can run the project even with Korean (and only 30 minutes of training data; I'm trying to reduce that further).

Be patient, friend. You can do it. There is no problem with the code (I have not checked the updated version, though).

Hayes515 commented 5 years ago

Hi @Rayhane-mamah, I used the LJ dataset and did not modify the value of tacotron_batch_size. Training the Tacotron model was OK, but when I went on to train the WaveNet model, I got an OOM error. After decreasing wavenet_batch_size to 2, the OOM error disappeared. I am not sure whether changing wavenet_batch_size has other bad effects? The Tacotron model will take one more day to finish training; I will start training the WaveNet model tomorrow afternoon.

ErfolgreichCharismatisch commented 5 years ago

@Hayes515 Please create your own thread.

ErfolgreichCharismatisch commented 5 years ago

@tdplaza Great that it works for you. How do you go about creating a new corpus? Please be detailed.

tacobeer commented 5 years ago

Testing this model can be challenging at first. I did that, too.

But when it comes to speech synthesis, this framework is the best place on GitHub to learn about it, and I know of no one else who would make a repository of this level non-commercial.

Thien223 commented 5 years ago

After getting the code to run fine, I realized that to apply it to my own corpus, I have to prepare the metadata.csv file properly and write a module to preprocess the text.

The metadata.csv file has two pieces of information: the wav file name and the text.

See how the program gets and processes this info in the build_from_path function.

My dataset has a transcript in a different format. So instead of changing the transcript format to match metadata.csv, I changed the function to read my transcript.txt and return exactly what the function wants (the text, the wav path, and the index).
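
For example, the replacement reader was something along these lines (a rough sketch, not my exact code; the transcript.txt layout here is just an example):

```python
# Sketch: read a transcript in a custom format and return what the preprocessing
# expects per utterance (index, wav path, text). The "|"-separated layout is only
# an example; adapt the split to your own transcript.txt.
import os

def load_custom_transcript(transcript_path, wav_dir):
    items = []
    with open(transcript_path, encoding="utf-8") as f:
        for index, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            wav_name, text = line.split("|", maxsplit=1)
            wav_path = os.path.join(wav_dir, wav_name + ".wav")
            items.append((index, wav_path, text))
    return items

# Each (index, wav_path, text) tuple then goes through the same per-utterance
# processing that build_from_path applies to the LJSpeech metadata.csv rows.
```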

As for text processing, I realized that the processing module has one task: transform the input text into a sequence array (the inverse function, sequence to text, is not used for training; it is used for logging, so you can bypass it). Before transforming, english_cleaners converts numbers, special characters, monetary amounts... into text. So I had to find a module that transforms numbers into text and handles currency and special characters, and integrate it into another module that transforms Korean text into sequences.
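
In outline it was something like this (a sketch of the idea only; korean_cleaners, the tiny symbol table, and the num2words call are illustrations, not the repo's code):

```python
# Sketch: expand digits to words before mapping Korean text to symbol ids.
# num2words is one library that can do the number expansion; the symbol list
# below is a stand-in for whatever your symbols.py actually defines.
import re
from num2words import num2words

_symbols = ["_", "~", " "] + list("가나다라마바사아자차카타파하")  # placeholder vocabulary
_symbol_to_id = {s: i for i, s in enumerate(_symbols)}

def korean_cleaners(text):
    # Replace each run of digits with its spelled-out Korean form.
    return re.sub(r"\d+", lambda m: num2words(int(m.group()), lang="ko"), text)

def text_to_sequence(text):
    text = korean_cleaners(text)
    # Unknown characters are silently skipped here; a real cleaner would extend
    # the symbol set (or decompose Hangul syllables) instead of dropping them.
    return [_symbol_to_id[ch] for ch in text if ch in _symbol_to_id]
```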

When the code did not run fine, I had to find where the problem was. Using the melspectrogram function to convert a wav to a mel, and inv_mel_spectrogram to convert a mel back to a wav, I applied them directly to the mels generated by the preprocessing step (the .npy files) to test whether preprocessing had run well.

I also used these functions directly to convert a wav file to an audio array and the audio array to a mel array, then converted the mel back to a wav file to check that they work properly.
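
Roughly like this (a sketch assuming the helper names in datasets/audio.py — load_wav, save_wav, melspectrogram, inv_mel_spectrogram — and placeholder file paths; check both against your checkout):

```python
# Sketch: round-trip one utterance through the preprocessing functions to verify
# that wav -> mel -> wav still sounds reasonable before training anything.
import numpy as np
from datasets import audio
from hparams import hparams

wav = audio.load_wav("wavs/sample.wav", hparams.sample_rate)
mel = audio.melspectrogram(wav, hparams)            # same call preprocessing uses
print("mel shape:", mel.shape)

recon = audio.inv_mel_spectrogram(mel, hparams)     # Griffin-Lim reconstruction
audio.save_wav(recon, "sample_reconstructed.wav", hparams.sample_rate)

# The same check works on an already-preprocessed .npy file; transpose if your
# mels are stored as (frames, channels):
mel_from_disk = np.load("training_data/mels/mel-sample.npy")
recon2 = audio.inv_mel_spectrogram(mel_from_disk.T, hparams)
audio.save_wav(recon2, "sample_from_npy.wav", hparams.sample_rate)
```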

These are some simple tricks; there are a lot of things you can do to debug, but you have to do it yourself. For example, if your GPU has less memory, you have to calculate how many batches it can process at once: check the shape of the mel spectrograms and multiply by the number of batches and the number of samples in one batch. Say:

[80, 1200] x 32 x 48, where

80: mel channels, 1200: mel frames, 32: number of batches, 48: number of samples per batch. A mel spectrogram has a float32 or int32 data type, so you can calculate how much memory one number holds, and from that how much memory all the batches hold. One float32 occupies 4 bytes, so the mel above occupies 4 bytes x 96,000 ≈ 0.384 MB. To process the batches above you therefore need about 589.824 MB. <~~ This is the way I would do it if I were you.
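
The same back-of-the-envelope calculation in code (plain arithmetic, nothing repo-specific):

```python
# Memory estimate for the example above: (80 x 1200) float32 mels,
# 32 batches of 48 samples each.
mel_channels, mel_frames = 80, 1200
batches, samples_per_batch = 32, 48
bytes_per_float32 = 4

one_mel_mb = mel_channels * mel_frames * bytes_per_float32 / 1e6
total_mb = one_mel_mb * batches * samples_per_batch
print(f"one mel: {one_mel_mb:.3f} MB, all batches: {total_mb:.3f} MB")
# -> one mel: 0.384 MB, all batches: 589.824 MB
```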

From here, everything is easy: you can decrease the number of mel channels, drop utterances with many mel frames, decrease the batch size, or cut down the number of samples in one batch. Try changing them slowly, see how it affects mel quality, and choose what is best for your machine.

ErfolgreichCharismatisch commented 5 years ago

This is awesome @tdplaza. Do you want to add anything, @Rayhane-mamah?

gsoul commented 5 years ago

@ErfolgreichCharismatisch With all due respect, it seems to me that you confuse enterprise-level support with a hobby-level open-source project that is free of charge. None of the participants here owe you anything. So I think it'd be highly appreciated if you could change your tone to a more polite one, and show your appreciation for all the hard work that was put into making this repository a reality.

Alternatively, you could pay Rayhane-mamah for a consultation, if he has the time, where he'll be able to answer any of your questions. I'm not aware of Rayhane-mamah's rates, but for enterprise-level support, especially in ML, I think 200-500 USD/hr is not something extraordinary.

Please let us know, if you prefer the consultation, so that community wouldn't spend more time here, as you'll get all the needed answers in private consultation.

m-toman commented 5 years ago

I agree with @gsoul - I've been really impressed by how many hours Rayhane-mamah has put into this. There were Saturdays where my mail account was just flooded by the repo notifications, where he had seemingly been answering issues for 5+ hours straight. That's unpaid time that could just as well be spent with the family, earning money, or developing :) instead of answering.

Generally we are now in the luxurious situation that deep learning enthusiasts rush into the field and produce lots of open source material. When I did my doctorate in speech synthesis, there was more or less HTS, Festival and MaryTTS to choose from - with a much steeper learning curve (it's crazy how many hundred thousand lines of C++ and Scheme code Tacotron replaces). To dig deeper into TTS in general, I can recommend the page of Simon King (http://www.speech.zone/) or "Text-to-Speech Synthesis" by Paul Taylor.

ErfolgreichCharismatisch commented 5 years ago

@gsoul, with all due respect, it seems to me that you confuse asking for a tutorial with demanding the documentation of a paid product. I don't owe you anything either. So why don't you just watch your tone and actually contribute something useful like @tdplaza? I cannot honestly show any appreciation for something that doesn't work for me. Also, where is your appreciation? You came here just to lecture me, unsolicited, whereas you didn't do anything to deserve that position in the first place - nobody does.

We both know that Rayhane-mamah did not publish this only because he is such a great guy. He wants to put this on his résumé. And it would work far better for him if he made getting started simpler: more forks, more exposure, more job offers.

Also how would private consultation benefit anyone else?

ErfolgreichCharismatisch commented 5 years ago

@m-toman I am pretty sure this is a great framework, but if I cannot make it work and people don't puzzle together a tutorial, I couldn't care less about how much work he put in. And if you were honest, you would say the same.

Precisely because he has so often supported people, probably with the same answers, there is an even bigger incentive to expand the wiki and point beginners to it instead of repeating himself.

Again, why don't you - having a PhD in the field - expand the wiki with how you made this framework work?

http://www.speech.zone is quite impressive, actually.

m-toman commented 5 years ago

I'm not involved with this framework except I fixed a small bug. So I don't see why exactly my free time (to write the thing) should be worth less than yours (to figure things out)? Considering that I could (and actually do) get money for doing work on speech synthesis instead... or just go and play with my daughter - who is more charming in asking for my time than you ;).

Generally I haven't worked much more with it than running the default LJ training (which more or less just worked as described).

ErfolgreichCharismatisch commented 5 years ago

@m-toman It is not about balancing each other's effort and time invested. I wouldn't mind you only playing with your daughter and not showing up here again to pester those who actually care about this project enough to help beginners.

Rayhane-mamah commented 5 years ago

Alright this has been going for long enough.

Dear @ErfolgreichCharismatisch.

We happily accept all sorts of criticism as long as it's constructive and is delivered in a polite manner; I personally encourage such feedback as it helps me improve my work, and with it other works based on it. However, we do NOT tolerate any form of disrespect, something you have shown on multiple occasions. Community is one of the most important aspects of open source, and it would be a shame for a bad actor to ruin this experience for the entire group. I am thus revoking your access to commenting or opening any further issues on this repository.

As stated earlier, your remarks will most certainly help improve this project, and we will make sure to make our work easier for others to use. While I believe your intentions are genuinely good, your execution seems to be the worst. To make sure I am not being unfair (and because feedback is usually beneficial to all of us), here are the remarks on your attitude that led me to take such drastic measures:

Please also keep in mind that most OS projects you will find out there are not 100% what you're looking for, and that you will need to make your own modifications (which translates to time...) to make them suit your needs. Please also do not expect continuous support from contributors, as most of them are doing hobby projects on the side because they are passionate about what they do and want to share this passion with others.

With that said, if you do not like our work, you are still free to use others' works; no one is forcing you to use ours, I believe? In fact, here are some awesome other contributions that you can use:

I apologize to anyone offended by this "issue" and thank you @tdplaza @m-toman @gsoul @piligram for your assistance and contributions!

@ErfolgreichCharismatisch Please avoid having a negative attitude in others' repos as it is no fun for anyone. Thanks for your understanding, sorry it had to come down to this.


Other than that,

@Hayes515 A batch size of 2 with WaveNet is usually not a big issue; it will just take longer to converge, but I believe I stabilized the gradients as best as possible to allow the model to hit a proper minimum. Of course, if you do face problems with it, please open an issue and we'll look into it. Thanks for reaching out! :)