MTG / WGANSing

Multi-voice singing voice synthesis

Small but important problems with this repo. #19

Open ysig opened 4 years ago

ysig commented 4 years ago

Hi,

I would like to make some minor comments on this repo:

  1. The README doesn't specify what this library is for in prediction mode: what the input is and what the output is. For example, I was curious whether you could give it a new voice and a text and make that voice sing the text, or give it a singing voice and style-transfer it to another singer from his or her speaking voice. It remains unclear to me how customizable it is and what its use case is.
  2. Secondly, your requirements file is missing dependencies such as matplotlib, tqdm and cython (for pyworld), and it also contains version conflicts that need to be resolved by manually installing and downgrading packages (such as tensorboard). In addition, numpy must already be installed for pyworld to compile (a classic problem of Cython compilation - not your fault, of course).
  3. Although config.py and the download links are well set up, the help output of main.py is not that helpful (for example, on how to predict and what to predict: the eval mode conflicts with the wave mode in the arguments, compared with what is in your README). I was also curious what the folder voice_dir = ../ss_synthesis/voice/ should contain.
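As a quick sanity check for point 2, something like the following can confirm that the extra dependencies are importable before running main.py (the package list here is assumed from this thread, not taken from the repo's official requirements):

```python
import importlib

# Packages reported missing from requirements.txt in this thread
# (assumed list; adjust to your environment).
EXTRA_DEPS = ["matplotlib", "tqdm", "cython", "numpy", "pyworld"]

def check_deps(names):
    """Return the subset of package names that fail to import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

if __name__ == "__main__":
    missing = check_deps(EXTRA_DEPS)
    if missing:
        # numpy must be installed before pyworld is built, per point 2.
        print("Install these first (numpy before pyworld):", missing)
    else:
        print("All extra dependencies found.")
```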

Thanks in advance!!

pc2752 commented 4 years ago

Hi, thanks for your interest in our work and for pointing this out :-), I'll update soon.

otezun commented 4 years ago

Hi,

Great project, but the caveats make it pretty hard to use. I've documented my install process here (hopefully completely): https://github.com/otezun/WGANSing-Personal-Install-Notes/blob/main/README.md

Checking the training data, it's the phonemes with their corresponding time stamps, so I assume the .lab file it generates voice from would be formatted as: `<starttime> <endtime> <phoneme>`. Is that assumption correct? An example taken from the training data: `0.000000 7.089788 sil`
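Under that assumption (each .lab line is `<starttime> <endtime> <phoneme>`, whitespace-separated, as in the training-data example above), a minimal parser would look like this; note this is a sketch of the assumed format, not code from the repo:

```python
def parse_lab_line(line):
    """Parse one .lab line of the assumed form '<start> <end> <phoneme>'."""
    start, end, phoneme = line.split()
    return float(start), float(end), phoneme

# Example line quoted from the training data above:
# parse_lab_line("0.000000 7.089788 sil") -> (0.0, 7.089788, 'sil')
```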

I think this would have great potential with better documentation.

pc2752 commented 4 years ago

Hi Otezun and ysig,

Thanks for your interest in our work and for the documentation, it looks great. I'll update the main README with your suggestions.

otezun commented 4 years ago

Thanks for the quick answer @pc2752. I have been able to train for the 950 epochs, taking a bit more than 12 hours on my machine (MSI 1660 Armor OC 6GB, Ryzen 5 3600, 64GB RAM). I used nus_ZHIY_sing_06 and translated it to MPOL. The resulting files can be downloaded from my repo, along with the figure.

Here is another thing: when synthesizing, it vocodes the output to val_dir_synth, but the filenames do not include information about what was vocoded to what. Instead of file names like nus_ZHIY_06.output, a name like nus_ZHIY_06_MPOL.output would be better, as you wouldn't accidentally overwrite previous files you have made of that singer (rather, you'd only overwrite files where you already combined that singer with that particular target).

Next, the README states that it expects a .lab file. This is not true; it actually expects an .hdf5 file from the dataset. I know this because I tested it against the .lab file from the torch_npss repository, which it would not accept.
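The filename suggestion above could be implemented with something like the following; the function and argument names are hypothetical, not from the WGANSing codebase:

```python
import os

def synth_output_path(out_dir, source_file, target_singer, ext=".output"):
    """Build an output filename that records both the source file and the
    target singer, e.g. nus_ZHIY_06 + MPOL -> nus_ZHIY_06_MPOL.output.

    Hypothetical helper: WGANSing itself currently writes only the
    source name, which is what causes the overwriting issue."""
    base = os.path.splitext(os.path.basename(source_file))[0]
    return os.path.join(out_dir, f"{base}_{target_singer}{ext}")
```

This way, synthesizing the same source with two different target singers produces two distinct files instead of silently overwriting the first.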

Kerry0123 commented 3 years ago

Hi, to generate singing voice it expects an .hdf5 file from the dataset, and generating that .hdf5 requires a wave file. Is there a way to do this without wave files?