NVIDIA / mellotron

Mellotron: a multispeaker voice synthesis model based on Tacotron 2 GST that can make a voice emote and sing without emotive or singing training data
BSD 3-Clause "New" or "Revised" License
853 stars 184 forks

Need more info for training and inference #7

Closed AndroYD84 closed 4 years ago

AndroYD84 commented 4 years ago

Hello, thanks for sharing this amazing repo! Could we have more information on how to process our own data for training and inference, please? The inference demo works perfectly, but any attempt to use my own musicxml throws an error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-20-d11c35a85dab> in <module>
----> 1 data = get_data_from_musicxml('data/haendel_hallelujah3.musicxml', 110, convert_stress=True)
      2 panning = {'Soprano': [-60, -30], 'Alto': [-40, -10], 'Tenor': [30, 60], 'Bass': [10, 40]}

C:\mellotron\mellotron_utils.py in get_data_from_musicxml(filepath, bpm, phoneme_durations, convert_stress)
    460             continue
    461 
--> 462         events = track2events(v)
    463         events = adjust_words(events)
    464         events_arpabet = [events2eventsarpabet(e) for e in events]

C:\mellotron\mellotron_utils.py in track2events(track)
    285     events = []
    286     for e in track:
--> 287         events.extend(adjust_event(e))
    288     group_ids = [i for i in range(len(events))
    289                  if events[i][0] in [' '] or events[i][0].isupper()]

C:\mellotron\mellotron_utils.py in adjust_event(event, hop_length, sampling_rate)
    230 
    231 def adjust_event(event, hop_length=256, sampling_rate=22050):
--> 232     tokens, freq, start_time, end_time = event
    233 
    234     if tokens == ' ':

ValueError: not enough values to unpack (expected 4, got 2)
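For context, this unpack failure means the parser produced an event with fewer than the four fields `adjust_event` expects. A minimal illustration (the event values here are made up):

```python
# adjust_event unpacks each event as (tokens, freq, start_time, end_time).
# A track the parser does not fully understand can yield shorter tuples,
# which triggers exactly the ValueError above. Values are illustrative.
event_ok = ('HH', 440.0, 0.0, 0.5)   # phoneme tokens, f0 (Hz), start, end (s)
tokens, freq, start_time, end_time = event_ok   # unpacks fine

event_bad = ('HH', 440.0)            # only 2 of the 4 fields present
try:
    tokens, freq, start_time, end_time = event_bad
except ValueError as err:
    print(err)  # not enough values to unpack (expected 4, got 2)
```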

I confirm that even changing a single letter in the "haendel_hallelujah.musicxml" lyrics (e.g. "jah" into "yah") will throw an error; if I change it back to "jah" it works again, so I doubt it's my text editor's fault or a wrong musicxml format (there are tiny differences in how the text is organized depending on which software exported it). I get this error:

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-19-f648a3b7ff04> in <module>
     18         with torch.no_grad():
     19             mel_outputs, mel_outputs_postnet, gate_outputs, alignments_transfer = tacotron.inference_noattention(
---> 20                 (text_encoded, mel, speaker_id, pitch_contour*frequency_scaling, rhythm))
     21 
     22             audio = denoiser(waveglow.infer(mel_outputs_postnet, sigma=0.8), 0.01)[0, 0]

C:\mellotron\model.py in inference_noattention(self, inputs)
    665 
    666         mel_outputs, gate_outputs, alignments = self.decoder.inference_noattention(
--> 667             encoder_outputs, f0s, attention_map)
    668 
    669         mel_outputs_postnet = self.postnet(mel_outputs)

C:\mellotron\model.py in inference_noattention(self, memory, f0s, attention_map)
    523             attention = attention_map[i]
    524             decoder_input = torch.cat((self.prenet(decoder_input), f0), dim=1)
--> 525             mel_output, gate_output, alignment = self.decode(decoder_input, attention)
    526 
    527             mel_outputs += [mel_output.squeeze(1)]

C:\mellotron\model.py in decode(self, decoder_input, attention_weights)
    382         self.attention_context, self.attention_weights = self.attention_layer(
    383             self.attention_hidden, self.memory, self.processed_memory,
--> 384             attention_weights_cat, self.mask, attention_weights)
    385 
    386         self.attention_weights_cum += self.attention_weights

C:\ProgramData\Anaconda3\envs\ptlast37\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    539             result = self._slow_forward(*input, **kwargs)
    540         else:
--> 541             result = self.forward(*input, **kwargs)
    542         for hook in self._forward_hooks.values():
    543             hook_result = hook(self, input, result)

C:\mellotron\model.py in forward(self, attention_hidden_state, memory, processed_memory, attention_weights_cat, mask, attention_weights)
     84 
     85             attention_weights = F.softmax(alignment, dim=1)
---> 86         attention_context = torch.bmm(attention_weights.unsqueeze(1), memory)
     87         attention_context = attention_context.squeeze(1)
     88 

RuntimeError: invalid argument 6: wrong matrix size at C:/w/1/s/tmp_conda_3.7_104508/conda/conda-bld/pytorch_1572950778684/work/aten/src\THC/generic/THCTensorMathBlas.cu:534
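This bmm failure is a shape mismatch: `torch.bmm` multiplies (B, 1, T) attention weights with (B, T, D) encoder memory, and if the attention map was built for a different text length than the encoder outputs, T disagrees. A minimal reproduction (shapes are illustrative, assuming PyTorch is installed):

```python
import torch

# torch.bmm needs the inner dimensions to agree: (B, 1, T) x (B, T, D).
# Here the alignment covers 5 encoder steps but the memory has 6, so the
# multiply fails just like in the traceback above.
attention = torch.ones(1, 1, 5)   # (batch, 1, encoder_steps)
memory = torch.ones(1, 6, 8)      # (batch, encoder_steps, dim)
try:
    torch.bmm(attention, memory)
except RuntimeError as err:
    print('shape mismatch:', err)
```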

I also tried training with my own audio data. The files are WAV, 22050 Hz, 16-bit mono, 1 to 4 seconds long. I listed my data in "ljs_audiopaths_text_sid_train_filelist.txt" and "ljs_audiopaths_text_sid_val_filelist.txt", with each line formatted like this: data/speaker/audiofile1.wav|hello world|0

I used this command: python train.py --output_directory=outdir --log_directory=logdir -c models/mellotron_libritts.pt --warm_start

But it throws this error:

Traceback (most recent call last):
  File "train.py", line 297, in <module>
    args.warm_start, args.n_gpus, args.rank, args.group_name, hparams)
  File "train.py", line 187, in train
    train_loader, valset, collate_fn, train_sampler = prepare_dataloaders(hparams)
  File "train.py", line 44, in prepare_dataloaders
    trainset = TextMelLoader(hparams.training_files, hparams)
  File "C:\mellotron\data_utils.py", line 45, in __init__
    self.speaker_ids = self.create_speaker_lookup_table(self.audiopaths_and_text)
  File "C:\mellotron\data_utils.py", line 52, in create_speaker_lookup_table
    d = {int(speaker_ids[i]): i for i in range(len(speaker_ids))}
  File "C:\mellotron\data_utils.py", line 52, in <dictcomp>
    d = {int(speaker_ids[i]): i for i in range(len(speaker_ids))}
ValueError: invalid literal for int() with base 10: ''

Any information how to solve this is much appreciated, thanks!

rafaelvalle commented 4 years ago

@AndroYD84 I'll address the first issue here. Please create another issue for the training problem so that we can address it there.

The musicxml parser we provide is a basic starting point for parsing musicxml files. The requirements are:

  1. All characters must be in [a-zA-Z].
  2. Each word must start with an upper-case letter.
  3. Every word must exist in the arpabet dictionary.

You're likely violating [3.] by changing letters of the word Hallelujah.
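These three requirements can be verified with a short script (a sketch; the dictionary set below is a stand-in for the repo's real cmu_dictionary file):

```python
import re

def check_word(word, cmu_words):
    """Check one lyric word against the three requirements listed above.

    cmu_words is a set of upper-cased dictionary entries (a stand-in here
    for the real arpabet dictionary shipped with the repo).
    """
    if not re.fullmatch('[a-zA-Z]+', word):
        return 'contains characters outside [a-zA-Z]'
    if not word[0].isupper():
        return 'does not start with an upper-case letter'
    if word.upper() not in cmu_words:
        return 'missing from the arpabet dictionary'
    return 'ok'

cmu_words = {'HALLELUJAH', 'SYSTEMATIC'}  # stand-in dictionary
print(check_word('Hallelujah', cmu_words))  # ok
print(check_word('Halleluyah', cmu_words))  # missing from the arpabet dictionary
```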

AndroYD84 commented 4 years ago

Thanks for the quick reply! I made a copy of "haendel_hallelujah.musicxml" and changed only the word "hallelujah" to "systematic" (this word appears in the arpabet dictionary and in the "cmu_dictionary" file from this repo), and I'm getting this error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-ca64c0de9770> in <module>
----> 1 data = get_data_from_musicxml('data/haendel_systematic.musicxml', 149, convert_stress=True)
      2 panning = {'Soprano': [-60, -30], 'Alto': [-40, -10], 'Tenor': [30, 60], 'Bass': [10, 40]}

C:\mellotron\mellotron_utils.py in get_data_from_musicxml(filepath, bpm, phoneme_durations, convert_stress)
    475         f0s = event2f0(events_arpabet)
    476         alignment, f0s = remove_excess_frames(alignment, f0s)
--> 477         text_encoded, text_clean = event2text(events_arpabet, convert_stress)
    478 
    479         # convert data to torch

C:\mellotron\mellotron_utils.py in event2text(events, convert_stress, cmudict)
    438         text_clean = re.sub('[0-9]', '1', text_clean)
    439 
--> 440     text_encoded = text_to_sequence(text_clean, [], cmudict)
    441     return text_encoded, text_clean
    442 

C:\mellotron\text\__init__.py in text_to_sequence(text, cleaner_names, dictionary)
     44       clean_text = _clean_text(text, cleaner_names)
     45       if cmudict is not None:
---> 46         clean_text = [get_arpabet(w, dictionary) for w in clean_text.split(" ")]
     47         for i in range(len(clean_text)):
     48             t = clean_text[i]

C:\mellotron\text\__init__.py in <listcomp>(.0)
     44       clean_text = _clean_text(text, cleaner_names)
     45       if cmudict is not None:
---> 46         clean_text = [get_arpabet(w, dictionary) for w in clean_text.split(" ")]
     47         for i in range(len(clean_text)):
     48             t = clean_text[i]

C:\mellotron\text\__init__.py in get_arpabet(word, dictionary)
     14 
     15 def get_arpabet(word, dictionary):
---> 16   word_arpabet = dictionary.lookup(word)
     17   if word_arpabet is not None:
     18     return "{" + word_arpabet[0] + "}"

AttributeError: 'NoneType' object has no attribute 'lookup'

Then I repeated the procedure, changing "systematic" back to "hallelujah", and it works again. I'm really confused now. For reference, here are the files I used: haendel_hallelujah.musicxml haendel_systematic.musicxml. I compared them side by side and don't see anything out of place.
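Incidentally, the AttributeError above means `dictionary` was None by the time `get_arpabet` ran, i.e. the CMU dictionary was never loaded. A defensive variant (a sketch under that assumption, not the repo's actual fix) would fall back to the raw word:

```python
# Sketch of a None-safe lookup. If no dictionary is loaded, pass the word
# through unchanged instead of crashing on dictionary.lookup(word).
def get_arpabet(word, dictionary):
    if dictionary is None:
        return word                         # no dictionary loaded
    word_arpabet = dictionary.lookup(word)
    if word_arpabet is not None:
        return '{' + word_arpabet[0] + '}'
    return word                             # word not in the dictionary

print(get_arpabet('Hello', None))  # Hello
```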

rafaelvalle commented 4 years ago

Pull from master and try again. The musicxml converter is a simple prototype to get people started, and our community will improve it.

For your ValueError while training, it is likely that you have a line without a speaker id.
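Such lines can be caught before training with a quick pre-flight check (a hypothetical helper, not part of the repo), since the loader calls int() on the third |-separated field of each filelist line:

```python
# Each Mellotron filelist line must look like: path|text|speaker_id.
# create_speaker_lookup_table calls int() on the third field, so an empty
# or non-numeric speaker id produces the ValueError shown above.
def validate_filelist(lines):
    """Return the 1-based numbers of malformed filelist lines."""
    bad = []
    for n, line in enumerate(lines, 1):
        parts = line.rstrip('\n').split('|')
        if len(parts) != 3 or not parts[2].strip().isdigit():
            bad.append(n)
    return bad

lines = [
    'data/speaker/audiofile1.wav|hello world|0',
    'data/speaker/audiofile2.wav||',   # missing text and speaker id
]
print(validate_filelist(lines))  # [2]
```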

AndroYD84 commented 4 years ago

I pulled from master and the same musicxml that wouldn't play before (with only the word "hallelujah" switched to "systematic") is working now!

About the training, you're absolutely correct: there was a problem in my filelist. It turns out that when I generated the text from audio using STT, I didn't account for segments consisting of more than 4 seconds of pure silence, so two lines had missing information. Now I can start training without problems (so far).

I managed to generate audio from a custom musicxml too, but it took plenty of trial and error. It refused to work despite looking flawless; it turned out there was a single note without lyrics attached, and after I removed that note it finally worked. The converter doesn't explicitly point out which part of the data or which word is throwing an error, which makes it difficult to narrow down the possible causes (it could be anything: a typo, a note, a special character, a single lyricless note in a sea of notes, something that shouldn't be there, etc.). Some musicxmls still throw errors that I can't diagnose at all even after checking them carefully; some problems are elusive at best.
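The lyricless-note case described above can be hunted down mechanically. A sketch using the standard MusicXML tag names (rests are skipped, since they legitimately carry no lyric):

```python
import xml.etree.ElementTree as ET

def notes_without_lyrics(root):
    """Return indices of pitched <note> elements with no <lyric> child."""
    missing = []
    for i, note in enumerate(root.iter('note')):
        if note.find('rest') is None and note.find('lyric') is None:
            missing.append(i)
    return missing

# Tiny inline example: note 1 has a pitch but no lyric, note 2 is a rest.
xml = """<score><part><measure>
  <note><pitch/><lyric><text>Ha</text></lyric></note>
  <note><pitch/></note>
  <note><rest/></note>
</measure></part></score>"""
print(notes_without_lyrics(ET.fromstring(xml)))  # [1]
```

For a real score, `notes_without_lyrics(ET.parse(path).getroot())` points at the offending notes before the converter chokes on them.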

rafaelvalle commented 4 years ago

Great that you were able to get it working from a custom musicxml too. Please add a pull request if you make improvements to the musicxml parser.

camjac251 commented 4 years ago

Are you still able to run this on Windows with Anaconda? I've been trying to get it to work on Windows 10 and have been facing many issues. I finally got it to work, but with iterations taking 8 seconds each. Have you tried using tensorflow-gpu instead of the CPU tensorflow? Does that help with speed?