IAHispano / Applio

A simple, high-quality voice conversion tool focused on ease of use and performance
https://applio.org
MIT License

[Feature]: Add audio out of dataset to audio section in TensorBoard #878

Open BornSaint opened 2 days ago

BornSaint commented 2 days ago

Description

During training, the script picks one audio clip from the dataset to log to TensorBoard each epoch. Because that clip has the same features as the data the model is trained on, it is hard to judge whether training has gone far enough. I can still tell from the loss graph when the model starts to overfit, but hearing audio from outside the dataset would help me decide when to stop, for example when I can't train for long and the quality is already acceptable.

Problem

already in description

Proposed Solution

Add an option to the CLI script to pick the audio, something like --tensorboard-audio "/path/to/audio/file". For the GUI, a Gradio element to pick the audio file would be enough.
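A minimal sketch of how such an option could be parsed, assuming a plain argparse-based entry point. The flag name --tensorboard-audio comes from the proposal above; the parse_args helper and the .wav check are hypothetical and are not part of Applio's actual CLI.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse a hypothetical --tensorboard-audio option (not Applio's real CLI)."""
    parser = argparse.ArgumentParser(description="Training options (illustrative only)")
    parser.add_argument(
        "--tensorboard-audio",
        type=Path,
        default=None,
        help="Optional external audio file to log to TensorBoard instead of a dataset sample",
    )
    args = parser.parse_args(argv)
    # Restricting the option to .wav files is an assumption made for this sketch.
    if args.tensorboard_audio is not None and args.tensorboard_audio.suffix.lower() != ".wav":
        parser.error("--tensorboard-audio must point to a .wav file")
    return args
```

When the flag is omitted, args.tensorboard_audio is None, so the caller could fall back to the existing behavior of picking a sample from the dataset.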

Alternatives Considered

Not exactly an alternative, but it would be great to have an auto-stop for training when the values stop changing within a range. For example, --auto-stop 10 would stop training if the model does not improve over the next 10 epochs, and would reset the count whenever it does improve.
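The counting logic described here can be sketched as follows. This is only an illustration of the idea, assuming a lower loss means a better model; the EarlyStopper class and its parameters are hypothetical names and do not describe how Applio implements its own stopping feature.

```python
class EarlyStopper:
    """Stop after `patience` consecutive epochs without improvement (lower loss is better)."""

    def __init__(self, patience: int, min_delta: float = 0.0):
        self.patience = patience      # e.g. 10 for "--auto-stop 10"
        self.min_delta = min_delta    # smallest change that counts as an improvement
        self.best = float("inf")
        self.stale_epochs = 0

    def update(self, loss: float) -> bool:
        """Record one epoch's loss and return True when training should stop."""
        if loss < self.best - self.min_delta:
            # The model improved: remember the new best and reset the count.
            self.best = loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
        return self.stale_epochs >= self.patience
```

A training loop would call update() once per epoch with the monitored loss and break out of the loop when it returns True.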

BornSaint commented 2 days ago

My alternative is actually already implemented.

BornSaint commented 2 days ago

I guess this commit changes the TensorBoard audio from a random sample to the first audio in the dataset for evaluation, but that still compromises the reference, as I said in my comment on the commit page:

Is the first sample not used in training? Using the same audio for training and evaluation could compromise the reference for people training the model, e.g. me. Wouldn't it be better to add an option to select an external audio file for TensorBoard instead of picking one from the dataset?

A better alternative is to exclude the first sample from the training loader and use it exclusively for evaluation.
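A small sketch of that hold-out idea, assuming the dataset is available as an ordered list of sample paths. The hold_out_first function is a hypothetical helper written for illustration; it does not reflect how the training loader in train.py is actually built.

```python
from typing import List, Sequence, Tuple


def hold_out_first(samples: Sequence[str]) -> Tuple[str, List[str]]:
    """Reserve the first sample for evaluation and train on the rest."""
    if not samples:
        raise ValueError("dataset is empty")
    reference = samples[0]          # only ever used for TensorBoard evaluation
    training = list(samples[1:])    # never contains the reference sample
    return reference, training
```

With this split the model never sees the evaluation clip during training, so the audio logged to TensorBoard stays an unseen reference.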

BornSaint commented 2 days ago

I found these comments in rvc/train/train.py (lines 441-442):

# get the first sample as reference for tensorboard evaluation
# custom reference temporarily disabled

Would I have any issues enabling it in Applio 3.2.7?

AznamirWoW commented 2 days ago

How to create your own reference:

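  1. Prepare a .wav file, no longer than 5 seconds.
  2. Use the training tab to create a new model at the desired sampling rate, let's say 32000.
  • In preprocess, uncheck audio cutting and process audio.
  • Run preprocess, then run feature extraction.
  3. Move the files to the reference folder and rename them as listed:
  • .wav file from sliced audios, rename to ref32000.wav
  • .wav.npy file from the f0 folder, rename to ref32000_f0c.npy
  • .wav.npy file from the f0_voiced folder, rename to ref32000_f0f.npy
  • .npy file from the v2_extracted folder, rename to ref32000_feats.npy
  These files should replace what was provided in /logs/reference with the 3.2.7 release.
  4. Remove "True == False and" from the train.py code.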

BornSaint commented 2 days ago

Many thanks, love it! You can close it if you wish.

AirJCovers34 commented 1 day ago

How to create your own reference:

  1. Prepare a .wav file, no longer than 5 seconds.
  2. Use the training tab to create a new model at the desired sampling rate, let's say 32000.
  • In preprocess, uncheck audio cutting and process audio.
  • Run preprocess, then run feature extraction.
  3. Move the files to the reference folder and rename them as listed:
  • .wav file from sliced audios, rename to ref32000.wav
  • .wav.npy file from the f0 folder, rename to ref32000_f0c.npy
  • .wav.npy file from the f0_voiced folder, rename to ref32000_f0f.npy
  • .npy file from the v2_extracted folder, rename to ref32000_feats.npy
  These files should replace what was provided in /logs/reference with the 3.2.7 release.
  4. Remove "True == False and" from the train.py code.
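As a rough illustration, the move-and-rename step from the list above could be scripted like this. The folder names (sliced_audios, f0, f0_voiced, v2_extracted), the shared file stem, and the install_reference helper are assumptions made for the sketch, and the script copies rather than moves so the preprocessed files stay in place. Check the actual layout of your model's logs folder before relying on it.

```python
import shutil
from pathlib import Path


def install_reference(model_dir: Path, reference_dir: Path, sample_rate: int, stem: str) -> list:
    """Copy one preprocessed sample into the reference folder under the expected names."""
    prefix = f"ref{sample_rate}"
    # Source file -> new name. The source folder names are assumed, not confirmed.
    mapping = {
        model_dir / "sliced_audios" / f"{stem}.wav": f"{prefix}.wav",
        model_dir / "f0" / f"{stem}.wav.npy": f"{prefix}_f0c.npy",
        model_dir / "f0_voiced" / f"{stem}.wav.npy": f"{prefix}_f0f.npy",
        model_dir / "v2_extracted" / f"{stem}.npy": f"{prefix}_feats.npy",
    }
    # Fail before copying anything if one of the four files is missing.
    for source in mapping:
        if not source.is_file():
            raise FileNotFoundError(f"missing expected file: {source}")
    reference_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for source, new_name in mapping.items():
        target = reference_dir / new_name
        shutil.copy2(source, target)  # overwrites any existing reference file
        copied.append(target)
    return copied
```

For a 32000 Hz model this produces ref32000.wav, ref32000_f0c.npy, ref32000_f0f.npy and ref32000_feats.npy in the reference folder.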

That's exactly what I was trying to do. But when starting the training, I get this error:

Running on local URL:  http://127.0.0.1:6927

To create a public link, set `share=True` in `launch()`.
Starting preprocess with 8 processes...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00,  5.60s/it]
Preprocess completed in 5.61 seconds on 00:00:04 seconds of audio.
Starting pitch extraction with 8 cores on cuda:0 using rmvpe...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00,  1.38s/it]
Pitch extraction completed in 7.17 seconds.
Starting embedding extraction with 8 cores on cuda:0...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.81it/s]
Embedding extraction completed in 6.87 seconds.
Starting preprocess with 8 processes...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:34<00:00, 34.56s/it]
Preprocess completed in 34.56 seconds on 00:34:48 seconds of audio.
Starting pitch extraction with 8 cores on cuda:0 using rmvpe...
  0%|                                                                                            | 0/1 [00:00<?, ?it/s]An error occurred extracting file C:\ApplioV327\logs\Test_BensonBoone\sliced_audios_16k\0_0_0.wav on cuda:0: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input.
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:16<00:00, 16.20s/it]
Pitch extraction completed in 21.78 seconds.
Starting embedding extraction with 8 cores on cuda:0...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:35<00:00, 35.82s/it]
Embedding extraction completed in 41.39 seconds.
Starting training...
Loaded pretrained (G) 'rvc\models\pretraineds\pretraineds_custom\G-f048k-TITAN-Medium.pth'
Loaded pretrained (D) 'rvc\models\pretraineds\pretraineds_custom\D-f048k-TITAN-Medium.pth'
Process Process-1:
Traceback (most recent call last):
  File "C:\ApplioV327\env\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "C:\ApplioV327\env\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "C:\ApplioV327\rvc\train\train.py", line 482, in run
    train_and_evaluate(
  File "C:\ApplioV327\rvc\train\train.py", line 680, in train_and_evaluate
    if loss_mel > 75:
UnboundLocalError: local variable 'loss_mel' referenced before assignment
Saved index file 'C:\ApplioV327\logs\Test_BensonBoone\added_Test_BensonBoone_v2.index'

Any idea what I might be doing wrong? 🤔

AznamirWoW commented 1 day ago

Any idea what I might be doing wrong? 🤔

Don't train on those small references. Use the wav, the two f0 files, and the feature file as references instead.
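One plausible reading of the traceback, offered as an assumption rather than a diagnosis of Applio's code: if a variable such as loss_mel is only assigned inside the batch loop, and a dataset of a few seconds yields no batches at all, the check after the loop fails in exactly this way. The train_one_epoch function below is a stand-in written for illustration, and the clamp to 75 is invented to give the check a body.

```python
def train_one_epoch(batches):
    """Mimic a loop that only assigns loss_mel inside the batch loop."""
    for batch in batches:
        loss_mel = sum(batch) / len(batch)  # stand-in for the real mel loss
    # With zero batches, loss_mel was never assigned, so this line raises.
    if loss_mel > 75:
        loss_mel = 75  # hypothetical clamp; the real body is not shown in the log
    return loss_mel


def describe_failure():
    """Run the loop on an empty loader and report the exception type."""
    try:
        train_one_epoch([])  # empty loader, e.g. a dataset too small to form a batch
    except UnboundLocalError as exc:
        return type(exc).__name__
    return "no error"
```

Under that assumption, training on a normal-sized dataset and using the short clip only as the reference avoids the empty loop.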

AirJCovers34 commented 23 hours ago

Dont train on those small references. Use wav, two f0 files and feature file as references instead.

Could you elaborate, please?

AznamirWoW commented 22 hours ago

Dont train on those small references. Use wav, two f0 files and feature file as references instead.

Could you elaborate, please?

To make reference files, you just need to run preprocess and feature extraction, then use the files they generate to replace the references in the logs/reference folder.

AirJCovers34 commented 19 hours ago

to make reference files you just need to do preprocess and extract features and use the files generated from those to replace references in logs/reference folder

That's exactly what I did. But it seems the error lies now at another level... 😥

image

AznamirWoW commented 19 hours ago

Hmm... okay, I kinda expected that. There's some alignment between pitch and phoneme tensors that needs to be made and it is quite annoying for random sample sizes

AirJCovers34 commented 19 hours ago

Hmm... okay, I kinda expected that. There's some alignment between pitch and phoneme tensors that needs to be made and it is quite annoying for random sample sizes

Is it possible to fix this issue? Or should I accept that training won't be possible with version 3.2.7?

AznamirWoW commented 19 hours ago

You can disable the custom reference and fall back to the original 3.2.6 method of picking a random sample from the training set. Or you can try making a different size of reference audio.

What I had included with 3.2.7 was this

G:\ApplioV3.2.7\logs\reference>python
Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import soundfile as sf
>>> import librosa
>>> import numpy as np
>>> audio, sr = librosa.load(r"G:\ApplioV3.2.7\logs\reference\ref48000.wav", sr=48000)
>>> print(audio.shape)
(147122,)
>>> f0c = np.load(r"G:\ApplioV3.2.7\logs\reference\ref48000_f0c.npy")
>>> f0f = np.load(r"G:\ApplioV3.2.7\logs\reference\ref48000_f0f.npy")
>>> feats = np.load(r"G:\ApplioV3.2.7\logs\reference\ref48000_feats.npy")
>>> print(f0c.shape)
(307,)
>>> print(f0f.shape)
(307,)
>>> print(feats.shape)
(153, 768)

The features get expanded 2x along the time axis (153 -> 306), and the pitch arrays get their last element trimmed (307 -> 306),

so they match each other in size.
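The size matching described here can be sketched with NumPy. This is a minimal illustration, assuming the 2x expansion repeats each feature frame and that the pitch arrays are cut to the shortest common length; align_reference is a hypothetical helper, and the actual code in train.py may handle the trimming differently.

```python
import numpy as np


def align_reference(feats: np.ndarray, f0c: np.ndarray, f0f: np.ndarray):
    """Expand features 2x in time, then trim everything to a common length."""
    feats_up = np.repeat(feats, 2, axis=0)  # (153, 768) -> (306, 768)
    # Trimming to the shortest length is an assumption; the thread only
    # describes dropping one pitch frame (307 -> 306).
    n = min(feats_up.shape[0], f0c.shape[0], f0f.shape[0])
    return feats_up[:n], f0c[:n], f0f[:n]


if __name__ == "__main__":
    # Placeholder arrays with the shapes printed in the session above.
    feats = np.zeros((153, 768), dtype=np.float32)
    f0c = np.zeros(307, dtype=np.int64)
    f0f = np.zeros(307, dtype=np.float32)
    for name, arr in zip(("feats", "f0c", "f0f"), align_reference(feats, f0c, f0f)):
        print(name, arr.shape)
```

After alignment all three arrays share the same first dimension, which is the property that has to hold for the reference to be usable.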

AirJCovers34 commented 18 hours ago

On my side, I get this:

C:\ApplioV327\logs\reference>python
Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import soundfile as sf
>>> import librosa
>>> import numpy as np
>>> audio, sr = librosa.load(r"C:\ApplioV327\logs\reference\ref48000.wav", sr=48000)
>>> print(audio.shape)
(100258259,)
>>> (147122,)
(147122,)
>>> f0c = np.load(r"C:\ApplioV327\logs\reference\ref48000_f0c.npy")
>>> f0f = np.load(r"C:\ApplioV327\logs\reference\ref48000_f0f.npy")
>>> feats = np.load(r"C:\ApplioV327\logs\reference\ref48000_feats.npy")
>>> print(f0c.shape)
(401,)
>>> (307,)
(307,)
>>> print(f0f.shape)
(401,)
>>> (307,)
(307,)
>>> print(feats.shape)
(199, 768)
>>> (153, 768)

AznamirWoW commented 17 hours ago

Why is your reference wav so big? (100258259,) - that's 30+ minutes.

I said use a 5-10 sec sample at most.
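The duration follows directly from the sample count and the sample rate. A quick sketch of that arithmetic, using the 48000 Hz rate from the session above; the helper names and the 10-second default limit are assumptions taken from this comment, not constants from Applio.

```python
def duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Length of a mono clip in seconds."""
    return num_samples / sample_rate


def is_reference_length_ok(num_samples: int, sample_rate: int, max_seconds: float = 10.0) -> bool:
    """Check a clip against the suggested upper bound for a reference sample."""
    return duration_seconds(num_samples, sample_rate) <= max_seconds


if __name__ == "__main__":
    for samples in (147122, 100258259):
        seconds = duration_seconds(samples, 48000)
        minutes, rest = divmod(seconds, 60)
        print(f"{samples} samples: {int(minutes)} min {rest:.1f} s, ok={is_reference_length_ok(samples, 48000)}")
```

The shipped reference (147122 samples) is about 3 seconds long, while 100258259 samples is just under 35 minutes, which matches the "00:34:48" reported by the preprocess step in the earlier log.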

AirJCovers34 commented 17 hours ago

Why your reference wav is so big? (100258259,) - that's 30 minutes+

I said use a 5-10 sec sample at most.

File error when replacing.. 😉😂 It's better now.

C:\ApplioV327\logs\reference>python
Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr  5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import soundfile as sf
>>> import librosa
>>> import numpy as np
>>> audio, sr = librosa.load(r"C:\ApplioV327\logs\reference\ref48000.wav", sr=48000)
>>> print(audio.shape)
(192001,)
>>> (147122,)
(147122,)
>>> f0c = np.load(r"C:\ApplioV327\logs\reference\ref48000_f0c.npy")
>>> f0f = np.load(r"C:\ApplioV327\logs\reference\ref48000_f0f.npy")
>>> feats = np.load(r"C:\ApplioV327\logs\reference\ref48000_feats.npy")
>>> print(f0c.shape)
(401,)
>>> (307,)
(307,)
>>> print(f0f.shape)
(401,)
>>> (307,)
(307,)
>>> print(feats.shape)
(199, 768)
>>> (153, 768)
(153, 768)
>>>
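To compare this reference with the one shipped in 3.2.7, the printed shapes can be turned into frame rates and into the gap left after the 2x feature expansion. This is a worked calculation on the numbers from this thread only; the function names are hypothetical, and nothing here is taken from Applio's source.

```python
def frames_per_second(num_frames: int, num_samples: int, sample_rate: int) -> float:
    """Average number of frames per second of audio."""
    return num_frames * sample_rate / num_samples


def length_gap(feat_frames: int, f0_frames: int) -> int:
    """How many pitch frames exceed the 2x-expanded feature length."""
    return f0_frames - 2 * feat_frames


if __name__ == "__main__":
    cases = [
        ("shipped with 3.2.7", 147122, 307, 153),
        ("custom reference", 192001, 401, 199),
    ]
    for label, samples, f0_frames, feat_frames in cases:
        f0_rate = frames_per_second(f0_frames, samples, 48000)
        feat_rate = frames_per_second(feat_frames, samples, 48000)
        gap = length_gap(feat_frames, f0_frames)
        print(f"{label}: f0 {f0_rate:.1f}/s, feats {feat_rate:.1f}/s, gap {gap}")
```

Both references come out at roughly 100 pitch frames and 50 feature frames per second. The gap after expansion is 1 frame for the shipped reference and 3 frames for the custom one, so trimming a single pitch frame would not be enough to make the custom arrays match.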