Open BornSaint opened 2 days ago
My alternative is actually already implemented.
I guess this commit changes the random TensorBoard audio to the first audio from the dataset for evaluation, but it still compromises the reference, like I said in my comment on that commit page.
Is the first sample not used in training? Using the same audio for training and eval could compromise the reference for people training the model, e.g. me. Wouldn't it be better to add an option to select an external audio for TensorBoard instead of picking one from the dataset?
I found these comments in rvc/train/train.py:
441 # get the first sample as reference for tensorboard evaluation
442 # custom reference temporarily disabled
Would I have any issue enabling it in Applio 3.2.7?
How to create your own reference:
1) prepare a .wav file, no longer than 5 seconds
2) use the training tab to create a new model at the desired sampling rate, let's say 32000
3) run preprocess and feature extraction, then move and rename the generated files into the reference folder (full list of files below)
4) remove True == False and from the train.py code
Many thanks, love it! You can close it if you wish.
How to create your own reference:
- prepare a .wav file, no longer than 5 seconds
- use the training tab to create a new model at the desired sampling rate, let's say 32000
- in preprocess, uncheck audio cutting and process audio
- run preprocess, run feature extraction
- move the files to the reference folder and rename them as listed:
  - the .wav file from the sliced_audios folder, rename to ref32000.wav
  - the .wav.npy file from the f0 folder, rename to ref32000_f0c.npy
  - the .wav.npy file from the f0_voiced folder, rename to ref32000_f0f.npy
  - the .npy file from the v2_extracted folder, rename to ref32000_feats.npy
  These files should replace what was provided in /logs/reference with the 3.2.7 release.
- remove True == False and from the train.py code
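If it helps, a quick sanity check like the one below can confirm the renamed files are in place before touching train.py. This is just a convenience sketch following the naming in the list above, not part of Applio:

```python
import os

# Hypothetical helper: verify the renamed reference files exist before training.
# The ref{sr}* names and the logs/reference folder follow the steps above.
sr = 32000
reference_dir = os.path.join("logs", "reference")
expected = [
    f"ref{sr}.wav",        # sliced audio
    f"ref{sr}_f0c.npy",    # coarse f0 (from the f0 folder)
    f"ref{sr}_f0f.npy",    # voiced f0 (from the f0_voiced folder)
    f"ref{sr}_feats.npy",  # features (from the v2_extracted folder)
]
for name in expected:
    path = os.path.join(reference_dir, name)
    print(f"{name}: {'ok' if os.path.isfile(path) else 'MISSING'}")
```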
That's exactly what I was trying to do. But when starting the training, I get this error:
Running on local URL: http://127.0.0.1:6927
To create a public link, set `share=True` in `launch()`.
Starting preprocess with 8 processes...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.60s/it]
Preprocess completed in 5.61 seconds on 00:00:04 seconds of audio.
Starting pitch extraction with 8 cores on cuda:0 using rmvpe...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:01<00:00, 1.38s/it]
Pitch extraction completed in 7.17 seconds.
Starting embedding extraction with 8 cores on cuda:0...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.81it/s]
Embedding extraction completed in 6.87 seconds.
Starting preprocess with 8 processes...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:34<00:00, 34.56s/it]
Preprocess completed in 34.56 seconds on 00:34:48 seconds of audio.
Starting pitch extraction with 8 cores on cuda:0 using rmvpe...
0%| | 0/1 [00:00<?, ?it/s]An error occurred extracting file C:\ApplioV327\logs\Test_BensonBoone\sliced_audios_16k\0_0_0.wav on cuda:0: cuDNN error: CUDNN_STATUS_NOT_SUPPORTED. This error may appear if you passed in a non-contiguous input.
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:16<00:00, 16.20s/it]
Pitch extraction completed in 21.78 seconds.
Starting embedding extraction with 8 cores on cuda:0...
100%|████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:35<00:00, 35.82s/it]
Embedding extraction completed in 41.39 seconds.
Starting training...
Loaded pretrained (G) 'rvc\models\pretraineds\pretraineds_custom\G-f048k-TITAN-Medium.pth'
Loaded pretrained (D) 'rvc\models\pretraineds\pretraineds_custom\D-f048k-TITAN-Medium.pth'
Process Process-1:
Traceback (most recent call last):
File "C:\ApplioV327\env\lib\multiprocessing\process.py", line 315, in _bootstrap
self.run()
File "C:\ApplioV327\env\lib\multiprocessing\process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "C:\ApplioV327\rvc\train\train.py", line 482, in run
train_and_evaluate(
File "C:\ApplioV327\rvc\train\train.py", line 680, in train_and_evaluate
if loss_mel > 75:
UnboundLocalError: local variable 'loss_mel' referenced before assignment
Saved index file 'C:\ApplioV327\logs\Test_BensonBoone\added_Test_BensonBoone_v2.index'
Any idea what I might be doing wrong? 🤔
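(Aside: an UnboundLocalError at `if loss_mel > 75` means that line was reached before loss_mel was ever assigned in the function, e.g. because the block that computes it didn't run for this data. A minimal generic reproduction of the pattern, not Applio's actual training loop:)

```python
# Generic reproduction of the failure mode, not Applio's actual code.
def train_and_evaluate(data_loader):
    for batch in data_loader:
        loss_mel = sum(batch) / len(batch)  # stand-in for the real mel loss
    # If the loop above produced zero iterations, loss_mel was never assigned:
    if loss_mel > 75:
        print("clipping mel loss")

train_and_evaluate([])  # UnboundLocalError: local variable 'loss_mel' referenced before assignment
```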
Don't train on those small references. Use a wav, two f0 files and a feature file as references instead.
Could you elaborate, please?
To make the reference files you just need to run preprocess and feature extraction, then use the generated files to replace the references in the logs/reference folder.
That's exactly what I did. But it seems the error lies now at another level... 😥
Hmm... okay, I kinda expected that. There's some alignment between the pitch and phoneme tensors that needs to be done, and it is quite annoying for arbitrary sample sizes.
Is it possible to fix this issue? Or should I accept that training won't be possible with version 3.2.7?
You can disable the custom reference and fall back to the original 3.2.6 method of picking a random sample from the training set. Or you can try making a different size of reference audio.
What I had included with 3.2.7 was this
G:\ApplioV3.2.7\logs\reference>python
Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import soundfile as sf
>>> import librosa
>>> import numpy as np
>>> audio, sr = librosa.load(r"G:\ApplioV3.2.7\logs\reference\ref48000.wav", sr=48000)
>>> print(audio.shape)
(147122,)
>>> f0c = np.load(r"G:\ApplioV3.2.7\logs\reference\ref48000_f0c.npy")
>>> f0f = np.load(r"G:\ApplioV3.2.7\logs\reference\ref48000_f0f.npy")
>>> feats = np.load(r"G:\ApplioV3.2.7\logs\reference\ref48000_feats.npy")
>>> print(f0c.shape)
(307,)
>>> print(f0f.shape)
(307,)
>>> print(feats.shape)
(153, 768)
The feature gets expanded 2x (153 -> 306) and the pitch gets the last dimension trimmed (307 -> 306), so they match each other in size.
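In numpy terms the alignment is roughly this (a sketch built from the shapes in the session above; the variable names are mine and the real code presumably operates on torch tensors):

```python
import numpy as np

# Shapes taken from the 3.2.7 reference files shown above.
feats = np.zeros((153, 768), dtype=np.float32)  # extracted features
f0c = np.zeros((307,), dtype=np.float32)        # coarse pitch
f0f = np.zeros((307,), dtype=np.float32)        # voiced pitch

# Feature frames get repeated 2x along time: 153 -> 306.
feats = np.repeat(feats, 2, axis=0)

# Pitch gets trimmed to the same length: 307 -> 306.
f0c = f0c[: feats.shape[0]]
f0f = f0f[: feats.shape[0]]

print(feats.shape, f0c.shape, f0f.shape)  # (306, 768) (306,) (306,)
```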
On my side, I get this:
C:\ApplioV327\logs\reference>python
Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import soundfile as sf
>>> import librosa
>>> import numpy as np
>>> audio, sr = librosa.load(r"C:\ApplioV327\logs\reference\ref48000.wav", sr=48000)
>>> print(audio.shape)
(100258259,)
>>> (147122,)
(147122,)
>>> f0c = np.load(r"C:\ApplioV327\logs\reference\ref48000_f0c.npy")
>>> f0f = np.load(r"C:\ApplioV327\logs\reference\ref48000_f0f.npy")
>>> feats = np.load(r"C:\ApplioV327\logs\reference\ref48000_feats.npy")
>>> print(f0c.shape)
(401,)
>>> (307,)
(307,)
>>> print(f0f.shape)
(401,)
>>> (307,)
(307,)
>>> print(feats.shape)
(199, 768)
>>> (153, 768)
Why is your reference wav so big? (100258259,) - that's 30+ minutes.
I said use a 5-10 sec sample at most.
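(Quick arithmetic: duration is the sample count divided by the sample rate.)

```python
samples = 100258259
seconds = samples / 48000  # ≈ 2088.7 s
minutes = seconds / 60     # ≈ 34.8 minutes
```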
File error when replacing.. 😉😂 It's better now.
C:\ApplioV327\logs\reference>python
Python 3.10.11 (tags/v3.10.11:7d4cc5a, Apr 5 2023, 00:38:17) [MSC v.1929 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import soundfile as sf
>>> import librosa
>>> import numpy as np
>>> audio, sr = librosa.load(r"C:\ApplioV327\logs\reference\ref48000.wav", sr=48000)
>>> print(audio.shape)
(192001,)
>>> (147122,)
(147122,)
>>> f0c = np.load(r"C:\ApplioV327\logs\reference\ref48000_f0c.npy")
>>> f0f = np.load(r"C:\ApplioV327\logs\reference\ref48000_f0f.npy")
>>> feats = np.load(r"C:\ApplioV327\logs\reference\ref48000_feats.npy")
>>> print(f0c.shape)
(401,)
>>> (307,)
(307,)
>>> print(f0f.shape)
(401,)
>>> (307,)
(307,)
>>> print(feats.shape)
(199, 768)
>>> (153, 768)
(153, 768)
>>>
Description
When training, the script chooses one audio from the dataset to show on TensorBoard each epoch, but using an audio with the same features the model is trained on makes it hard to tell whether the training is good enough. I can still see from the loss graph if it's starting to overfit, but hearing the audio would help when I can't train for long and the quality is already acceptable, so I know when to stop training.
Problem
Already covered in the description above.
Proposed Solution
Add an option for the CLI script to pick an audio, something like --tensorboard-audio "/path/to/audio/file"; for the GUI, a Gradio element to pick the audio would do.
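A minimal sketch of how that flag could be wired in, assuming an argparse-style CLI (the flag name and plumbing are the proposal itself, not existing Applio code):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--tensorboard-audio",
    type=str,
    default=None,
    help="path to an external .wav to log to TensorBoard instead of a dataset sample",
)
args = parser.parse_args()

# Later, wherever the evaluation audio is picked:
# eval_audio_path = args.tensorboard_audio or first_sample_from_dataset
```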
Alternatives Considered
Not exactly an alternative, but an auto-stop for training would be awesome when the loss stops improving: for example, --auto-stop 10 would stop if the model doesn't get better over the next 10 epochs, and reset the count if it does.
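A rough sketch of that patience-style auto-stop (the helper name and the improvement check are illustrative only):

```python
def should_stop(loss_history, patience=10):
    # Stop when none of the last `patience` epochs improved on the best loss
    # seen before them; any improvement effectively resets the count.
    if len(loss_history) <= patience:
        return False
    best_before = min(loss_history[:-patience])
    return min(loss_history[-patience:]) >= best_before

# e.g. --auto-stop 10: no improvement for 10 straight epochs -> stop training
losses = [3.0, 2.5, 2.4] + [2.45] * 10
print(should_stop(losses, patience=10))  # True
```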