Open codename0og opened 1 month ago
@RVC-Boss We already know that the VCTK corpus (108-speaker version) was used as the dataset, but what about the processing? Was anything applied to it?
I just checked some samples in the VCTK dataset and the quality is really bad:
- tons of mouth clicks
- loud mic noise
- low-frequency rumbling noise (could be a DC offset issue)
- lacks breath sounds
- lacks pitch variation (the speakers' pitch just sits at about 110 Hz to 200 Hz)
- lacks higher harmonic detail (which causes flipping-harmonic and static-harmonic artifacts)
I don't think they even applied any processing to the audio, and the dataset is bad in the first place.
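For anyone who wants to test the DC offset suspicion on their own copies of the files, here is a minimal sketch. It assumes the audio is already decoded into a list of float samples in [-1.0, 1.0]; the function names `dc_offset` and `remove_dc_offset` are made up for illustration, and it uses plain Python lists only to stay dependency-free. Subtracting the mean only removes a constant offset. Actual low-frequency rumble would need a high-pass filter, which is not shown here.

```python
import math


def dc_offset(samples):
    """Return the mean sample value; a value far from 0 suggests DC offset."""
    return sum(samples) / len(samples)


def remove_dc_offset(samples):
    """Subtract the mean so the waveform is centered on zero."""
    offset = dc_offset(samples)
    return [s - offset for s in samples]


# Hypothetical test signal: 1 second of a 100 Hz sine at 16 kHz,
# shifted upward by 0.05 to simulate a DC offset.
sample_rate = 16000
shifted = [
    0.5 * math.sin(2 * math.pi * 100 * n / sample_rate) + 0.05
    for n in range(sample_rate)
]
centered = remove_dc_offset(shifted)
```

Comparing `dc_offset(shifted)` with `dc_offset(centered)` shows whether the shift was there and whether it is gone afterwards.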
@RVC-Boss We already know that the VCTK corpus (108-speaker version) was used as the dataset, but what about the processing? Was anything applied to it?
I am asking because, even though I've trained tons of models, I still haven't been able to draw any useful conclusions on this from my own training runs:
A) Is it better to limit the dynamic range of the dataset as much as possible (without introducing distortion, of course)?
B) Or to keep it somewhat natural (slight peak taming plus slight compression to even things out, then overall normalization to -2 or -3 dB)?
C) Or to take care of the harsher peaks (or peaks in general) but leave the dynamic range alone?
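To make the options above easier to compare, here is a rough sketch of the normalization step from option B plus one way to measure how much dynamic range a file has. It assumes float samples in [-1.0, 1.0] and reads "-3 dB" as -3 dBFS peak; the function names are hypothetical and nothing here reflects how the pretrains were actually prepared. Crest factor (peak level minus RMS level) is used as a simple stand-in for dynamic range.

```python
import math


def peak_dbfs(samples):
    """Peak level in dBFS for float samples in [-1.0, 1.0]."""
    peak = max(abs(s) for s in samples)
    return -math.inf if peak == 0 else 20 * math.log10(peak)


def rms_dbfs(samples):
    """RMS level in dBFS."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return -math.inf if mean_square == 0 else 10 * math.log10(mean_square)


def crest_factor_db(samples):
    """Peak-to-RMS ratio in dB; lower values mean a more compressed signal."""
    return peak_dbfs(samples) - rms_dbfs(samples)


def normalize_peak(samples, target_dbfs=-3.0):
    """Scale the signal so its highest peak sits at target_dbfs."""
    peak = max(abs(s) for s in samples)
    if peak == 0:
        return list(samples)  # silence: nothing to scale
    gain = 10 ** (target_dbfs / 20) / peak
    return [s * gain for s in samples]


# Hypothetical test signal: 1 second of a quiet 440 Hz sine at 16 kHz.
sample_rate = 16000
tone = [
    0.2 * math.sin(2 * math.pi * 440 * n / sample_rate)
    for n in range(sample_rate)
]
normalized = normalize_peak(tone, target_dbfs=-3.0)
```

Peak normalization is a pure gain change, so `crest_factor_db` is the same before and after. Only compression or limiting (options A and B) would lower it, which makes it a handy number for checking how aggressively a dataset has been squashed.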
What kind of approach do you think would be suitable for your pretrains? I'd really benefit from this information, and I'm pretty sure some other more advanced users would too. Thank you in advance!