Open kyakuno opened 3 years ago
1st stage https://github.com/Hiroshiba/yukarin
2nd stage https://github.com/Hiroshiba/become-yukarin
3rd stage https://github.com/Hiroshiba/realtime-yukarin
"Became Yuduki Yukari's voice with the power of deep learning" (2018-02-13) https://blog.hiroshiba.jp/became-yuduki-yukari-with-deep-learning-power/
"I want to do voice conversion with deep learning!" (2017-12-10) https://blog.hiroshiba.jp/voice-conversion-deep-leanring-and-other-delusions/
"Tried non-parallel voice conversion with CycleGAN" (2018-04-22) https://blog.hiroshiba.jp/became-yuduki-yukari-with-cycle-gan-power/
About the parallel data required for training: https://medium.com/@crosssceneofwindff/%E7%BE%8E%E5%B0%91%E5%A5%B3%E5%A3%B0%E3%81%B8%E3%81%AE%E5%A4%89%E6%8F%9B%E3%81%A8%E5%90%88%E6%88%90-fe251a8e6933 https://aidiary.hatenablog.com/entry/20150310/1425983455 https://www.jstage.jst.go.jp/article/jasj/72/6/72_324/_pdf
Processing that aligns the time axis of parallel recordings of the same spoken content (the author says it did not work out well, but to me it seems to work fairly well): https://blog.hiroshiba.jp/sandbox-alignment-voice-actress-data/
A commercially usable text-to-speech application that kyakuno-san introduced (it bills itself as "free, medium-quality text-to-speech software", but personally I find the quality high). Having it read texts aloud and using that audio as the target seems promising. (Made by hiroshiba-san.) https://voicevox.hiroshiba.jp/ ▼ Its GitHub repository (it appears text can be turned into audio from the command line, with JSON as input): https://github.com/Hiroshiba/voicevox
Qiita article "Five datasets you can use for speech recognition for free": https://qiita.com/yarimoto/items/98711f23f90ea068730b ▼ The speech dataset published by Mozilla (it contains Japanese text together with recordings of it being read aloud): https://commonvoice.mozilla.org/ja/datasets
Installing node.js on a Mac (the version voicevox specifies is 14.17.4): https://qiita.com/kyosuke5_20/items/c5f68fc9d89b84c0df09 ▼ Somehow I could not get it working properly... and inference seems slow without a GPU, so I gave up on running it on a Mac.
A collection of voice datasets: https://github.com/jim-schwoebel/voice_datasets
Related technique (1): AUTOVC: Zero-Shot Voice Style Transfer with Only Autoencoder Loss https://github.com/auspicious3000/autovc https://auspicious3000.github.io/autovc-demo/
Related technique (2): Assem-VC — Official PyTorch Implementation https://github.com/mindslab-ai/assem-vc https://mindslab-ai.github.io/assem-vc/
Related technique (3): VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion (Interspeech 2021) https://github.com/Wendison/VQMIVC https://wendison.github.io/VQMIVC-demo/
Related technique (4): MediumVC https://github.com/BrightGu/MediumVC https://brightgu.github.io/MediumVC/ https://arxiv.org/pdf/2110.02500.pdf
Related technique (5): SingleVC https://github.com/BrightGu/SingleVC https://brightgu.github.io/SingleVC/
Related technique (6): StarGANv2-VC https://github.com/yl4579/StarGANv2-VC https://starganv2-vc.github.io/
An article that explains voice-conversion terminology in detail: https://blog.nefrock.com/entry/2020/03/17/171730
An article sharing know-how for improving accuracy with yukarin: https://qiita.com/atticatticattic/items/848869a32413a378ee6d
An article explaining pyworld, which yukarin uses: https://qiita.com/ohtaman/items/84426cee09c2ba4abc22
Paper survey (Singing Voice Conversion): https://aria3366.hatenablog.com/
Let me record here what I actually did to train with yukarin.
Proceeding along the README (Japanese version) of the Yukarin repository, the very first instruction is to install the required libraries with
pip install -r requirements.txt
whose contents are as follows.
https://github.com/Hiroshiba/yukarin/blob/master/requirements.txt
numpy
cupy<6.0.0
chainer<6.0.0
librosa<0.7.0
pysptk
pyworld
matplotlib
tensorflow
tqdm
git+https://github.com/neka-nat/tensorboard-chainer
git+https://github.com/Hiroshiba/become-yukarin
What stands out here is that cupy, chainer, and librosa are pinned to old versions.
Checking the release history of each on PyPI, I concretely installed them as follows.
For reference, my environment is Ubuntu 18.04.6 LTS with an RTX 3090 GPU.
Because I am using an RTX 3090, CUDA must be version 11 or higher, which in turn requires cuDNN version 8 or higher.
!pip install cupy==5.4.0 # <6.0.0
!pip install chainer==5.4.0 # <6.0.0
!pip install librosa==0.6.3 # <0.7.0
As a result, chainer and librosa installed at the pinned versions, but cupy did not.
The reason is probably the following.
**************************************************
*** WARNING: Unsupported cuDNN version: 8005
*** WARNING: cuDNN v5000= and <=v7999 is required
**************************************************
That said, searching the repository for where cupy is actually used, it does not appear to be in any critical processing path, so I decided to leave it as-is for now.
The yukarin repository performs voice conversion using parallel data, so that data has to be prepared.
The conversion setting is "convert a specific person's voice into a desired person's voice", i.e. a one-to-one setting.
Reference: https://blog.nefrock.com/entry/2020/03/17/171730
For the conversion-target data, I would like to use VoiceVox, which kyakuno-san shared above. As of 2021-11-28, VoiceVox audio appears to be permitted for both commercial and non-commercial use, provided the source is credited.
Terms of use for the character "Kasukabe Tsumugi": https://tsukushinyoki10.wixsite.com/ktsumugiofficial/%E5%88%A9%E7%94%A8%E8%A6%8F%E7%B4%84
For using VoiceVox, working from the voicevox_engine repository lets you obtain audio data efficiently.
For example, even for content that takes about 10 seconds to read aloud, the audio is generated in well under 1 second.
(This time is based on my runs using an RTX 3090 GPU for the generation.)
Concretely, I installed VOICEVOX on Ubuntu following the steps on its page. Then, with VOICEVOX running, I built a shell script based on the method described in the voicevox_engine repository's API documentation. For curl, wrapping the target address in double quotes reportedly makes it behave more reliably, so I recommend running it as below. (Reference: this also avoids the "zsh: no matches found:" error.)
echo -n "こんにちは、音声合成の世界へようこそ" >text.txt
curl -s \
-X POST \
"localhost:50021/audio_query?speaker=1" \
--get --data-urlencode text@text.txt \
> query.json
curl -s \
-H "Content-Type: application/json" \
-X POST \
-d @query.json \
"localhost:50021/synthesis?speaker=1" \
> audio.wav
With the target side of the parallel data thus within reach, the next things needed are the texts to have VoiceVox read aloud, plus recordings of someone else reading those same texts.
The yukarin repository itself performs one-to-one voice conversion, but as a feature shipped in ailia-models, which is used by an unspecified number of people, many-to-one would presumably be needed. However, judging from the following yukarin issue, many-to-one does not appear to have been verified:
https://github.com/Hiroshiba/yukarin/issues/49
As a dataset meeting the needs above, there is Common Voice, the speech dataset published by Mozilla. It pairs Japanese texts with audio files (.mp3). There are currently just under 24,000 audio files, and the speaker count seems high (my impression is roughly 10-20 audio files per speaker).
The shell script that batch-generates the target audio from these texts is below. Under ./text/* there are just under 24,000 text files, one per audio file, each containing the read-aloud content corresponding to that audio file.
#!/bin/zsh
for input in ./text/*
do
echo "input = $input"
curl -s \
-X POST \
"localhost:50021/audio_query?speaker=8"\
--get --data-urlencode text@$input \
> query_.json
output=`echo ${input/text/audio}`
output=`echo ${output/.txt/.wav}`
echo "output = $output"
curl -s \
-H "Content-Type: application/json" \
-X POST \
-d @query_.json \
"localhost:50021/synthesis?speaker=8" \
> $output
done
With that, the training data for the yukarin repository has, presumably, been generated.
For the next step, I arrange the files into the following folder structure.
$ tree *_wav/
input_wav/
├── common_voice_ja_19482480.mp3
├── common_voice_ja_19482491.mp3
├── common_voice_ja_19482498.mp3
├── …
├── common_voice_ja_27446518.mp3
├── common_voice_ja_27446519.mp3
└── common_voice_ja_27446520.mp3
target_wav/
├── common_voice_ja_19482480.wav
├── common_voice_ja_19482491.wav
├── common_voice_ja_19482498.wav
├── …
├── common_voice_ja_27446518.wav
├── common_voice_ja_27446519.wav
└── common_voice_ja_27446520.wav
0 directories, 46796 files
Files sharing the same name (minus the extension) contain different speakers saying the same content, and every audio file in the target_wav folder is the voice of the VOICEVOX character "Kasukabe Tsumugi".
According to yukarin's README, the next step is to extract acoustic features from the data. The commands to do so are as follows.
python scripts/extract_acoustic_feature.py \
-i './input_wav/*' \
-o './input_feature/'
python scripts/extract_acoustic_feature.py \
-i './target_wav/*' \
-o './target_feature/'
Looking into scripts/extract_acoustic_feature.py, the acoustic features appear to be extracted based on the following configuration code.
class AcousticParam(object):
    def __init__(
        self,
        sampling_rate: int = 24000,
        pad_second: float = 0,
        threshold_db: float = None,
        frame_period: int = 5,
        order: int = 8,
        alpha: float = 0.466,
        f0_floor: float = 71,
        f0_ceil: float = 800,
        fft_length: int = 1024,
        dtype: str = 'float32',
    ) -> None:
        self.sampling_rate = sampling_rate
        self.pad_second = pad_second
        self.threshold_db = threshold_db
        self.frame_period = frame_period
        self.order = order
        self.alpha = alpha
        self.f0_floor = f0_floor
        self.f0_ceil = f0_ceil
        self.fft_length = fft_length
        self.dtype = dtype

    def _asdict(self):
        return self.__dict__
The point worth noting is sampling_rate: it is set to 24,000 Hz.
The default sample rate of the character audio generated with VOICEVOX was also 24,000 Hz, and a yukarin issue states that the Yukari voice-conversion audio assumes 24000 Hz.
This configuration is then used for the feature extraction in the following code.
f0, t = cls.extract_f0(x=x, fs=fs, frame_period=frame_period, f0_floor=f0_floor, f0_ceil=f0_ceil)
sp = pyworld.cheaptrick(x, f0, t, fs, fft_size=fft_length)
ap = pyworld.d4c(x, f0, t, fs, fft_size=fft_length)
This code matches the feature-extraction method described in the following article. It uses the WORLD library via its Python wrapper: https://r9y9.github.io/nnmnkwii/v0.0.1/nnmnkwii_gallery/notebooks/00-Quick%20start%20guide.html#Acoustic-features
The extracted features themselves are explained in the following PDF: http://www.isc.meiji.ac.jp/~mmorise/lab/publication/paper/SP2017-128.pdf
An overview of speech analysis and synthesis with WORLD is shown in Fig. 1. WORLD analyzes speech at intervals of the frame shift and obtains three parameters per frame. The parameters are the fundamental frequency (F0), the spectral envelope (SP), and the aperiodicity (AP), which correspond, respectively, to the pitch of the voice, the timbre of the voice, and the degree of hoarseness in the voice.
The following Qiita article also describes them: https://qiita.com/ohtaman/items/84426cee09c2ba4abc22
1. Fundamental frequency: the base pitch of the voice
2. Spectral envelope: roughly, a smoothed version of the spectrum; it characterizes the timbre
3. Aperiodicity: captures fluctuation of the vocal-fold vibration and the influence of mixed-in noise
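Of the three, F0 is the most intuitive to compute. As a toy illustration of what "fundamental frequency" means (this is a naive autocorrelation estimate, not WORLD's actual DIO/Harvest algorithm), searching over yukarin's f0_floor/f0_ceil range:

```python
import numpy as np

# synthesize one second of a 220 Hz tone at yukarin's 24,000 Hz sampling rate
fs = 24000
f0_true = 220.0
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * f0_true * t)

# naive F0 estimate on one 1024-sample frame: pick the autocorrelation peak
frame = x[:1024]
ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
# restrict the lag search to yukarin's f0_floor=71 .. f0_ceil=800 Hz range
lo, hi = int(fs / 800), int(fs / 71)
lag = lo + int(np.argmax(ac[lo:hi + 1]))
f0_est = fs / lag  # close to 220 Hz
```

WORLD estimates this per frame (every frame_period = 5 ms here), which is what produces the F0 contour visualized later in this thread.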
Also, the audio files (.mp3) provided in Mozilla's Common Voice dataset have a sample rate of 48,000 Hz. I was concerned about how this sample-rate mismatch would affect yukarin's processing, so to be safe I decided to resample the Common Voice audio from 48,000 Hz down to 24,000 Hz at this point.
The librosa library is used for this, but since librosa does not support .mp3, the .mp3 files first have to be converted to .wav.
This was done with the following code, with reference to these articles: https://algorithm.joho.info/programming/python/pydub-mp3-wav/ https://note.com/npaka/n/n6f421b546024
import glob
import pydub
import librosa
import soundfile as sf

# gather the input .mp3 files (the path is illustrative; adjust to your environment —
# the original snippet assumed `filename` was defined elsewhere)
filename = sorted(glob.glob('./input_wav_48000Hz/*.mp3'))

for filename_tmp in filename:
    # derive the output file name from the input file name (adjust to your environment)
    filename_out = filename_tmp.replace('_48000Hz', '')  # change the folder name (in my case)
    filename_out = filename_out.replace('.mp3', '.wav')  # change the extension
    # read the .mp3
    sound = pydub.AudioSegment.from_mp3(filename_tmp)
    # write it out as .wav
    sound.export(filename_out, format="wav")
    # re-read the .wav, resampling to 24,000 Hz mono
    y, sr = librosa.core.load(filename_out, sr=24000, mono=True)
    # write it out again as 16-bit .wav
    sf.write(filename_out, y, sr, subtype="PCM_16")
This runs on the CPU, so with many files it takes a fair amount of time. Comparing before and after by ear, I noticed almost no degradation in sound quality.
Now that the sample rates of the audio files match, let's run the acoustic feature extraction. Once again, the commands are:
python scripts/extract_acoustic_feature.py \
-i './input_wav/*' \
-o './input_feature/'
python scripts/extract_acoustic_feature.py \
-i './target_wav/*' \
-o './target_feature/'
Running the commands produced the following error.
Traceback (most recent call last):
File "scripts/extract_acoustic_feature.py", line 13, in <module>
from yukarin.acoustic_feature import AcousticFeature
File "/hoge/yukarin/yukarin/__init__.py", line 1, in <module>
from .acoustic_converter import AcousticConverter
File "/hoge/yukarin/yukarin/acoustic_converter.py", line 7, in <module>
import librosa
File "/opt/conda/lib/python3.7/site-packages/librosa/__init__.py", line 12, in <module>
from . import core
File "/opt/conda/lib/python3.7/site-packages/librosa/core/__init__.py", line 109, in <module>
from .time_frequency import * # pylint: disable=wildcard-import
File "/opt/conda/lib/python3.7/site-packages/librosa/core/time_frequency.py", line 10, in <module>
from ..util.exceptions import ParameterError
File "/opt/conda/lib/python3.7/site-packages/librosa/util/__init__.py", line 71, in <module>
from . import decorators
File "/opt/conda/lib/python3.7/site-packages/librosa/util/decorators.py", line 9, in <module>
from numba.decorators import jit as optional_jit
ModuleNotFoundError: No module named 'numba.decorators'
This error was resolved by running pip install numba==0.48.
It appears to be an error that occurs with newer versions of numba.
(Reference: https://github.com/librosa/librosa/issues/1160 )
With that fixed, the processing printed the following logs and finished normally.
{'alpha': 0.466,
'dtype': 'float32',
'enable_overwrite': False,
'f0_ceil': 800,
'f0_floor': 71,
'fft_length': 1024,
'frame_period': 5,
'ignore_feature': ['sp', 'ap'],
'input_glob': './input_wav/*',
'order': 8,
'output': PosixPath('input_feature'),
'pad_second': 0,
'sampling_rate': 24000,
'sampling_rate_for_thresholding': None,
'threshold_db': None}
100%|███████████████████████████████████████████████████████| 23398/23398 [34:56<00:00, 11.16it/s]
{'alpha': 0.466,
'dtype': 'float32',
'enable_overwrite': False,
'f0_ceil': 800,
'f0_floor': 71,
'fft_length': 1024,
'frame_period': 5,
'ignore_feature': ['sp', 'ap'],
'input_glob': './target_wav/*',
'order': 8,
'output': PosixPath('target_feature'),
'pad_second': 0,
'sampling_rate': 24000,
'sampling_rate_for_thresholding': None,
'threshold_db': None}
100%|███████████████████████████████████████████████████████| 23398/23398 [28:54<00:00, 13.49it/s]
I ran this processing on both the input_wav folder and the target_wav folder.
As a result, under the input_feature and target_feature folders, as many .npy files are generated as there are audio files.
# tree ./*_feature/
./input_feature/
├── arguments.json
├── common_voice_ja_19482480.npy
├── common_voice_ja_19482491.npy
├── common_voice_ja_19482498.npy
├── …
├── common_voice_ja_27446518.npy
├── common_voice_ja_27446519.npy
└── common_voice_ja_27446520.npy
./target_feature/
├── arguments.json
├── common_voice_ja_19482480.npy
├── common_voice_ja_19482491.npy
├── common_voice_ja_19482498.npy
├── …
├── common_voice_ja_27446518.npy
├── common_voice_ja_27446519.npy
└── common_voice_ja_27446520.npy
0 directories, 46798 files
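Since the later alignment step pairs features by identical file names across input_feature and target_feature, it may be worth checking that every .npy has a partner before moving on. A small sketch (the folder contents below are toy stand-ins for the real ones):

```python
from pathlib import Path
import tempfile

def unpaired_features(dir1, dir2):
    """Basenames of .npy files present in only one of the two feature folders."""
    names1 = {p.stem for p in Path(dir1).glob("*.npy")}
    names2 = {p.stem for p in Path(dir2).glob("*.npy")}
    return names1 ^ names2  # symmetric difference: unpaired files

# toy demonstration with temporary folders standing in for the real ones
tmp = Path(tempfile.mkdtemp())
a, b = tmp / "input_feature", tmp / "target_feature"
a.mkdir()
b.mkdir()
for name in ("x.npy", "y.npy"):
    (a / name).touch()
(b / "x.npy").touch()
unpaired = unpaired_features(a, b)  # only "y" lacks a partner
```

On the real folders, an empty result confirms the 23,398-to-23,398 pairing is intact (arguments.json is ignored because the glob only matches *.npy).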
Visualizing the contents stored in these npy files gives the results below.
For reference, I first load the original .wav data with librosa (the same way yukarin does internally) and visualize the waveform.
Next is a visualization of the fundamental frequency among the output features (the acoustic feature that expresses the periodicity of the voice and governs pitch).
Next, the features that should have been extracted by the code below were instead nan (presumably because 'ignore_feature': ['sp', 'ap'] appears in the extraction log above, i.e. sp and ap are not saved by default).
sp = pyworld.cheaptrick(x, f0, t, fs, fft_size=fft_length)
ap = pyworld.d4c(x, f0, t, fs, fft_size=fft_length)
feature1.sp = nan
feature2.sp = nan
feature1.ap = nan
feature2.ap = nan
Next is a visualization of the coded aperiodicity among the output features. As for its meaning and usage, I intend to dig in as needed while continuing the analysis.
Next is a visualization of the mel-cepstrum among the output features.
Finally, a visualization of the utterance timing among the output features.
The next step is to align the data. This appears to relate to what repository maintainer Hiroshiba-san wrote in the following article: https://blog.hiroshiba.jp/sandbox-alignment-voice-actress-data/
The processing command is as follows.
python scripts/extract_align_indexes.py \
-i1 './input_feature/*.npy' \
-i2 './target_feature/*.npy' \
-o './aligned_indexes/'
Running this printed the following logs and finished normally.
# python scripts/extract_align_indexes.py \
> -i1 './input_feature/*.npy' \
> -i2 './target_feature/*.npy' \
> -o './aligned_indexes/'
{'dtype': 'int32',
'enable_overwrite': False,
'ignore_feature': ('feature1', 'feature2'),
'input_glob1': './input_feature/*.npy',
'input_glob2': './target_feature/*.npy',
'output': PosixPath('aligned_indexes')}
100%|██████████████████████████████████████████████████████| 23398/23398 [01:25<00:00, 273.90it/s]
After it finished, checking the output folder showed .npy files stored here as well.
# tree aligned_indexes/
aligned_indexes/
├── arguments.json
├── common_voice_ja_19482480.npy
├── common_voice_ja_19482491.npy
├── common_voice_ja_19482498.npy
├── …
├── common_voice_ja_27446518.npy
├── common_voice_ja_27446519.npy
└── common_voice_ja_27446520.npy
0 directories, 23399 files
According to the blog that repository maintainer Hiroshiba-san wrote separately, what is stored here are time-adjusted versions of both input_wav and target_wav.
Inspecting the code of scripts/extract_align_indexes.py, it stores index information under the name align_indexes.
As the implementation, Hiroshiba-san appears to use code carried over from the nnmnkwii repository, and the core processing uses fastdtw, a distance measure between time-series sequences.
The following article was a useful reference on fastdtw:
https://irukanobox.blogspot.com/2020/07/dtw.html
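As a reference for what fastdtw computes, here is a plain (exact, non-approximate) dynamic-time-warping implementation in pure Python. fastdtw returns an approximation of the same minimal-cost alignment path in roughly linear time; this toy version only illustrates what gets aligned:

```python
def dtw_path(a, b):
    """Exact O(n*m) dynamic-time-warping alignment between two 1-D sequences."""
    n, m = len(a), len(b)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # backtrack from the end to recover the aligned index pairs
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(moves)  # cheapest predecessor, diagonal preferred on ties
    return path[::-1]

# a toy "feature contour" and a delayed copy of it
a = [0.0, 1.0, 2.0, 1.0, 0.0]
b = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
path = dtw_path(a, b)  # pairs a's peak (index 2) with b's peak (index 3)
```

The align_indexes .npy files visualized below are, as far as I can tell, exactly such lists of index pairs, one per parallel utterance.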
Visualizing the extracted result, it contained index information like the following.
Based on this, the utterance timing and voiced intervals of input_wav and target_wav are presumably brought into alignment.
Below I paste the output for a few of the audio files.
As for how the audio gets aligned, it appears to be achieved by delaying whichever of input or target needs it.
(The blue line is input and the orange line is target.)
This is the last step of data preparation. Running the following commands computes the F0 (frequency) statistics.
python scripts/extract_f0_statistics.py \
-i './input_feature/*.npy' \
-o './input_statistics.npy'
python scripts/extract_f0_statistics.py \
-i './target_feature/*.npy' \
-o './target_statistics.npy'
Running the commands, they finished normally as follows.
# python scripts/extract_f0_statistics.py \
> -i './input_feature/*.npy' \
> -o './input_statistics.npy'
{'input_glob': './input_feature/*.npy',
'output': PosixPath('input_statistics.npy')}
100%|████████████████████████████████████████████████████| 23398/23398 [00:02<00:00, 10279.13it/s]
# python scripts/extract_f0_statistics.py \
> -i './target_feature/*.npy' \
> -o './target_statistics.npy'
{'input_glob': './target_feature/*.npy',
'output': PosixPath('target_statistics.npy')}
100%|████████████████████████████████████████████████████| 23398/23398 [00:01<00:00, 12385.32it/s]
The output files look like this:
# ls -l *_statistics.npy
-rw-r--r-- 1 root root 416 Dec 12 13:04 input_statistics.npy
-rw-r--r-- 1 root root 416 Dec 12 13:04 target_statistics.npy
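The README does not spell out what these statistics are for, but per-speaker F0 statistics are commonly used for log-linear F0 conversion: shift and scale log-F0 so the source speaker's pitch range maps onto the target's. A sketch under that assumption, with made-up F0 values (the real statistics come from the 23,398 extracted F0 tracks):

```python
import numpy as np

# hypothetical F0 tracks in Hz (0 marks unvoiced frames), one per speaker
input_f0 = np.array([0.0, 120.0, 130.0, 125.0, 0.0, 118.0])
target_f0 = np.array([0.0, 230.0, 0.0, 250.0, 240.0, 245.0])

def log_f0_stats(f0):
    """Mean and variance of log F0 over voiced frames only."""
    voiced = np.log(f0[f0 > 0])
    return voiced.mean(), voiced.var()

in_mean, in_var = log_f0_stats(input_f0)
tg_mean, tg_var = log_f0_stats(target_f0)

def convert_f0(f0_hz):
    """Map a source-speaker F0 value into the target speaker's range."""
    z = (np.log(f0_hz) - in_mean) / np.sqrt(in_var)
    return float(np.exp(z * np.sqrt(tg_var) + tg_mean))
```

With this mapping, the source speaker's average pitch lands exactly on the target's average pitch, which matches how voice_change.py later consumes input_statistics.npy and target_statistics.npy as a pair.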
Creating config.json: with the training data prepared by the steps so far, the next step is to create the training configuration file.
The training settings are expressed in the file sample_config.json. Roughly speaking, you only need to change input_glob, target_glob, and indexes_glob.
The contents of sample_config.json are as follows.
{
"dataset": {
"acoustic_param": {
"alpha": 0.410,
"dtype": "float32",
"f0_ceil": 800,
"f0_floor": 71,
"fft_length": 1024,
"frame_period": 5,
"order": 8,
"pad_second": 0,
"sampling_rate": 24000,
"threshold_db": 25
},
"input_glob": "./input_feature/*.npy",
"target_glob": "./target_feature/*.npy",
"indexes_glob": "./aligned_indexes/*.npy",
"in_features": [
"mc"
],
"out_features": [
"mc"
],
"train_crop_size": 512,
"input_global_noise": 0.01,
"input_local_noise": 0.01,
"target_global_noise": 0.01,
"target_local_noise": 0.01,
"seed": 0,
"num_test": 5
},
"model": {
"in_channels": 9,
"out_channels": 9,
"generator_base_channels": 8,
"generator_extensive_layers": 8,
"discriminator_base_channels": 1,
"discriminator_extensive_layers": 5,
"weak_discriminator": true
},
"loss": {
"adversarial": 0,
"mse": 100
},
"project": {
"name": "",
"tags": []
},
"train": {
"batchsize": 8,
"gpu": 0,
"log_iteration": 250,
"snapshot_iteration": 10000,
"stop_iteration": null,
"optimizer": {
"alpha": 0.0002,
"beta1": 0.5,
"beta2": 0.999,
"name": "Adam"
}
}
}
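Since only the three glob entries normally need to change, they can also be rewritten programmatically when your data lives elsewhere. A minimal sketch (the ./data/ prefix and the temporary file location are made up for illustration):

```python
import json
import os
import tempfile

# a minimal stand-in for sample_config.json (only the keys being edited)
config = {"dataset": {
    "input_glob": "./input_feature/*.npy",
    "target_glob": "./target_feature/*.npy",
    "indexes_glob": "./aligned_indexes/*.npy",
}}
path = os.path.join(tempfile.mkdtemp(), "sample_config.json")
with open(path, "w") as f:
    json.dump(config, f)

# rewrite the three globs to point at a hypothetical ./data/ location
with open(path) as f:
    cfg = json.load(f)
for key in ("input_glob", "target_glob", "indexes_glob"):
    cfg["dataset"][key] = cfg["dataset"][key].replace("./", "./data/", 1)
with open(path, "w") as f:
    json.dump(cfg, f, indent=2)
```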
Since I followed the steps as written, input_glob, target_glob, and indexes_glob need no changes, so I keep the file as-is.
At last, the next step is to run the training itself, with the following command.
python train.py \
sample_config.json \
./model_stage1/
Here, an error occurred because of cupy, whose installation had failed at the beginning.
# python train.py \
> sample_config.json \
> ./model_stage1/
Not found cupy.
Traceback (most recent call last):
File "train.py", line 35, in <module>
cuda.get_device_from_id(config.train.gpu).use()
File "/opt/conda/lib/python3.7/site-packages/chainer/backends/cuda.py", line 163, in get_device_from_id
check_cuda_available()
File "/opt/conda/lib/python3.7/site-packages/chainer/backends/cuda.py", line 93, in check_cuda_available
raise RuntimeError(msg)
RuntimeError: CUDA environment is not correctly set up
(see https://github.com/chainer/chainer#installation).No module named 'cupy'
From the link in the error log I reached the page below, and tried installing cupy with the command given there.
https://docs.cupy.dev/en/stable/install.html
pip install cupy-cuda112
Afterwards, running import cupy raised the following error.
ImportError: libnvrtc.so.11.2: cannot open shared object file: No such file or directory
I had run pip install cupy-cuda112 because nvidia-smi reported CUDA version 11.2, but according to the following article, that was a misunderstanding.
https://blog.mktia.com/get-cuda-and-cudnn-version/
nvidia-smi does not show the installed CUDA version itself; it merely shows the CUDA version the driver supports.
Instead, in my environment, the command /usr/local/cuda/bin/nvcc --version reports the CUDA toolkit version, which turned out to actually be 11.1.
# /usr/local/cuda/bin/nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Mon_Oct_12_20:09:46_PDT_2020
Cuda compilation tools, release 11.1, V11.1.105
Build cuda_11.1.TC455_06.29190527_0
So I redid the installation as follows.
pip uninstall cupy-cuda112
pip install cupy-cuda111
With that, the processing started running without errors. Note that cupy must be installed in order to use chainer, and training the yukarin repository without it appears infeasible. (My initial judgment above was wrong.) https://github.com/chainer/chainer#installation
Also, regarding installing cupy and cupy-cudaXXX: the configuration appears to be overwritten at install time, and if several are installed, the one installed last is what import cupy resolves to. In other words, they cannot coexist: you need to uninstall every cupy first, and only then, with no cupy present at all, install the cupy-cudaXXX you actually want.
Once training started, model files and logs began appearing under the folder passed as an argument to train.py.
(Training command, repeated)
python train.py \
sample_config.json \
./model_stage1/
▼
# ls model_stage1
cg.dot predictor_10000.npz predictor_50000.npz
config.json predictor_20000.npz predictor_60000.npz
events.out.tfevents.1639306034.7217833e7f1b predictor_30000.npz predictor_70000.npz
log predictor_40000.npz
Note that I could not find anywhere to specify a final epoch count; training does not end unless you stop it with Ctrl + C or the like.
With 23,398 data points and a batch size of 128, I let the training run for some 10-20 hours.
In the end it ran for about 20 hours, and trained models were saved at fixed iteration intervals.
yukarin# tree model_stage1
model_stage1
├── cg.dot
├── config.json
├── events.out.tfevents.1639306034.7217833e7f1b
├── log
├── predictor_10000.npz
├── predictor_100000.npz
├── predictor_110000.npz
├── predictor_120000.npz
├── predictor_130000.npz
├── predictor_140000.npz
├── predictor_150000.npz
├── predictor_160000.npz
├── predictor_170000.npz
├── predictor_180000.npz
├── predictor_190000.npz
├── predictor_20000.npz
├── predictor_200000.npz
├── predictor_210000.npz
├── predictor_220000.npz
├── predictor_230000.npz
├── predictor_240000.npz
├── predictor_250000.npz
├── predictor_260000.npz
├── predictor_270000.npz
├── predictor_280000.npz
├── predictor_290000.npz
├── predictor_30000.npz
├── predictor_300000.npz
├── predictor_310000.npz
├── predictor_320000.npz
├── predictor_330000.npz
├── predictor_340000.npz
├── predictor_350000.npz
├── predictor_360000.npz
├── predictor_370000.npz
├── predictor_380000.npz
├── predictor_390000.npz
├── predictor_40000.npz
├── predictor_400000.npz
├── predictor_410000.npz
├── predictor_420000.npz
├── predictor_430000.npz
├── predictor_440000.npz
├── predictor_450000.npz
├── predictor_460000.npz
├── predictor_470000.npz
├── predictor_480000.npz
├── predictor_490000.npz
├── predictor_50000.npz
├── predictor_500000.npz
├── predictor_510000.npz
├── predictor_520000.npz
├── predictor_530000.npz
├── predictor_540000.npz
├── predictor_550000.npz
├── predictor_560000.npz
├── predictor_570000.npz
├── predictor_580000.npz
├── predictor_590000.npz
├── predictor_60000.npz
├── predictor_600000.npz
├── predictor_610000.npz
├── predictor_620000.npz
├── predictor_630000.npz
├── predictor_640000.npz
├── predictor_650000.npz
├── predictor_660000.npz
├── predictor_670000.npz
├── predictor_680000.npz
├── predictor_690000.npz
├── predictor_70000.npz
├── predictor_700000.npz
├── predictor_710000.npz
├── predictor_720000.npz
├── predictor_730000.npz
├── predictor_740000.npz
├── predictor_750000.npz
├── predictor_760000.npz
├── predictor_770000.npz
├── predictor_780000.npz
├── predictor_790000.npz
├── predictor_80000.npz
├── predictor_800000.npz
├── predictor_810000.npz
├── predictor_820000.npz
├── predictor_830000.npz
├── predictor_840000.npz
└── predictor_90000.npz
0 directories, 88 files
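Because the snapshots sort lexicographically rather than numerically in the listing above (predictor_100000.npz before predictor_20000.npz), picking the newest model by eye is error-prone. A small helper that selects the snapshot with the highest iteration number, assuming the predictor_<iteration>.npz naming produced by train.py:

```python
import re

# file names as produced by train.py's snapshot saving (abbreviated sample)
files = ["predictor_10000.npz", "predictor_100000.npz", "predictor_840000.npz",
         "config.json", "log"]

def latest_predictor(names):
    """Pick the snapshot file with the highest iteration number, or None."""
    best, best_iter = None, -1
    for n in names:
        m = re.fullmatch(r"predictor_(\d+)\.npz", n)
        if m and int(m.group(1)) > best_iter:
            best, best_iter = n, int(m.group(1))
    return best

latest = latest_predictor(files)
```

On the real folder, listing with os.listdir('model_stage1') and passing the result in would select predictor_840000.npz here.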
The training loss is also recorded in log; its progression was as follows.
{
"predictor/mse": 0.35239657759666443,
"predictor/adversarial": 1.009699821472168,
"predictor/loss": 35.23966979980469,
"discriminator/real": 0.21418224275112152,
"discriminator/fake": 0.49350425601005554,
"discriminator/loss": 0.7076864838600159,
"discriminator/accuracy": 0.95191650390625,
"discriminator/precision": 0.9904609629602085,
"discriminator/recall": 0.9126904296875,
"test/predictor/mse": 0.4434267282485962,
"test/predictor/adversarial": 0.6913116574287415,
"test/predictor/loss": 44.342674255371094,
"test/discriminator/real": 0.7373173236846924,
"test/discriminator/fake": 0.7232625484466553,
"test/discriminator/loss": 1.4605798721313477,
"test/discriminator/accuracy": 0.44375,
"test/discriminator/precision": 0.125,
"test/discriminator/recall": 0.01875,
"train/predictor/mse": 0.2641863226890564,
"train/predictor/adversarial": 0.8317578434944153,
"train/predictor/loss": 26.41863250732422,
"train/discriminator/real": 0.7838372588157654,
"train/discriminator/fake": 0.6074777841567993,
"train/discriminator/loss": 1.39131498336792,
"train/discriminator/accuracy": 0.509375,
"train/discriminator/precision": 1.0,
"train/discriminator/recall": 0.01875,
"epoch": 5,
"iteration": 1000,
"elapsed_time": 81.01380289904773
},
▼
{
"predictor/mse": 0.3256767988204956,
"predictor/adversarial": 5.032227993011475,
"predictor/loss": 32.56768035888672,
"discriminator/real": 0.05071548372507095,
"discriminator/fake": 0.038136936724185944,
"discriminator/loss": 0.0888524278998375,
"discriminator/accuracy": 0.98839599609375,
"discriminator/precision": 0.9971560586514624,
"discriminator/recall": 0.9796044921875,
"test/predictor/mse": 0.35689324140548706,
"test/predictor/adversarial": 3.7905514240264893,
"test/predictor/loss": 35.68932342529297,
"test/discriminator/real": 2.556612491607666,
"test/discriminator/fake": 0.02566264010965824,
"test/discriminator/loss": 2.582275152206421,
"test/discriminator/accuracy": 0.509375,
"test/discriminator/precision": 1.0,
"test/discriminator/recall": 0.01875,
"train/predictor/mse": 0.2284717708826065,
"train/predictor/adversarial": 3.8663880825042725,
"train/predictor/loss": 22.847177505493164,
"train/discriminator/real": 2.8593335151672363,
"train/discriminator/fake": 0.02544046752154827,
"train/discriminator/loss": 2.8847739696502686,
"train/discriminator/accuracy": 0.503125,
"train/discriminator/precision": 1.0,
"train/discriminator/recall": 0.00625,
"epoch": 54,
"iteration": 10000,
"elapsed_time": 798.8362969011068
},
▼
{
"predictor/mse": 0.3052104711532593,
"predictor/adversarial": 6.364946365356445,
"predictor/loss": 30.521047592163086,
"discriminator/real": 0.003857325529679656,
"discriminator/fake": 0.00210120202973485,
"discriminator/loss": 0.005958528723567724,
"discriminator/accuracy": 0.99937744140625,
"discriminator/precision": 0.9999804520455513,
"discriminator/recall": 0.9987744140625,
"test/predictor/mse": 0.3660505414009094,
"test/predictor/adversarial": 1.631148006708827e-05,
"test/predictor/loss": 36.60505294799805,
"test/discriminator/real": 9.940130257746205e-05,
"test/discriminator/fake": 13.819330215454102,
"test/discriminator/loss": 13.819429397583008,
"test/discriminator/accuracy": 0.5,
"test/discriminator/precision": 0.5,
"test/discriminator/recall": 1.0,
"train/predictor/mse": 0.22484809160232544,
"train/predictor/adversarial": 1.64138382388046e-05,
"train/predictor/loss": 22.48480987548828,
"train/discriminator/real": 5.373924068408087e-05,
"train/discriminator/fake": 13.840730667114258,
"train/discriminator/loss": 13.840784072875977,
"train/discriminator/accuracy": 0.5,
"train/discriminator/precision": 0.5,
"train/discriminator/recall": 1.0,
"epoch": 547,
"iteration": 100000,
"elapsed_time": 8113.639713731012
},
▼
{
"predictor/mse": 0.30366745591163635,
"predictor/adversarial": 7.030780792236328,
"predictor/loss": 30.366737365722656,
"discriminator/real": 0.010039892978966236,
"discriminator/fake": 0.0017708293162286282,
"discriminator/loss": 0.011810722760856152,
"discriminator/accuracy": 0.9988623046875,
"discriminator/precision": 0.999941303506524,
"discriminator/recall": 0.997783203125,
"test/predictor/mse": 0.35088080167770386,
"test/predictor/adversarial": 6.8735448621737305e-06,
"test/predictor/loss": 35.08808135986328,
"test/discriminator/real": 0.0005439310916699469,
"test/discriminator/fake": 16.6959228515625,
"test/discriminator/loss": 16.69646644592285,
"test/discriminator/accuracy": 0.5,
"test/discriminator/precision": 0.5,
"test/discriminator/recall": 1.0,
"train/predictor/mse": 0.24123618006706238,
"train/predictor/adversarial": 6.821536317147547e-06,
"train/predictor/loss": 24.12361717224121,
"train/discriminator/real": 0.0016821377212181687,
"train/discriminator/fake": 16.701169967651367,
"train/discriminator/loss": 16.702852249145508,
"train/discriminator/accuracy": 0.5,
"train/discriminator/precision": 0.5,
"train/discriminator/recall": 1.0,
"epoch": 1094,
"iteration": 200000,
"elapsed_time": 16456.76111229905
},
▼
{
"predictor/mse": 0.2946970760822296,
"predictor/adversarial": 8.057758331298828,
"predictor/loss": 29.469711303710938,
"discriminator/real": 0.002485891105607152,
"discriminator/fake": 0.0005037991795688868,
"discriminator/loss": 0.0029896902851760387,
"discriminator/accuracy": 0.99978515625,
"discriminator/precision": 0.9999804735172078,
"discriminator/recall": 0.99958984375,
"test/predictor/mse": 0.3218367099761963,
"test/predictor/adversarial": 2.8206122806295753e-06,
"test/predictor/loss": 32.18367004394531,
"test/discriminator/real": 1.4969022004152066e-06,
"test/discriminator/fake": 18.49074363708496,
"test/discriminator/loss": 18.490745544433594,
"test/discriminator/accuracy": 0.5,
"test/discriminator/precision": 0.5,
"test/discriminator/recall": 1.0,
"train/predictor/mse": 0.22985798120498657,
"train/predictor/adversarial": 2.979112196044298e-06,
"train/predictor/loss": 22.985797882080078,
"train/discriminator/real": 6.583236972801387e-05,
"train/discriminator/fake": 18.465129852294922,
"train/discriminator/loss": 18.46519660949707,
"train/discriminator/accuracy": 0.5,
"train/discriminator/precision": 0.5,
"train/discriminator/recall": 1.0,
"epoch": 2188,
"iteration": 400000,
"elapsed_time": 33794.79855498602
},
▼
{
"predictor/mse": 0.2946617007255554,
"predictor/adversarial": 8.973788261413574,
"predictor/loss": 29.466167449951172,
"discriminator/real": 0.0021689562126994133,
"discriminator/fake": 0.00024469412164762616,
"discriminator/loss": 0.002413650043308735,
"discriminator/accuracy": 0.999833984375,
"discriminator/precision": 0.9999902200488998,
"discriminator/recall": 0.999677734375,
"test/predictor/mse": 0.33563244342803955,
"test/predictor/adversarial": 9.00166441386574e-10,
"test/predictor/loss": 33.5632438659668,
"test/discriminator/real": 2.0852203519439172e-08,
"test/discriminator/fake": 30.933984756469727,
"test/discriminator/loss": 30.933984756469727,
"test/discriminator/accuracy": 0.5,
"test/discriminator/precision": 0.5,
"test/discriminator/recall": 1.0,
"train/predictor/mse": 0.26031461358070374,
"train/predictor/adversarial": 7.850932681741085e-10,
"train/predictor/loss": 26.031461715698242,
"train/discriminator/real": 1.5840148748225147e-08,
"train/discriminator/fake": 31.085153579711914,
"train/discriminator/loss": 31.085153579711914,
"train/discriminator/accuracy": 0.5,
"train/discriminator/precision": 0.5,
"train/discriminator/recall": 1.0,
"epoch": 4596,
"iteration": 840000,
"elapsed_time": 74940.43543752504
},
Next, I ran a test. First, I checked whether the training data itself could be converted properly. Note that the training data is quite large: 23,398 files each for the input and target voices (23,398 Ã 2).
The command to run the test on the training data is as follows.
python scripts/voice_change.py \
--model_dir './model_stage1' \
--config_path './model_stage1/config.json' \
--input_statistics 'input_statistics.npy' \
--target_statistics 'target_statistics.npy' \
--output_sampling_rate 24000 \
--disable_dataset_test \
--test_wave_dir './input_wav/' \
--output_dir './output/'
Running it produced the following error.
Traceback (most recent call last):
File "scripts/voice_change.py", line 11, in <module>
from yukarin import AcousticConverter
File "/docker/ax/20211128_yukarin/yukarin/__init__.py", line 1, in <module>
from .acoustic_converter import AcousticConverter
File "/docker/ax/20211128_yukarin/yukarin/acoustic_converter.py", line 14, in <module>
from yukarin.dataset import decode_feature
File "/docker/ax/20211128_yukarin/yukarin/dataset.py", line 10, in <module>
from yukarin.align_indexes import AlignIndexes
File "/docker/ax/20211128_yukarin/yukarin/align_indexes.py", line 5, in <module>
from become_yukarin.dataset.utility import MelCepstrumAligner
ModuleNotFoundError: No module named 'become_yukarin'
This module is listed in requirements.txt but had not been installed, so I installed it with the command below.
pip install git+https://github.com/Hiroshiba/become-yukarin
I ran the earlier test command again, and this time the following errors were logged.
Loaded acoustic converter model "model_stage1/predictor_840000.npz"
Traceback (most recent call last):
File "scripts/voice_change.py", line 67, in process
p_in = Path(glob.glob(str(dataset_wave_dir / p_in.stem) + '.*')[0])
TypeError: unsupported operand type(s) for /: 'NoneType' and 'str'
Traceback (most recent call last):
File "scripts/voice_change.py", line 75, in process
f_in_effective, effective = acoustic_converter.separate_effective(wave=w_in, feature=f_in, threshold=threshold)
File "/docker/ax/20211128_yukarin/yukarin/acoustic_converter.py", line 92, in separate_effective
feature = feature.indexing(effective)
File "/docker/ax/20211128_yukarin/yukarin/acoustic_feature.py", line 86, in indexing
f0=self.f0[index] if _is_target(self.f0) else numpy.nan,
IndexError: boolean index did not match indexed array along dimension 0; dimension is 922 but corresponding boolean dimension is 921
Traceback (most recent call last):
File "scripts/voice_change.py", line 75, in process
f_in_effective, effective = acoustic_converter.separate_effective(wave=w_in, feature=f_in, threshold=threshold)
File "/docker/ax/20211128_yukarin/yukarin/acoustic_converter.py", line 92, in separate_effective
feature = feature.indexing(effective)
File "/docker/ax/20211128_yukarin/yukarin/acoustic_feature.py", line 86, in indexing
f0=self.f0[index] if _is_target(self.f0) else numpy.nan,
IndexError: boolean index did not match indexed array along dimension 0; dimension is 692 but corresponding boolean dimension is 691
Traceback (most recent call last):
File "scripts/voice_change.py", line 75, in process
f_in_effective, effective = acoustic_converter.separate_effective(wave=w_in, feature=f_in, threshold=threshold)
File "/docker/ax/20211128_yukarin/yukarin/acoustic_converter.py", line 92, in separate_effective
feature = feature.indexing(effective)
File "/docker/ax/20211128_yukarin/yukarin/acoustic_feature.py", line 86, in indexing
f0=self.f0[index] if _is_target(self.f0) else numpy.nan,
IndexError: boolean index did not match indexed array along dimension 0; dimension is 1076 but corresponding boolean dimension is 1075
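Each of these IndexErrors is an off-by-one mismatch between the frame count of the extracted feature and the boolean voiced mask (e.g. dimension 922 vs 921). A minimal sketch of one possible workaround, trimming both arrays to the shorter length before indexing (the function and variable names here are hypothetical, not yukarin's actual code):

```python
import numpy as np

def index_with_mask(feature: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Trim feature and boolean mask to a common length, then index.

    Works around off-by-one length mismatches such as 922 frames vs
    a 921-element mask, which otherwise raise IndexError.
    """
    n = min(len(feature), len(mask))
    return feature[:n][mask[:n]]

# A 922-frame f0 track against a 921-frame mask no longer raises.
f0 = np.arange(922, dtype=float)
mask = np.zeros(921, dtype=bool)
mask[:10] = True
print(index_with_mask(f0, mask).shape)  # (10,)
```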
^C
(Interrupting with Ctrl-C here produced a long run of interleaved KeyboardInterrupt tracebacks from the multiprocessing workers, aborted inside pysptk's mcepalpha/mc2sp spectrogram decoding, the Chainer deconvolution forward pass, and extract_acoustic_feature; omitted for readability.)
Meanwhile, converted audio data had been written under the output folder.
It appears some files were converted successfully while others were not.
Listening to a few of the resulting files, they did sound like conversions of the training data, but the quality was poor.
Above all, in most cases the content of the read text could not be made out. Even when it could, the speech had an unnatural, robot-like cadence.
Note that the training data consists of 23,398 files; according to the download source, there appear to be 397 speakers. If conversion fails at this scale, it may mean the data is unsuitable for any-to-one conversion.
Or rather, re-examining the data, it contains Japanese spoken with a foreign accent, misread sentences, mumbled speech that is hard to catch, recordings with background noise, silent files, and other samples that make training difficult.
So, as an experiment, I decided to narrow the data down to only the good-quality recordings and train again.
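As a rough illustration of the kind of automated screening that can assist here, one might flag near-silent clips by RMS level before listening manually. This is only a sketch with an arbitrary threshold, not the selection procedure actually used:

```python
import numpy as np

def rms(signal: np.ndarray) -> float:
    """Root-mean-square level of a mono float signal in [-1, 1]."""
    return float(np.sqrt(np.mean(np.square(signal))))

def looks_usable(signal: np.ndarray, min_rms: float = 0.01) -> bool:
    """Heuristic: reject clips whose overall level is close to silence.

    min_rms is an arbitrary threshold chosen for illustration only.
    """
    return rms(signal) >= min_rms

# A silent clip is rejected; a modest sine tone passes.
t = np.linspace(0, 1, 24000, endpoint=False)  # 1 s at 24 kHz
silence = np.zeros_like(t)
tone = 0.1 * np.sin(2 * np.pi * 440 * t)
print(looks_usable(silence), looks_usable(tone))  # False True
```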
I also noticed that the intonation of the audio generated with VoiceVox varies considerably from part to part. Intonation can be adjusted in the VoiceVox app, but that is hard to do when generating in bulk from the command line, so I set this point aside and proceeded with training.
After selecting the training data, I had removed roughly 18,000 files, leaving 5,894. With this data I redid the preprocessing and ran training again.
python scripts/extract_acoustic_feature.py \
-i './input_wav/*' \
-o './input_feature/'
python scripts/extract_acoustic_feature.py \
-i './target_wav/*' \
-o './target_feature/'
python scripts/extract_align_indexes.py \
-i1 './input_feature/*.npy' \
-i2 './target_feature/*.npy' \
-o './aligned_indexes/'
python scripts/extract_f0_statistics.py \
-i './input_feature/*.npy' \
-o './input_statistics.npy'
python scripts/extract_f0_statistics.py \
-i './target_feature/*.npy' \
-o './target_statistics.npy'
python train.py \
sample_config.json \
./model_stage1/
After about 13 hours of training on the selected data, model files were generated.
yukarin# tree model_stage1/
model_stage1/
âââ cg.dot
âââ config.json
âââ events.out.tfevents.1640679958.9e39f0c75923
âââ log
âââ predictor_10000.npz
âââ predictor_100000.npz
âââ predictor_110000.npz
âââ predictor_120000.npz
âââ predictor_130000.npz
âââ predictor_140000.npz
âââ predictor_150000.npz
âââ predictor_160000.npz
âââ predictor_170000.npz
âââ predictor_180000.npz
âââ predictor_190000.npz
âââ predictor_20000.npz
âââ predictor_200000.npz
âââ predictor_210000.npz
âââ predictor_220000.npz
âââ predictor_230000.npz
âââ predictor_240000.npz
âââ predictor_250000.npz
âââ predictor_260000.npz
âââ predictor_270000.npz
âââ predictor_280000.npz
âââ predictor_290000.npz
âââ predictor_30000.npz
âââ predictor_300000.npz
âââ predictor_310000.npz
âââ predictor_320000.npz
âââ predictor_330000.npz
âââ predictor_340000.npz
âââ predictor_350000.npz
âââ predictor_360000.npz
âââ predictor_370000.npz
âââ predictor_380000.npz
âââ predictor_390000.npz
âââ predictor_40000.npz
âââ predictor_400000.npz
âââ predictor_410000.npz
âââ predictor_420000.npz
âââ predictor_430000.npz
âââ predictor_440000.npz
âââ predictor_450000.npz
âââ predictor_460000.npz
âââ predictor_470000.npz
âââ predictor_480000.npz
âââ predictor_490000.npz
âââ predictor_50000.npz
âââ predictor_500000.npz
âââ predictor_510000.npz
âââ predictor_520000.npz
âââ predictor_530000.npz
âââ predictor_540000.npz
âââ predictor_550000.npz
âââ predictor_560000.npz
âââ predictor_570000.npz
âââ predictor_580000.npz
âââ predictor_590000.npz
âââ predictor_60000.npz
âââ predictor_600000.npz
âââ predictor_610000.npz
âââ predictor_620000.npz
âââ predictor_630000.npz
âââ predictor_640000.npz
âââ predictor_70000.npz
âââ predictor_80000.npz
âââ predictor_90000.npz
0 directories, 68 files
The final log entry is below. It looks improved compared with the run before data selection.
{
"predictor/mse": 0.26048386096954346,
"predictor/adversarial": 34.359676361083984,
"predictor/loss": 26.048383712768555,
"discriminator/real": 0.006752286572009325,
"discriminator/fake": 0.016222849488258362,
"discriminator/loss": 0.02297513745725155,
"discriminator/accuracy": 0.9975341796875,
"discriminator/precision": 0.9963241332156757,
"discriminator/recall": 0.9987548828125,
"test/predictor/mse": 0.3014129102230072,
"test/predictor/adversarial": 0.0002113436785293743,
"test/predictor/loss": 30.14129066467285,
"test/discriminator/real": 9.132438572123647e-05,
"test/discriminator/fake": 8.595941543579102,
"test/discriminator/loss": 8.596033096313477,
"test/discriminator/accuracy": 0.5,
"test/discriminator/precision": 0.5,
"test/discriminator/recall": 1.0,
"train/predictor/mse": 0.20771053433418274,
"train/predictor/adversarial": 0.00020970181503798813,
"train/predictor/loss": 20.771053314208984,
"train/discriminator/real": 0.0001445886300643906,
"train/discriminator/fake": 8.599419593811035,
"train/discriminator/loss": 8.599564552307129,
"train/discriminator/accuracy": 0.5,
"train/discriminator/precision": 0.5,
"train/discriminator/recall": 1.0,
"epoch": 13950,
"iteration": 641850,
"elapsed_time": 54962.55319662788
}
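Chainer's LogReport writes this log as a JSON array of snapshots, so the trend can be extracted programmatically. A small sketch, using key names taken from the entries quoted here:

```python
import json

def mse_trend(log_text: str) -> list:
    """Extract (iteration, test/predictor/mse) pairs from a Chainer log."""
    entries = json.loads(log_text)
    return [(e["iteration"], e["test/predictor/mse"]) for e in entries]

# Two abbreviated snapshots in the same format as model_stage1/log.
sample = json.dumps([
    {"iteration": 200000, "test/predictor/mse": 0.3218},
    {"iteration": 641850, "test/predictor/mse": 0.3014},
])
print(mse_trend(sample))  # [(200000, 0.3218), (641850, 0.3014)]
```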
Using the newly generated model, I reran the test against the training data.
python scripts/voice_change.py \
--model_dir './model_stage1' \
--config_path './model_stage1/config.json' \
--input_statistics 'input_statistics.npy' \
--target_statistics 'target_statistics.npy' \
--output_sampling_rate 24000 \
--disable_dataset_test \
--test_wave_dir './input_wav/' \
--output_dir './output/'
The results gave the impression of an improvement over before, but the spoken content was still mostly unintelligible. Specifically, it sounded considerably worse than the "conversion result with the baseline method" in the blog post below. https://blog.hiroshiba.jp/voice-conversion-deep-leanring-and-other-delusions/
That said, it felt fairly close to the results posted in the same blog post, or perhaps slightly worse than those. https://blog.hiroshiba.jp/voice-conversion-deep-leanring-and-other-delusions/
So I took a look at the training configuration, sample_config.json.
In fact, GPU memory usage during yukarin training stayed under 1 GB.
Possibly the network was kept lightweight with real-time processing in mind.
Looking at the config contents again, a few points stood out.
First, among the loss weights, mse is set to 100 and adversarial to 0.
The rationale is described in the issues below: the aim is to improve quality with losses other than mse, but there is also a note that setting adversarial to 1 is fine, so I decided to try that.
https://github.com/Hiroshiba/yukarin/issues/46
https://github.com/Hiroshiba/yukarin/issues/45
Next, I checked the stop_iteration setting. Setting it appears to fix when training ends, so I set it to around 500,000.
I also confirmed that Adam is specified as the optimizer. I wanted to try RAdam, but Chainer does not seem to have it, so I tried AdaBound instead. However, the Chainer version is old and AdaBound raised AttributeError: module 'chainer.optimizers' has no attribute 'AdaBound', so I gave up on this.
For batchsize, I decided to try 128.
There is also a crop_size item. Judging from the code below, it specifies the window length used when cropping the one-dimensional audio data.
Given that the data is sampled at 24,000 Hz, a somewhat longer window seemed better, so I changed it from the default 512 to 2048.
start = random.randint(len_time - crop_size + 1)
return numpy.split(data, [start, start + crop_size], axis=time_axis)[1]
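For reference, the quoted lines appear to pick a random start offset and keep a crop_size-long window. A self-contained sketch of the same operation (assuming the single-argument randint call refers to numpy.random.randint):

```python
import numpy as np

def random_crop(data: np.ndarray, crop_size: int, time_axis: int = 0) -> np.ndarray:
    """Return a random window of length crop_size along time_axis."""
    len_time = data.shape[time_axis]
    # numpy.random.randint(n) draws from [0, n), so every valid start
    # position (including len_time - crop_size itself) is reachable.
    start = np.random.randint(len_time - crop_size + 1)
    return np.split(data, [start, start + crop_size], axis=time_axis)[1]

wave = np.arange(24000, dtype=np.float32)  # 1 s of dummy 24 kHz samples
crop = random_crop(wave, crop_size=2048)
print(crop.shape)  # (2048,)
```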
Finally, the network architecture itself: the filter settings appear to have been carefully tuned in the issues above, so I left them unchanged.
To make the network richer, generator_base_channels and discriminator_base_channels looked like the knobs to turn, so I multiplied both by 16:
generator_base_channels: 8 â 128
discriminator_base_channels: 1 â 16
With the above config changes, I ran training again.
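The config edits above can also be applied mechanically. This sketch assumes, purely for illustration, that the touched keys sit at the top level of sample_config.json; the real file may nest them differently:

```python
import json
from pathlib import Path

# Hypothetical flat view of the fields touched in this experiment;
# yukarin's actual sample_config.json may organize them differently.
changes = {
    "stop_iteration": 500000,
    "batchsize": 128,
    "crop_size": 2048,                  # was 512
    "generator_base_channels": 128,     # was 8
    "discriminator_base_channels": 16,  # was 1
}

def apply_changes(path: Path, changes: dict) -> dict:
    """Load a JSON config, overwrite the given keys, and save it back."""
    config = json.loads(path.read_text())
    config.update(changes)
    path.write_text(json.dumps(config, indent=2))
    return config

# Round-trip demo on a throwaway file.
p = Path("demo_config.json")
p.write_text(json.dumps({"batchsize": 8, "crop_size": 512}))
updated = apply_changes(p, changes)
print(updated["crop_size"], updated["generator_base_channels"])  # 2048 128
```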
With stop_iteration set, a progress bar showing training progress is now displayed.
Being able to estimate the completion time is a welcome improvement.
yukarin# python train.py \
> sample_config.json \
> ./model_stage1/
total [#.................................................] 2.52%
this epoch [###########################################.......] 86.65%
12600 iter, 273 epoch / 500000 iterations
1.5014 iters/sec. Estimated time to finish: 3 days, 18:10:36.084940.
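The ETA shown by the progress bar follows directly from the remaining iterations and the measured throughput:

```python
import datetime

stop_iteration = 500000
done = 12600
speed = 1.5014  # iters/sec, from the progress bar above

remaining_seconds = (stop_iteration - done) / speed
eta = datetime.timedelta(seconds=remaining_seconds)
print(eta)  # roughly 3 days, 18 hours
```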
GPU memory usage rose to about 13 GB.
Given the batch size, that is not unreasonable compared with recent models, so I decided to keep working with this configuration for the time being.
After about 4 hours of training, I ran a test against the training data, and the output became comparatively intelligible.

Next, as test data I took recordings that had been excluded from the training set but whose source quality was fine, and ran the same test; these also became comparatively intelligible.

Note that this test data is from the same speaker as the training data; only the utterances themselves were not included in training.
Since making the network richer promises further quality improvement, I will now tune the network parameters a little more. Specifically:

- generator_base_channels: 128 → 256
- discriminator_base_channels: 16 → 32

In line with this, batchsize is reduced to 64.
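Halving the batchsize is consistent with the rough scaling of convolution parameters: doubling both input and output channel counts roughly quadruples each layer's weights, and activation memory grows similarly. A back-of-the-envelope check, assuming plain 1-D convolutions (the kernel size is an arbitrary stand-in, not yukarin's actual value):

```python
def conv1d_params(in_ch, out_ch, kernel=3):
    """Weight + bias count of a single 1-D convolution layer."""
    return in_ch * out_ch * kernel + out_ch

before = conv1d_params(128, 128)
after = conv1d_params(256, 256)
print(after / before)  # ~4x per layer
```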
Also, according to the blog, raising the adversarial loss weight too far loses the speaker's characteristics, but so far that has not been my impression. The blog further notes that raising the adversarial loss weight improves the quality of the spoken content, so I will try setting adversarial to 2.
I ran training overnight with the settings above, then performed voice conversion on the test data with the resulting model. The results were not good: no progress beyond the earlier, comparatively intelligible output.
After that, I tuned several more parameters with reference to the issues and blog posts, but at best things improved slightly; nothing got past the "comparatively intelligible" level. The speaker conversion itself works properly, but the spoken content is often hard to make out. Strictly speaking, utterances whose content is unintelligible outnumber the intelligible ones.
ããŽããããĒã¨ã¯čãåããĨããéŗåŖ°ããyukarinãŽįŦŦ2æŽĩéã§ããŦã¤ãĢãĒãå¯čŊæ§ã¯ããããã¨æããŽã§ãããįžæįšã§ãå 厚ãčãåããĨããããŽãä¸čģĸčãåãããããĒãã¨ãããŽã¯ãåäēēįãĢã¯éŖæåēĻãéĢããŽã§ã¯ãĒããã¨čããæŦĄįŦŦã§ãã
Addressing this will probably require something fairly fundamental, such as changing the network structure or reworking the training method. If comparatively simple remedies come to mind, I would like to try those first, one by one.

(Follow-up work on hold for now...)
While waiting on yukarin's training jobs and data transfers, I looked into recent trends in related research.

Recent papers seem to focus especially on any-to-any conversion. What I am attempting here is any-to-one. Since any-to-any is a generalization of any-to-one, achieving high accuracy there should fundamentally be harder.

Links to survey results are scattered through the earlier comments in this issue; among them, the demo results most accurate for what I am attempting were the following.
- any-to-any: MediumVC
- any-to-one: SingleVC
The two repositories above are explained together in the single paper below: https://arxiv.org/pdf/2110.02500.pdf

According to the paper, MediumVC is run after SingleVC. That is, by first performing an any-to-one conversion and then a one-to-any conversion, any-to-any conversion is realized as the end result. The paper refers to the intermediate "one" as "specific speaker speeches as the intermedium features (SSIF)".
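The two-stage scheme can be pictured as simple function composition: SingleVC maps any source speaker onto the one fixed intermediate speaker (SSIF), and the second stage maps that fixed speaker to an arbitrary target. A schematic sketch with stub converters (the stubs just tag the utterance string; the real stages are neural networks):

```python
def any_to_one(utterance):
    """SingleVC stage: any source speaker -> the fixed intermediate speaker."""
    return f"SSIF({utterance})"

def one_to_any(utterance, target_speaker):
    """Second stage: the fixed intermediate speaker -> arbitrary target."""
    return f"{target_speaker}<-{utterance}"

def any_to_any(utterance, target_speaker):
    """Composing the two stages realizes any-to-any conversion."""
    return one_to_any(any_to_one(utterance), target_speaker)

print(any_to_any("hello", "yukari"))  # yukari<-SSIF(hello)
```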
For this project, the immediate goal is to realize any-to-one, so if SingleVC is sufficiently accurate, it serves the purpose. As for SingleVC, the dataset also used for the yukarin training above,
Mozilla's published audio dataset (it contains Japanese texts and their read-aloud audio): https://commonvoice.mozilla.org/ja/datasets
(continued...)
Setup memo (VCTK corpus, pitch adjustment, etc.):
http://www.udialogue.org/ja/download-ja/cstr-vctk-corpus.html
https://akio-blogger.blogspot.com/2018/01/dockerubuntusndfile.html?m=1

```
pip install pyrubberband
apt-get install libsndfile1
pip install transformers
apt-get update -y
apt-get install -y rubberband-cli
```

GPU selection: `os.environ['CUDA_VISIBLE_DEVICES'] = '0'`
(For reference) The following Slack exchange is the context up to this point: https://axincai.slack.com/archives/C019HCVQBCP/p1641213162122700
I am considering using the JVS corpus dataset, which Hiroshiba-san uses, as training data. Since Hiroshiba-san achieves high-accuracy voice conversion, the quality of this data also seems promising.

Listening to the sample audio, it sounds considerably clearer than Mozilla's Common Voice dataset. Moreover, the texts read aloud are the same as Common Voice's; both appear to conform to the JSUT corpus. In other words, there is no need to newly create source-speaker data, so experimentation should be efficient.
I also plan to look into the approach of speech recognition -> speech synthesis.
https://github.com/Hiroshiba/realtime-yukarin mit voice conversion