dr-pato / audio_visual_speech_enhancement

Face Landmark-based Speaker-Independent Audio-Visual Speech Enhancement in Multi-Talker Environments
https://dr-pato.github.io/audio_visual_speech_enhancement/
Apache License 2.0

On the issue of model training #32

Open lzyhub opened 1 year ago

lzyhub commented 1 year ago

Here is my training log. I ran:

python av_speech_enhancement.py training --data_dir ./data3 --train_set TF/initial/TRAINING_SET --val_set TF/initial/VALIDATION_SET --exp 538 --mode fixed --audio_dim 257 --video_dim 136 --num_audio_samples 48000 --model av_concat_mask --opt adam --learning_rate 0.00001 --updating_step 100 --learning_decay 1.0 --batch_size 8 --epochs 10 --hidden_units 300 --layers 5 --dropout 0.5 --regularization 0

and obtained the following training log:

+-- EXPERIMENT NUMBER - 538 --+

optimizer: adam

number of hidden layers (other): 5

number of hidden units: 300

initial learning rate: 0.001000

regularization: 0.000000

dropout keep probability (no dropout if 1): 0.500000

training size: 96

validation size: 24

batch size: 8

approx number of steps: 120

approx number of steps per epoch: 12
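
For reference, the step counts reported in the log header follow directly from the dataset size, batch size, and epoch count:

```python
# Reproduce the step counts printed in the log header.
train_size = 96      # training size (from the log)
batch_size = 8       # --batch_size
epochs = 10          # --epochs

steps_per_epoch = train_size // batch_size   # 12, matches the log
total_steps = steps_per_epoch * epochs       # 120, matches the log
print(steps_per_epoch, total_steps)
```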

Epoch LR Train[Cost|L2-SPEC|SNR|SDR|SIR|SAR] Val[Cost|L2-SPEC|SNR|SDR|SIR|SAR]
0 [0.001000][31570019.88236|1709225.63460|-13.12816|0.18333|0.55682|17.89409] [1975891.51622|158995.56250|0.00015|-0.51246|-0.03308|14.50995]
1 [0.001000][2636708.38466|225585.16838|-0.00033|0.21334|0.56131|17.34965] [1924917.82028|154274.00000|-0.00150|1.40241|1.53057|19.73535]
2 [0.001000][2140906.74620|152231.50675|-0.12245|0.50816|0.61324|21.01079] [1127515.98006|89183.06250|-0.26976|2.36294|2.43035|23.03181]
3 [0.001000][1722309.84827|103804.95866|-0.22557|0.25222|0.29530|23.95723] [880734.01623|75290.49219|-0.20167|0.26640|0.31491|23.15888]
4 [0.001000][1519283.73804|87616.62866|-0.16919|0.23113|0.28726|22.74769] [777480.35918|71305.63281|-0.16929|0.54480|0.59922|22.95043]
5 [0.001000][1443420.57966|86095.03585|-0.17768|0.32173|0.36417|24.09551] [767667.54653|71888.60156|-0.18584|0.81632|0.85498|24.58647]
6 [0.001000][1400788.72074|85683.11458|-0.20549|0.32629|0.37772|23.44259] [727436.30309|68974.75781|-0.23568|1.59682|1.64908|23.65309]
7 [0.001000][1373982.10674|85067.37577|-0.21270|0.32271|0.36764|23.89874] [717339.86095|68491.68750|-0.26695|1.11602|1.17088|23.10323]
8 [0.001000][1328416.01693|81485.82133|-0.26452|0.32197|0.37998|22.72361] [708650.96137|68050.54688|-0.29006|0.68897|0.74409|22.81329]
9 [0.001000][1302718.35020|80514.23511|-0.26054|0.31579|0.36549|23.40774] [729637.53720|69496.58594|-0.21461|0.96021|1.00348|24.11726]
10 [0.000100][1281801.58735|82254.72701|-0.22952|0.33012|0.37483|23.95600] [708477.12235|68036.82812|-0.27129|1.03879|1.08591|23.76498]
11 [0.000100][1275627.79961|79284.57194|-0.29610|0.32261|0.36980|23.63214] [698638.96618|67444.65625|-0.32302|0.84844|0.89582|23.60907]
12 [0.000100][1267918.67399|79293.26168|-0.29410|0.33657|0.38365|23.64460] [701850.88089|67644.44531|-0.30031|0.90971|0.95642|23.71244]
13 [0.000100][1269136.25578|79197.69096|-0.29316|0.33315|0.37936|23.71920] [701300.80841|67630.75000|-0.30473|0.84122|0.88648|23.81616]
14 [0.000100][1260134.10445|79627.33919|-0.27986|0.33905|0.38384|23.87830] [703340.61619|67773.17969|-0.29527|0.81707|0.86083|23.96222]
15 [0.000100][1261865.75887|79141.21179|-0.29393|0.33404|0.37916|23.84493] [699860.65311|67549.57812|-0.30905|0.83365|0.87842|23.87263]
16 [0.000100][1252664.10120|79255.17277|-0.29193|0.33842|0.38439|23.75614] [699368.99738|67490.92969|-0.30399|0.93634|0.98177|23.86592]
17 [0.000100][1253516.99967|79154.08610|-0.29474|0.33794|0.38333|23.81782] [697506.26763|67394.03906|-0.31475|0.89868|0.94387|23.85658]
18 [0.000100][1246361.52609|79118.05432|-0.29539|0.34490|0.38981|23.91663] [696388.32105|67310.22656|-0.31471|0.94217|0.98734|23.88559]
19 [0.000100][1240519.73664|78834.66744|-0.30179|0.34777|0.39270|23.88205] [694920.96278|67218.67969|-0.32202|0.91532|0.95951|23.97451]
20 [0.000100][1246233.74705|78866.22001|-0.30074|0.34691|0.39048|24.02649] [694796.32193|67250.82031|-0.32852|0.84411|0.88889|23.84819]
21 [0.000100][1233846.03362|78229.35763|-0.31758|0.34438|0.38991|23.75011] [692615.70580|67100.71875|-0.33782|0.86155|0.90657|23.83718]
22 [0.000010][1233310.14358|78084.23238|-0.32119|0.34684|0.39194|23.81030] [692574.29280|67102.21094|-0.33903|0.84609|0.89084|23.85644]
23 [0.000010][1237074.54296|78170.68587|-0.31848|0.34786|0.39243|23.87702] [692993.30367|67117.94531|-0.33470|0.86234|0.90634|23.95278]
24 [0.000010][1232479.62149|78292.84021|-0.31485|0.34975|0.39376|23.95652] [693375.77422|67135.53906|-0.33133|0.86902|0.91236|24.03470]

After many epochs the loss is still very large (around one million) and the SDR values are poor, so the model cannot separate the speech. Are my training parameters incorrect? Is such a large loss normal, and how should I train the model? I would be very grateful for any help from the author.
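
As a sanity check, I tried to estimate the cost per time-frequency bin. This assumes the reported cost is an unnormalized sum of squared spectrogram errors over every bin in a batch (my assumption, not confirmed from the code), and the 160-sample STFT hop is also a guess:

```python
# Rough estimate of the validation cost per time-frequency bin.
# Assumptions: cost is an unnormalized sum over all T-F bins in a batch;
# the STFT hop of 160 samples (10 ms at 16 kHz) is a guess.
audio_dim = 257          # frequency bins (--audio_dim)
num_samples = 48000      # --num_audio_samples
hop = 160                # hypothetical STFT hop length
batch_size = 8           # --batch_size
val_cost = 693375.0      # final-epoch validation cost from the log

frames = num_samples // hop               # 300 frames per clip
bins = audio_dim * frames * batch_size    # 616800 bins per batch
print(val_cost / bins)                    # roughly 1.1 per bin
```

So under these assumptions the per-bin error is small even though the summed cost looks huge, but I am not sure whether this interpretation is right.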