andrewowens / multisensory

Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
http://andrewowens.com/multisensory/
Apache License 2.0

question about sourcesep training result on new dataset #21

Open xiaoyiming opened 5 years ago

xiaoyiming commented 5 years ago
I tried to train sourcesep.py on a new dataset. The dataset contains 12,000 videos, and I trained for about 2000 iterations. The training results are as follows:

Iteration 0, lr = 1e-04, total:gen: 1.038 gen:reg: 0.155 diff-fg: 0.556 phase-fg: 0.006 diff-bg: 0.316 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 105.432
Iteration 1, lr = 1e-04, total:gen: 1.037 gen:reg: 0.155 diff-fg: 0.555 phase-fg: 0.006 diff-bg: 0.315 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 104.403
Iteration 2, lr = 1e-04, total:gen: 1.036 gen:reg: 0.155 diff-fg: 0.555 phase-fg: 0.006 diff-bg: 0.315 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 103.402
Iteration 3, lr = 1e-04, total:gen: 1.035 gen:reg: 0.155 diff-fg: 0.554 phase-fg: 0.006 diff-bg: 0.314 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 102.648
Iteration 4, lr = 1e-04, total:gen: 1.033 gen:reg: 0.155 diff-fg: 0.553 phase-fg: 0.006 diff-bg: 0.313 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 101.729
Iteration 5, lr = 1e-04, total:gen: 1.030 gen:reg: 0.155 diff-fg: 0.551 phase-fg: 0.006 diff-bg: 0.312 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 100.953
Iteration 6, lr = 1e-04, total:gen: 1.028 gen:reg: 0.155 diff-fg: 0.550 phase-fg: 0.006 diff-bg: 0.311 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 100.038
Iteration 7, lr = 1e-04, total:gen: 1.024 gen:reg: 0.155 diff-fg: 0.547 phase-fg: 0.006 diff-bg: 0.310 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 99.307
Iteration 8, lr = 1e-04, total:gen: 1.021 gen:reg: 0.155 diff-fg: 0.545 phase-fg: 0.006 diff-bg: 0.309 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 98.419
Iteration 9, lr = 1e-04, total:gen: 1.017 gen:reg: 0.155 diff-fg: 0.542 phase-fg: 0.006 diff-bg: 0.308 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 97.764
Iteration 10, lr = 1e-04, total:gen: 1.013 gen:reg: 0.155 diff-fg: 0.539 phase-fg: 0.006 diff-bg: 0.307 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 96.905
Iteration 20, lr = 1e-04, total:gen: 0.967 gen:reg: 0.155 diff-fg: 0.507 phase-fg: 0.006 diff-bg: 0.294 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 89.464
Iteration 30, lr = 1e-04, total:gen: 0.922 gen:reg: 0.154 diff-fg: 0.475 phase-fg: 0.006 diff-bg: 0.281 phase-bg: 0.005 total:discrim: 0.000 discrim:reg: 0.000, time: 82.757
Iteration 40, lr = 1e-04, total:gen: 0.877 gen:reg: 0.153 diff-fg: 0.444 phase-fg: 0.006 diff-bg: 0.268 phase-bg: 0.005 total:discrim: 0.000 discrim:reg: 0.000,
.....
Iteration 1800, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.358
Iteration 1810, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.319
Iteration 1820, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.505
Iteration 1830, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.447
Iteration 1840, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.346
Iteration 1850, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.312
Iteration 1860, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.097 phase-fg: 0.004 diff-bg: 0.097 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.403
Iteration 1870, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.097 phase-fg: 0.004 diff-bg: 0.097 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.404
Iteration 1880, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.202
Iteration 1890, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.469
Iteration 1900, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 21.318
Iteration 1910, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 21.159
Iteration 1920, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 21.241
Iteration 1930, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 21.028
Iteration 1940, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.877
Iteration 1950, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.739
Iteration 1960, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.555
Iteration 1970, lr = 1e-04, total:gen: 0.204 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.00
Iteration 1980, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.283
Iteration 1990, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.177
Checkpoint: /home/zhang/xiao/multisensory-master/data/traing/sep_2s_test/net.tf-2000
Iteration 2000, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 19.973
Iteration 2010, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 19.958
Iteration 2020, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 19.821
Iteration 2030, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 19.806
Iteration 2040, lr = 1e-04, total:gen: 0.204 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 19.881

As the log shows, the training loss decreases. However, when I use the trained model to separate a video with sep_video.py, I only get noise. Could you give me some advice?

xuanhanyu commented 5 years ago

After reading the comments above, I noticed that the author said the I/O code needs to be rewritten. If I rewrite the I/O code, should I read the video and audio data separately and then feed them to the two branches of the network, or should I convert the data to TF format? What details should I pay attention to when rewriting the I/O code? I look forward to your reply. Thank you very much!

YiyuLuo commented 5 years ago

Hi, I'm wondering whether you have fixed the problem. I've run into the same issue.

xuanhanyu commented 5 years ago

I am sorry.


andrewowens commented 4 years ago

It's hard to answer this without knowing more (if you're still having this problem). Did you train on VoxCeleb? Is the output really random noise, or just incorrect?

As for the I/O: you can do it either way (with a TFRecord or reading the audio and video through some other process). The code just expects a batch of audio-visual pairs.
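
As a rough illustration of "a batch of audio-visual pairs", the sketch below groups per-clip (frames, audio) tuples into aligned batches with a plain Python generator. The names `batches`, `load_clip` and `fake_load_clip`, and the clip layout, are hypothetical and not taken from the repository. The input shapes, frame rate and sample rate that sourcesep.py expects are not stated in this thread, so they would need to be checked in the code.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

# Hypothetical clip type: (video_frames, audio_samples). The real code may
# use arrays with a fixed shape; plain lists keep the sketch dependency-free.
Clip = Tuple[list, list]


def batches(paths: Sequence[str],
            load_clip: Callable[[str], Clip],
            batch_size: int) -> Iterator[Tuple[List[list], List[list]]]:
    """Yield (frames_batch, audio_batch) with the two lists kept aligned."""
    frames_batch, audio_batch = [], []
    for path in paths:
        frames, audio = load_clip(path)  # load_clip is user-supplied
        frames_batch.append(frames)
        audio_batch.append(audio)
        if len(frames_batch) == batch_size:
            yield frames_batch, audio_batch
            frames_batch, audio_batch = [], []
    # The incomplete final batch is dropped so every batch has the same size.


def fake_load_clip(path: str) -> Clip:
    """Stand-in loader for the example: 2 dummy frames, 4 dummy samples."""
    return [path + ":frame0", path + ":frame1"], [0.0, 0.1, 0.2, 0.3]


example = list(batches(["a.mp4", "b.mp4", "c.mp4"], fake_load_clip, batch_size=2))
```

Whether the clips come from a TFRecord or from another reader only changes what `load_clip` does; the important property is that entry i of the video batch and entry i of the audio batch come from the same clip.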

ruizewang commented 3 years ago

Hello, I'm not sure what these losses mean. Could you please explain them?
Iteration 2040, lr = 1e-04, total:gen: 0.204 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 19.881

andrewowens commented 3 years ago

Hi,

diff-fg = L1 loss of on-screen spectrogram magnitude
diff-bg = L1 loss of off-screen spectrogram magnitude
phase-{fg,bg} = same as above, but for the spectrogram phase
reg: weight decay
total:gen: sum of losses
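
To make the magnitude and phase terms concrete, here is a minimal sketch of an L1 loss on a complex spectrogram, written in plain Python. It is an illustration only: the function names `l1` and `spec_losses` are hypothetical, the spectrogram is flattened to a list of complex numbers, and any weighting, normalization or phase-wrapping handling used in the repository is not described in this thread and is not reproduced here.

```python
import cmath
from typing import Sequence, Tuple


def l1(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equal-length sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def spec_losses(pred: Sequence[complex],
                target: Sequence[complex]) -> Tuple[float, float]:
    """Return (magnitude L1, phase L1) for flattened complex spectrograms.

    Assumption: phase is compared directly in radians, ignoring wrap-around.
    """
    mag = l1([abs(z) for z in pred], [abs(z) for z in target])
    phase = l1([cmath.phase(z) for z in pred],
               [cmath.phase(z) for z in target])
    return mag, phase


# Toy example: the first bin has the wrong magnitude, both phases agree.
pred = [1 + 0j, 0 + 2j]
target = [2 + 0j, 0 + 2j]
diff, phase = spec_losses(pred, target)
```

In this toy case only the magnitude term is non-zero, which mirrors how diff-* and phase-* are reported separately in the log.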

You can ignore the "discrim" terms (they are for a GAN loss that isn't actually used in our paper).
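
One way to sanity-check a training log against this description is to parse a line and confirm that the generator terms add up to total:gen. The sketch below does this for the Iteration 2040 line quoted in this thread. The helper `parse_losses` and the regular expression are hypothetical and not part of the repository, and the tolerance only accounts for the values being printed to three decimals.

```python
import re

LINE = ("Iteration 2040, lr = 1e-04, total:gen: 0.204 gen:reg: 0.004 "
        "diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 "
        "total:discrim: 0.000 discrim:reg: 0.000, time: 19.881")


def parse_losses(line: str) -> dict:
    """Map each 'name: value' pair in a log line to a float."""
    pairs = re.findall(r"([A-Za-z:\-]+): (\d+\.\d+)", line)
    return {name: float(value) for name, value in pairs}


losses = parse_losses(LINE)

# Generator terms listed in the explanation above (discrim terms are ignored).
parts = ["gen:reg", "diff-fg", "phase-fg", "diff-bg", "phase-bg"]
component_sum = sum(losses[name] for name in parts)

# The printed values are rounded to 3 decimals, so allow a small tolerance.
matches_total = abs(component_sum - losses["total:gen"]) < 0.005
```

For this line the five terms sum to 0.204, which agrees with the printed total:gen.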

ruizewang commented 3 years ago

Thanks for your quick reply and kind help, @andrewowens 💯