andrewowens / multisensory

Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features
http://andrewowens.com/multisensory/
Apache License 2.0

Question about fine-tune for full sep model #3

Open LionnelBall opened 6 years ago

LionnelBall commented 6 years ago

Really nice job! I noticed that in the self-supervised shift model there is no gamma variable in slim.batch_norm for the conv layers (because 'bn_scale' is not set in shift_params.py), but in the full speech-separation model there is a gamma in the slim.batch_norm operation for each conv layer ('bn_scale = True' in sep_params.py). So how can the full model be fine-tuned from the shift model, given that the two models differ in whether gamma exists? And if the weights in the shift model and the corresponding weights in the full model are the same, does the fine-tuning make any sense?
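For context, here is a minimal sketch (not the repo's exact code) of how a bn_scale-style flag typically maps onto the `scale`/gamma argument of slim.batch_norm; the helper name, arg_scope, and parameter values below are illustrative assumptions.

```python
import tensorflow.contrib.slim as slim

def conv_arg_scope(bn_scale):
    """Hypothetical helper: scale=False omits gamma, scale=True learns it."""
    batch_norm_params = {'scale': bn_scale, 'decay': 0.997, 'epsilon': 1e-5}
    return slim.arg_scope(
        [slim.conv2d],
        normalizer_fn=slim.batch_norm,
        normalizer_params=batch_norm_params)
```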

andrewowens commented 6 years ago

Thanks! On this line: https://github.com/andrewowens/multisensory/blob/1bb54feaf76c8f50f4fc2aef189f807d12b576cb/src/sourcesep.py#L634 we specify that the gamma parameter should not be restored from the self-supervised network's checkpoint. Then, we re-initialize gamma to be approximately 1.
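For readers following along, a minimal sketch of this pattern in TF-Slim, i.e. restoring everything except the batch-norm gamma variables so they keep their initializer value of roughly 1; the variable filtering and checkpoint path below are illustrative, not the repo's exact code.

```python
import tensorflow as tf
import tensorflow.contrib.slim as slim

# Restore all model variables from the self-supervised checkpoint except
# batch-norm gamma, which keeps its initial value (approximately 1).
all_vars = slim.get_model_variables()
restore_vars = [v for v in all_vars if 'gamma' not in v.op.name]

saver = tf.train.Saver(var_list=restore_vars)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.restore(sess, 'shift_model.ckpt')  # hypothetical checkpoint path
```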

Sorry if that was confusing. In very early experiments, I was having trouble using the gamma parameter for the self-supervision task (it seemed to trend toward 0, and training would get stuck at chance performance), which is why I didn't use it there.

LionnelBall commented 6 years ago

Thanks for the quick reply; that indeed resolved my confusion! Another thing I'm wondering is whether it is possible to make the separation model smaller while keeping performance roughly the same, e.g. with fewer convolution kernels or fewer layers?

andrewowens commented 6 years ago

I suggest decreasing the number of frequency bins in the STFT (e.g. by decreasing the frame_length_ms parameter) and removing layers from the u-net model to compensate. Other recent work (e.g. https://arxiv.org/pdf/1804.04121.pdf, https://arxiv.org/pdf/1804.03619.pdf) does fine with ~25% as many frequency bins. Hope that helps!
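As a rough back-of-the-envelope check of how the window length maps to STFT bin count (the 21 kHz sample rate and power-of-two FFT sizing here are assumptions, not taken from the repo):

```python
import numpy as np

def num_freq_bins(frame_length_ms, sample_rate=21000):
    """Approximate STFT frequency-bin count for a given window length."""
    frame_length = int(round(frame_length_ms / 1000.0 * sample_rate))
    n_fft = int(2 ** np.ceil(np.log2(frame_length)))  # next power of two
    return n_fft // 2 + 1

print(num_freq_bins(64))   # ~64 ms window -> 1025 bins (under these assumptions)
print(num_freq_bins(16))   # ~16 ms window -> 257 bins, roughly a quarter as many
```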