etzinis / sudo_rm_rf

Code for SuDoRm-Rf networks for efficient audio source separation. SuDoRm-Rf stands for SUccessive DOwnsampling and Resampling of Multi-Resolution Features which enables a more efficient way of separating sources from mixtures.
MIT License

Causal sudo rm rf model #23

Closed maldivesxue closed 1 year ago

maldivesxue commented 1 year ago

Q1: Activation in the encoder In Section 2.6 of your paper, it is mentioned that there are two differences between the causal version and the improved version of the "sudo rm -rf" model. However, upon reviewing the code,

        self.encoder = ScaledWSConv1d(in_channels=in_audio_channels,
                                      out_channels=enc_num_basis,
                                      kernel_size=enc_kernel_size * 2 - 1,
                                      stride=enc_kernel_size // 2,
                                      padding=(enc_kernel_size * 2 - 1) // 2,
                                      bias=False)
        torch.nn.init.xavier_uniform_(self.encoder.weight)

I couldn't find any activation applied in the encoder. Could you please clarify whether there is an activation function that is implicitly applied or if there is any specific reason for not including an explicit activation in the encoder?
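For reference, applying a ReLU on top of the encoder output is a one-line change. Here is a minimal sketch of that variant, using a plain `nn.Conv1d` as a stand-in for the repo's `ScaledWSConv1d` and the commonly used `enc_kernel_size=21` (both assumptions, not the repo's exact module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch: plain Conv1d stands in for the repo's ScaledWSConv1d.
enc_kernel_size, enc_num_basis = 21, 512
encoder = nn.Conv1d(in_channels=1,
                    out_channels=enc_num_basis,
                    kernel_size=enc_kernel_size * 2 - 1,
                    stride=enc_kernel_size // 2,
                    padding=(enc_kernel_size * 2 - 1) // 2,
                    bias=False)
torch.nn.init.xavier_uniform_(encoder.weight)

x = torch.randn(1, 1, 16000)   # (batch, channels, samples)
latents = F.relu(encoder(x))   # optional non-negativity on the latents
```

With the ReLU the latent representation is constrained to be non-negative; without it the encoder is a plain linear filterbank.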

Q2: Kernel size discrepancy between encoder and decoder In the causal version, I noticed that the kernel size in the encoder is set to enc_kernel_size * 2 - 1, while in the decoder it is set to enc_kernel_size. However, in the improved version of the model, the kernel size between the encoder and decoder is the same as mentioned in the paper. I would like to understand the rationale behind the different kernel sizes in the causal version and the exact settings you used for the causal version.
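The shape implications of the two kernel choices can be checked with the standard 1-D convolution output-length formula. The sketch below (pure Python, assuming `enc_kernel_size=21` and a 16 kHz one-second input, both assumptions) shows that with the paddings used in the code both kernels produce the same latent length, so the discrepancy affects the receptive field rather than the tensor shapes:

```python
# Standard 1-D convolution output-length formula:
#   L_out = (L_in + 2*padding - kernel) // stride + 1
def conv_out_len(L_in, kernel, stride, padding):
    return (L_in + 2 * padding - kernel) // stride + 1

enc_kernel_size = 21            # assumed default; check your own config
stride = enc_kernel_size // 2   # 10
L_in = 16000                    # 1 s of 16 kHz audio

# Encoder kernel as written in the causal code: enc_kernel_size * 2 - 1
wide = conv_out_len(L_in, enc_kernel_size * 2 - 1, stride,
                    (enc_kernel_size * 2 - 1) // 2)
# Matched kernel, as in the improved model: enc_kernel_size
matched = conv_out_len(L_in, enc_kernel_size, stride,
                       enc_kernel_size // 2)
```

Both settings yield the same latent length, but the wider kernel with symmetric padding looks further into the future at each step, which is exactly what a causal model is supposed to avoid.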

Q3: Hyperparameters of the causal version Could you please provide the hyperparameters used in the causal version of the "sudo rm -rf" model? This includes parameters such as learning rates, batch sizes, number of blocks, number of channels, and any other relevant hyperparameters. Having this information would be helpful for reproducing and understanding the results.

Q4: Open-sourcing casual pretrained model I'm curious to know if there are any plans to open source some pretrained models for the causal version of the "sudo rm -rf" model. It would be valuable to have access to pretrained models to facilitate further research and applications. Could you share any insights or plans regarding this?

Thank you for your attention to these issues. I look forward to your response and clarification.

etzinis commented 1 year ago

Hey, thanks for reaching out.

Q1: I have run several experiments with and without a ReLU applied on top of the encoder activations, but it is not that important, either in terms of separation performance or computationally. Feel free to ignore that difference.

Q2: Lol, this is probably why I was getting such a performance decline :P It should be enc_kernel_size for both the encoder and the decoder.

Q3: Exactly the same as specified in the paper: https://link.springer.com/article/10.1007/s11265-021-01683-x

Q4: You are more than right on this one. I should have done it already, but right now I am in the middle of moving, so I might do that later. However, the cool thing about sudo models is that you can train one on a single GPU in a day or two :)
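A quick way to sanity-check the Q2 fix is to build a toy encoder/decoder pair with the same kernel_size and verify the round trip. This is a sketch with plain `Conv1d`/`ConvTranspose1d` standing in for the repo's modules, and `enc_kernel_size=21` assumed; the transposed convolution comes back a few samples short, so the output is padded back to the input length:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy encoder/decoder pair with matched kernel sizes (the Q2 fix).
# Plain Conv1d / ConvTranspose1d stand in for the repo's modules.
k, basis = 21, 512
enc = nn.Conv1d(1, basis, kernel_size=k, stride=k // 2,
                padding=k // 2, bias=False)
dec = nn.ConvTranspose1d(basis, 1, kernel_size=k, stride=k // 2,
                         padding=k // 2, bias=False)

x = torch.randn(1, 1, 16000)
z = enc(x)   # latent representation
y = dec(z)   # slightly shorter than x; pad back to the input length
y = F.pad(y, (0, x.shape[-1] - y.shape[-1]))
```

In the actual model the masked latents would go through the decoder instead of `z` directly, but the shape bookkeeping is the same.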

maldivesxue commented 1 year ago

Thank you so much for your detailed reply. One more thing: the default settings of out_channels and in_channels in SuDORMRF are 128 and 512, respectively, but according to your paper it seems to be the reverse. Is that a typo in the paper, or should we change it manually? https://github.com/etzinis/sudo_rm_rf/blob/cd00f2e21f7ad6281360cdf24ade36f84b0fbad6/sudo_rm_rf/dnn/models/improved_sudormrf.py#LL225C1-L226C1

etzinis commented 1 year ago

I think it is the right order, but try with 256 instead of 128.