ZhongshuHou / Personalized-Speech-Enhancement-Demo

This is the demo page of an ongoing personalized speech enhancement (pSE) project. The speaker embedding is generated from the fbank and MFCC features of the enrollment speech.

Some questions about the model #2

Open wendongj opened 8 months ago

wendongj commented 8 months ago

Thanks for the demo of your team's ongoing research. From another issue in this repo, I see that the two papers you reference are the same ones I am following, and I have some questions about reproducing the model:

1. As stated in the project introduction, fbank and MFCC features are used as the speaker embedding. In X-TF-GridNet, the embedding is extracted by a U2-Net; in your team's DNS2023 paper, it is extracted by an ECAPA-TDNN; and one DNS2023 team used fbank together with an ECAPA-TDNN embedding. Did you drop the embedding extracted by an additional network, and just use fbank and MFCC features as the embedding?

2. You mention that your structure is similar to X-TF-GridNet, so it seems you use the speaker attentive module from "Personalized Speech Enhancement Combining Band-Split RNN and Speaker Attentive Module" to fuse the embedding with the enhancement network. I have reproduced this structure as follows, but the result is poor. Can you help me check whether it is right?

import math

import torch
import torch.nn as nn


class SAM(nn.Module):
    def __init__(self, C1, C, emb_dimension, **kwargs):
        super(SAM, self).__init__(**kwargs)
        self.C1 = C1
        self.conv_2 = nn.Conv2d(C, C1, kernel_size=(3, 3), stride=(1, 1))
        self.bn_2 = nn.BatchNorm2d(C1, eps=1e-8)
        self.act_2 = nn.PReLU(C1)

        self.conv_3 = nn.Conv2d(C, C, kernel_size=(1, 1), stride=(1, 1))
        self.bn_3 = nn.BatchNorm2d(C, eps=1e-8)
        self.act_3 = nn.PReLU(C)

        self.fc = nn.Linear(emb_dimension, C1)

    def forward(self, x, emb):
        # x.shape   = (B, C, T, F)
        # emb.shape = (B, emb_dimension)
        B,C,T,F = x.shape
        x_1 = torch.nn.functional.pad(x, pad=(1, 1, 2, 0))
        x_1 = self.conv_2(x_1)
        x_1 = self.bn_2(x_1)
        x_1 = self.act_2(x_1) # B C1 T F

        emb = self.fc(emb) # B C1

        q = x_1.permute(0,2,3,1) # B T F C1
        k = emb.unsqueeze(1).unsqueeze(3).repeat(
            1, T, 1, 1) # B T C1 1
        s = torch.softmax(torch.matmul(q, k) / math.sqrt(F * self.C1 / 2), dim=-1)  # B T F 1
        s = s.repeat(
            1, 1, 1, C) # B T F C
        s = s.permute(0,3,1,2) # B C T F
        x = self.act_3(self.bn_3(self.conv_3(s * x))) + x  # B C T F

        return x
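
A minimal smoke test of the module above, with placeholder dimensions (C=16, C1=8, emb_dimension=80 and the 161-bin feature map are arbitrary choices for the check, not values from either paper):

# Hypothetical shapes for a quick sanity check; not taken from the papers.
sam = SAM(C1=8, C=16, emb_dimension=80)
feats = torch.randn(2, 16, 100, 161)  # (B, C, T, F) intermediate features
spk_emb = torch.randn(2, 80)          # (B, emb_dimension) speaker embedding
out = sam(feats, spk_emb)
print(out.shape)  # torch.Size([2, 16, 100, 161]); the residual path preserves the shape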
ZhongshuHou commented 8 months ago

Thank you for your attention.

1. As for the generation of the speaker embedding, our experiments revealed that an additional network and fbank features both produce effective embeddings with similar performance. Hence, we chose fbank to generate the embedding, for lower network complexity. Specifically, the fbank is generated through the Python package kaldi.fbank.

2. As for the network details of the speaker attentive module, please forgive me for not being able to disclose too many details, because the current project design is a business secret. What I can say is that your code looks reasonable; perhaps you can first verify whether your training data is clean enough, for example whether there are issues with the quality of the enrollment speech.
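
For concreteness, a sketch of how such enrollment fbank features could be extracted. The reply only names "kaldi.fbank", so torchaudio.compliance.kaldi.fbank (a Kaldi-compatible implementation) is used here as a stand-in, and num_mel_bins=80 is an assumption, not the confirmed configuration:

import torch
import torchaudio

def extract_fbank(wav: torch.Tensor, sr: int = 16000) -> torch.Tensor:
    # wav: (1, num_samples) enrollment waveform.
    # Kaldi-compatible log-mel filterbank; parameter values are assumptions.
    feats = torchaudio.compliance.kaldi.fbank(
        wav, sample_frequency=sr, num_mel_bins=80
    )  # (num_frames, 80)
    return feats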

Zhongshu Hou


wendongj commented 7 months ago


Really, thanks for your reply. For 2, I will check the data. For 1, as I understand it, the fbank is average-pooled along the frame dimension, so its dimension is [B, D], where D is the embedding size? And if the model is similar to X-TF-GridNet, is the speaker encoder removed now? Or is the fbank used as the input to the speaker encoder (instead of the real and imaginary parts of the reference speech used in X-TF-GridNet), with the speaker encoder still present?

ZhongshuHou commented 7 months ago

The dimension of the fbank feature is [B, D], and we do not use an additional speaker encoder to generate the embedding. We directly use the normalized fbank statistics as the speaker embedding, which is sent to the attentive-fusion model.
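
To make the [B, D] shape concrete, a minimal pooling sketch. Mean pooling over frames plus per-utterance standardization is one plausible reading of "normalized fbank statistics"; the exact recipe is not disclosed in the thread:

import torch

def fbank_to_embedding(fbank: torch.Tensor) -> torch.Tensor:
    # fbank: (B, T, D) per-utterance fbank features.
    emb = fbank.mean(dim=1)  # average-pool over the frame dimension -> (B, D)
    # Per-utterance standardization; the actual normalization is an assumption.
    emb = (emb - emb.mean(dim=1, keepdim=True)) / (emb.std(dim=1, keepdim=True) + 1e-8)
    return emb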

 


wendongj commented 7 months ago


I see, thanks for your reply! ^_^