TaoRuijie / ECAPA-TDNN

Unofficial reimplementation of ECAPA-TDNN for speaker recognition (EER=0.86 for Vox1_O when trained only on Vox2)
MIT License

Misalignment of "model.py" code with the original paper: layer 2 should only take the output of layer 1 as input, and layer 3 should only take the output of layer 2 as input #59

Closed: ja8143912 closed this issue 1 year ago

ja8143912 commented 1 year ago

Dear Author,

I noticed that lines 179-183 in the "model.py" file do not adhere to the specifications outlined in the original paper (Fig. 2, available at https://arxiv.org/pdf/2005.07143.pdf). According to the paper, layer 2 should only take the output of layer 1 as input, and layer 3 should only take the output of layer 2 as input.

The current implementation in lines 179-183 of "model.py" is as follows:

    x1 = self.layer1(x)
    x2 = self.layer2(x + x1)
    x3 = self.layer3(x + x1 + x2)
    x = self.layer4(torch.cat((x1, x2, x3), dim=1))

However, the paper suggests the following structure:

    x1 = self.layer1(x)
    x2 = self.layer2(x1)
    x3 = self.layer3(x2)
    x = self.layer4(torch.cat((x1, x2, x3), dim=1))
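To make the difference between the two wirings easier to see, here is a minimal sketch in plain Python that traces what each block receives as input. It is only an illustration: `make_layer`, `add`, `forward_summed`, and `forward_chained` are hypothetical names, the "layers" are string-building stand-ins rather than the real blocks from "model.py", and a list stands in for `torch.cat`.

```python
# Symbolic trace of the two wirings discussed in this issue.
# Assumption: the layers below are stand-ins that only record their input;
# they are NOT the actual layer1..layer3 modules from model.py.

def make_layer(name):
    """Return a stand-in layer that wraps its input expression in name(...)."""
    def layer(expr):
        return f"{name}({expr})"
    return layer


layer1, layer2, layer3 = (make_layer(f"layer{i}") for i in (1, 2, 3))


def add(*exprs):
    """Stand-in for element-wise tensor addition (x + x1 + ...)."""
    return " + ".join(exprs)


def forward_summed(x):
    """Wiring currently in model.py: each block sees the sum of x and all earlier outputs."""
    x1 = layer1(x)
    x2 = layer2(add(x, x1))
    x3 = layer3(add(x, x1, x2))
    return [x1, x2, x3]  # stand-in for torch.cat((x1, x2, x3), dim=1)


def forward_chained(x):
    """Wiring proposed in this issue: each block sees only the previous block's output."""
    x1 = layer1(x)
    x2 = layer2(x1)
    x3 = layer3(x2)
    return [x1, x2, x3]  # stand-in for torch.cat((x1, x2, x3), dim=1)


if __name__ == "__main__":
    for fn in (forward_summed, forward_chained):
        print(fn.__name__)
        for branch in fn("x"):
            print("   ", branch)
```

In both variants the same three outputs are concatenated before layer 4; the only difference is whether blocks 2 and 3 also receive the earlier activations added to their input.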

Thank you for your attention to this matter.

TaoRuijie commented 1 year ago

Hi, when I read the paper, I understood their structure the way I wrote it in the code. I have also seen code written the way you describe. Based on my understanding I prefer the first one, so I use it.

By the way, I also tried your structure before and, as far as I remember, got similar results. That finding might be useful to you.

Thank you