TaoRuijie / ECAPA-TDNN

Unofficial reimplementation of ECAPA-TDNN for speaker recognition (EER = 0.86 on Vox1_O when trained only on Vox2)
MIT License

questions about res2net in model.py #68

Closed rndlwjs closed 10 months ago

rndlwjs commented 10 months ago

First of all, thank you very much for sharing your code! This was very helpful in my study.

I have a question about the model architecture in model.py.

According to the official Res2Net paper, only the first scale split does not go through a convolutional layer, in order to reduce computational complexity.
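
For reference, the hierarchical split processing in the Res2Net paper is (with x_i the i-th split, K_i the i-th conv, and s the scale):

    y_1 = x_1
    y_2 = K_2(x_2)
    y_i = K_i(x_i + y_{i-1})  for 2 < i <= s

so only the first split is passed through without a convolution.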

In model.py, assuming self.nums is 7, the convs[] ModuleList has 7 blocks. However, the final scale split does not seem to go through a conv module (line 72).

If I'm incorrect, please forgive my misunderstanding. However, I thought the last scale split would be important, because all the features are aggregated in that block, so passing it through a conv might give even better results.

TaoRuijie commented 10 months ago

I think the code is correct.

for i in range(self.nums):
    if i == 0:
        sp = spx[i]
    else:
        sp = sp + spx[i]
    sp = self.convs[i](sp)
    sp = self.relu(sp)
    sp = self.bns[i](sp)
    if i == 0:
        out = sp
    else:
        out = torch.cat((out, sp), 1)

For i = 0, sp = spx[0] and sp goes through the conv layers; the output is sp. Then for i = 1 to 6, sp = sp + spx[i] goes into the conv layers, the output is sp again, and it is concatenated into out.

Note that all 7 of these splits go through conv; they correspond to layers 1 to 7 in the paper.

Finally, line 72, out = torch.cat((out, spx[self.nums]), 1), concatenates that output with the original split spx[7]. Here spx[7] does not go through conv; it corresponds to the first layer in the paper.

So the final output is 7 splits processed by conv plus 1 split passed through directly.
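
To make it concrete, here is a minimal, self-contained sketch of just this split/concat logic (names like width, nums, convs, bns follow the discussion above; the actual Bottle2neck in model.py also wraps this with pointwise convs and an SE module, which are omitted here):

```python
import math
import torch
import torch.nn as nn

class Res2NetSplitSketch(nn.Module):
    """Illustrative Res2Net-style split processing: nums splits go through
    conv/relu/bn with hierarchical additions, the last split bypasses conv."""

    def __init__(self, channels, scale=8, kernel_size=3, dilation=1):
        super().__init__()
        self.width = channels // scale   # channels per split
        self.nums = scale - 1            # number of splits that go through conv
        pad = math.floor(kernel_size / 2) * dilation
        self.convs = nn.ModuleList([
            nn.Conv1d(self.width, self.width, kernel_size=kernel_size,
                      dilation=dilation, padding=pad)
            for _ in range(self.nums)
        ])
        self.bns = nn.ModuleList([nn.BatchNorm1d(self.width) for _ in range(self.nums)])
        self.relu = nn.ReLU()

    def forward(self, x):
        # x: (batch, channels, time); split into `scale` chunks along channels
        spx = torch.split(x, self.width, dim=1)
        for i in range(self.nums):
            # hierarchical connection: add previous conv output before conv (except i == 0)
            sp = spx[i] if i == 0 else sp + spx[i]
            sp = self.convs[i](sp)
            sp = self.relu(sp)
            sp = self.bns[i](sp)
            out = sp if i == 0 else torch.cat((out, sp), dim=1)
        # the remaining split bypasses conv entirely (the "line 72" concat)
        out = torch.cat((out, spx[self.nums]), dim=1)
        return out

# quick shape check
block = Res2NetSplitSketch(channels=512, scale=8)
print(block(torch.randn(2, 512, 100)).shape)  # torch.Size([2, 512, 100])
```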

rndlwjs commented 10 months ago

Oh, I see. Now I understand that line 72 was intentional. I was probably too focused on the order of the scales, but it is not that important.

Thank you very much for your detailed explanation. It was very helpful :)