eurecom-asp / rawnet2-antispoofing

This repository includes the code to reproduce our paper "End-to-end anti-spoofing with RawNet2" (https://arxiv.org/abs/2011.01108) published in ICASSP '21.
MIT License

Strange Sinc Layer #6

Open Blinorot opened 1 year ago

Blinorot commented 1 year ago

Dear author,

I was trying to understand how the sinc layer in your code works. Could you please explain two lines in this part:

        # initialize filterbanks using Mel scale
        NFFT = 512
        f=int(self.sample_rate/2)*np.linspace(0,1,int(NFFT/2)+1)

        if freq_scale == 'Mel':
            fmel=self.to_mel(f) # Hz to mel conversion
            fmelmax=np.max(fmel)
            fmelmin=np.min(fmel)
            filbandwidthsmel=np.linspace(fmelmin,fmelmax,self.out_channels+2)
            filbandwidthsf=self.to_hz(filbandwidthsmel) # Mel to Hz conversion
            self.freq=filbandwidthsf[:self.out_channels]

1) Why do we need f=int(self.sample_rate/2)*np.linspace(0,1,int(NFFT/2)+1)? It seems that fmelmax is always equal to self.to_mel(int(self.sample_rate/2)) and fmelmin is always equal to self.to_mel(0), so the fact that f is a linspace is never used.

2) Why do we need +2 in filbandwidthsmel=np.linspace(fmelmin,fmelmax,self.out_channels+2)? It seems that self.freq=filbandwidthsf[:self.out_channels] does not include the two highest frequencies because of this line. I could not find a note about that in the paper.
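To make both points concrete, here is a minimal standalone sketch of the initialization outside the class. The to_mel and to_hz helpers are assumed to be the usual HTK-style conversions (they are not shown in the excerpt above), and sample_rate = 16000 and out_channels = 20 are illustrative values, not taken from the repository.

```python
import numpy as np

# Assumed HTK-style conversions; the excerpt above does not show to_mel/to_hz.
def to_mel(hz):
    return 2595 * np.log10(1 + hz / 700)

def to_hz(mel):
    return 700 * (10 ** (mel / 2595) - 1)

sample_rate = 16000   # illustrative value (assumption)
out_channels = 20     # illustrative value (assumption)
NFFT = 512

# Point 1: only the endpoints of f matter for fmelmin and fmelmax.
f = int(sample_rate / 2) * np.linspace(0, 1, int(NFFT / 2) + 1)
fmel = to_mel(f)
fmelmin, fmelmax = np.min(fmel), np.max(fmel)
same_min = bool(np.isclose(fmelmin, to_mel(0)))
same_max = bool(np.isclose(fmelmax, to_mel(int(sample_rate / 2))))

# Point 2: out_channels + 2 points are generated, but only the first
# out_channels of them are kept, so the two highest points are dropped.
filbandwidthsmel = np.linspace(fmelmin, fmelmax, out_channels + 2)
filbandwidthsf = to_hz(filbandwidthsmel)
freq = filbandwidthsf[:out_channels]
dropped = filbandwidthsf[out_channels:]

print("endpoints match:", same_min, same_max)
print("highest kept frequency (Hz):", freq[-1])
print("dropped frequencies (Hz):", dropped)
```

Under these assumptions the last dropped value is sample_rate/2, and the highest frequency that is kept lies below it.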

Thank you.

GeWanying commented 12 months ago

Hi @Blinorot, as per my understanding:

  1. As you mentioned, f is only used to obtain the fixed, predetermined fmelmax and fmelmin. The result is correct, but defining f for that purpose is indeed redundant.

  2. The two highest frequency bands are not included.

The sinc layer in this work is slightly different from the original SincNet (here), where the code has low_hz and high_hz standing for the lowest and highest frequencies of the mel scale (librosa documentation here). So the default mel scale obtained in this repo has a lowest frequency of 0, and a highest frequency of whatever filbandwidthsf has at self.out_channels, rather than self.out_channels+2.
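As a rough illustration of how that highest frequency moves with self.out_channels, the sketch below evaluates the last element of filbandwidthsf[:out_channels] for a few channel counts. The helper name highest_init_freq, the HTK-style to_mel/to_hz formulas, the 16 kHz sample rate, and the channel counts are all assumptions made for this example.

```python
import numpy as np

# Assumed HTK-style conversions (not shown in the quoted excerpt).
def to_mel(hz):
    return 2595 * np.log10(1 + hz / 700)

def to_hz(mel):
    return 700 * (10 ** (mel / 2595) - 1)

def highest_init_freq(out_channels, sample_rate=16000):
    """Hypothetical helper: last value of filbandwidthsf[:out_channels]."""
    mel_max = to_mel(sample_rate / 2)
    # Same out_channels + 2 grid as in the excerpt, starting at 0 mel.
    edges_hz = to_hz(np.linspace(0.0, mel_max, out_channels + 2))
    return float(edges_hz[out_channels - 1])

# Arbitrary channel counts, chosen only to show the trend.
for channels in (20, 70, 128):
    print(channels, round(highest_init_freq(channels), 1))
```

With this construction the highest kept frequency approaches sample_rate/2 as the number of channels grows, but never reaches it.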

Blinorot commented 12 months ago

> So the default mel scale obtained in this repo has a lowest frequency of 0, and a highest frequency of whatever filbandwidthsf has at self.out_channels, rather than self.out_channels+2.

Hello, thank you, I understand this. However, the question is why the highest frequency is chosen like that. I think it is an important question because:

1) In the ASVspoof 2021 baseline version of RawNet2, the highest frequency is always sr/2.

2) It is not obvious why the highest frequency should depend on the value of self.out_channels.

3) If we change the highest frequency to sr/2 by fixing this +2, the linear-scale and inverse mel-scale versions of RawNet2 in this repository start to overfit and do not produce the results depicted in the paper. If we do not remove this +2, the model works well. So this +2 is important, but I did not find anything about it in the paper.
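For reference, the two initializations being compared can be sketched side by side for the mel-scale case. This is only one possible reading of "fixing this +2": the init_freqs helper and its n_extra parameter are hypothetical, the to_mel/to_hz formulas are the assumed HTK-style ones, and 16 kHz with 20 channels is an illustrative setting. It says nothing about the overfitting itself, only about where the initial frequencies land.

```python
import numpy as np

# Assumed HTK-style conversions (not shown in the quoted excerpt).
def to_mel(hz):
    return 2595 * np.log10(1 + hz / 700)

def to_hz(mel):
    return 700 * (10 ** (mel / 2595) - 1)

def init_freqs(out_channels, n_extra, sample_rate=16000):
    """Hypothetical helper: mel-spaced grid with n_extra surplus points.

    n_extra=2 mirrors the out_channels+2 line from the repository excerpt;
    n_extra=0 makes the last kept frequency equal to sample_rate / 2.
    """
    mel_max = to_mel(sample_rate / 2)
    edges_hz = to_hz(np.linspace(0.0, mel_max, out_channels + n_extra))
    return edges_hz[:out_channels]

repo_style = init_freqs(20, n_extra=2)  # highest frequency below sr/2
full_band = init_freqs(20, n_extra=0)   # highest frequency equal to sr/2

print("repo-style top frequency (Hz):", repo_style[-1])
print("full-band top frequency (Hz):", full_band[-1])
```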