Alibaba-MIIL / AudioClassfication


The parameter count is reduced and the performance is unchanged #2

Closed: mjq2020 closed this issue 2 years ago

mjq2020 commented 2 years ago

Hello, after training your model I reproduced the results from the paper. I then replaced the Transformer and reduced the channels of each convolutional layer, which brought the model down to only about 40k parameters. On UrbanSound8K this reduces accuracy by roughly 0.9 percentage points; on SpeechCommands, with background-noise augmentation, it has reached 95.26% so far and is still training. Have you ever tested a model this small? It makes me think the Transformer may not contribute much relative to its parameter count, and that smaller models would be easier to deploy to edge devices.
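(For illustration only: the exact modification is not shared in this thread. The following is a minimal sketch of the kind of replacement described, swapping the transformer head for temporal average pooling; the channel width and the 10-class output for UrbanSound8K are assumptions.)

```python
import torch
import torch.nn as nn

# Hypothetical sketch of a lightweight head: the transformer encoder is
# replaced by average pooling over time. The channel width (64) and class
# count (10, UrbanSound8K) are assumptions, not values from the thread.
class PoolingHead(nn.Module):
    def __init__(self, channels: int = 64, n_classes: int = 10):
        super().__init__()
        self.fc = nn.Linear(channels, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time) feature map from the convolutional backbone
        x = x.mean(dim=-1)   # temporal average pooling -> (batch, channels)
        return self.fc(x)    # class logits -> (batch, n_classes)

head = PoolingHead()
logits = head(torch.randn(8, 64, 100))  # e.g. 8 clips, 64 channels, 100 frames
```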

mjq2020 commented 2 years ago

I also found a bug in the data-loading code that makes data loading very slow. I will submit a fix later.

avi33 commented 2 years ago

> After I replaced the Transformer and reduced the channels of each convolutional layer, the model has only about 40k parameters [...] Have you ever tested a model this small? It makes me think the Transformer may not contribute much relative to its parameter count.

Hi, thanks for this observation. Yes, in the ablation (which was conducted only on ESC-50) we noticed that the transformer increases accuracy at the cost of extra complexity; whether that is worth it depends on your application. Since we aimed to be comparable to other SOTA models, we preferred it over, for example, an average-pooling layer. You could also try to mitigate the accuracy drop by lowering the feedforward dimension in the transformer encoder; I used 512 (the PyTorch default is 2048). Hopefully that will have less impact on the results. If you want to deploy on an edge device, I would suggest using the RepVGG trick (https://arxiv.org/abs/2101.03697) for the residual blocks; it will lower the memory footprint.
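A minimal sketch of the feedforward-dimension suggestion, assuming PyTorch's built-in transformer encoder; d_model and nhead below are placeholder values, not the repository's actual configuration:

```python
import torch.nn as nn

# Shrinking the transformer encoder's feedforward block from the PyTorch
# default (dim_feedforward=2048) to 512, as suggested above.
# d_model and nhead are placeholders; match them to the conv backbone.
encoder_layer = nn.TransformerEncoderLayer(
    d_model=256,
    nhead=8,
    dim_feedforward=512,   # default is 2048; this removes most of the FFN parameters
    batch_first=True,
)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)
```

The RepVGG re-parameterization mentioned above is a separate change: the multi-branch residual blocks used during training are fused into single 3x3 convolutions before deployment, so inference-time memory and latency drop without retraining a different architecture.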

avi33 commented 2 years ago

> I also found a bug in the data-loading code that makes data loading very slow. I will submit a fix later.

I was not aware of an issue in the torchaudio loading process - nice.

mjq2020 commented 2 years ago

> I also found a bug in the data-loading code that makes data loading very slow. I will submit a fix later.

> I was not aware of an issue in the torchaudio loading process - nice.

After testing, I found that the slowdown is mainly caused by the AudioAugs class, which makes each training epoch take much longer; sometimes a single epoch can take several hours or even more. Also, loading with torchaudio consumes less CPU and memory than librosa (with sr=None).
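For illustration, a small sketch of the loading difference being described (the file path is a placeholder):

```python
import torchaudio
import librosa

path = "example.wav"  # placeholder path

# torchaudio decodes directly into a torch tensor at the file's native sample rate.
waveform, sr = torchaudio.load(path)      # waveform: (channels, samples) float tensor

# librosa with sr=None also keeps the native rate, but returns a NumPy mono array
# and, as noted above, tends to use more CPU and memory inside DataLoader workers.
y, sr_lib = librosa.load(path, sr=None)   # y: (samples,) float32 ndarray
```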