YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

About start training: IndexError: tuple index out of range. #58

Open TungyuYoung opened 2 years ago

TungyuYoung commented 2 years ago

Hi Dr. Gong, I am using AST on my own dataset. I have created the .json and .csv files according to the guide. However, when I run run.sh, an error occurs and I have not been able to fix it. The error is shown below: [screenshot of the error]. I don't know how to solve it. I would appreciate it if you could tell me the reason.

Yours

TungyuYoung commented 2 years ago

I looked for the reason for a while. I printed the shape of fbank, which is [128, 1024]. It seems that the spectrograms haven't been converted to RGB channels. Would you please tell me how to fix it?

YuanGongND commented 2 years ago

I cannot tell the reason either. But there's no RGB concept in the audio spectrogram. It is just single-channel information. [128, 1024] means 128 frequency bins and 1024 time frames, which looks correct. The problem seems to happen in spectrogram masking.

TungyuYoung commented 2 years ago

Alright. I checked line 191 in the file ./dataloader, and it seems that the problem happens in TimeMasking.

TungyuYoung commented 2 years ago

What's more, the average length of the audio in my dataset is only 2.5 s. Do you think such short audio will cause errors in TimeMasking and affect the final performance of the model?

YuanGongND commented 2 years ago

The length depends on input_tdim. In your case, you should modify run.py to set input_tdim=250. timem should be smaller than input_tdim. Again, I suggest starting from either the speechcommands or esc50 recipe to get more familiar with the code.
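As a rough illustration of the advice above, the sketch below estimates the number of spectrogram frames for a clip and checks that the time-mask width stays below it. It assumes a 10 ms frame shift, which is consistent with 2.5 s mapping to input_tdim=250 but is not stated in this thread. The helper names `frames_for_duration` and `check_time_mask`, and the example timem values, are hypothetical and not part of the AST code.

```python
def frames_for_duration(seconds, frame_shift_ms=10.0):
    """Approximate frame count for a clip (assumes a 10 ms frame shift)."""
    return int(round(seconds * 1000.0 / frame_shift_ms))


def check_time_mask(input_tdim, timem):
    """Return True if the time-mask width is smaller than the frame count."""
    return 0 <= timem < input_tdim


input_tdim = frames_for_duration(2.5)           # average clip length from the thread
print(input_tdim)                               # 250
print(check_time_mask(input_tdim, timem=48))    # True: mask fits inside the clip
print(check_time_mask(input_tdim, timem=300))   # False: mask wider than the clip
```

The exact frame count produced by a real feature extractor may differ by a few frames because of windowing, so the result is only an estimate for choosing input_tdim.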

TungyuYoung commented 2 years ago

Hi Dr. Gong, I commented out the timem masking and the program ran successfully. I split my own dataset into test and train sets: 20 percent of each class for testing and the rest for training. After training, I got a strange result, as shown:

0.908529570359724 0.965062499999999 0.685776370522133 1 2.56357337530442

These are the values from wa_result.csv. I'm very confused by them. Besides, the recall has a value of 1 in every epoch. I also tried using inference.py to predict an audio file, and it seems to work well. I would appreciate it if you could tell me how you would approach this problem.

Yours

Mxnet123 commented 2 years ago

I also encountered the same problem. Could you please tell me in detail how to solve it? Thank you very much.

TungyuYoung commented 2 years ago

> I also encountered the same problem. Could you please tell me in detail how to solve it? Thank you very much.

What exactly is your problem?

Mxnet123 commented 2 years ago

About start training: IndexError: tuple index out of range?

TungyuYoung commented 2 years ago

> About start training: IndexError: tuple index out of range?

I simply commented out these two lines: https://github.com/YuanGongND/ast/blob/7b2fe7084b622e540643b0d7d7ab736b5eb7683b/src/dataloader.py#L192 https://github.com/YuanGongND/ast/blob/7b2fe7084b622e540643b0d7d7ab736b5eb7683b/src/dataloader.py#L193. Then I performed data augmentation myself before training and used the augmented data as input.

Mxnet123 commented 2 years ago

Okay, thank you very much. I'll try it.

YuanGongND commented 2 years ago

OK, I finally found the reason.

This is due to a torchaudio issue. We use torchaudio 0.8.1, in which the input to the masking can be [freq, time], whereas newer versions of torchaudio only accept [1, freq, time].

I have fixed it with a workaround (works for both old and new torchaudio) at https://github.com/YuanGongND/ast/blob/b7086755ebfd9f2ab018c0a40722b8418d9d41fe/src/dataloader.py#L190-L197

Your workaround (commenting out the time masking) might cause an inaccurate masking span and lead to a performance drop (though it might be small). I would suggest using our fixed code. It is very simple.

YuanGongND commented 2 years ago

You can use the Colab script to find the bug https://colab.research.google.com/github/YuanGongND/ast/blob/master/colab/torchaudio_SpecMasking_1_1.ipynb