TungyuYoung opened 2 years ago
I checked for the reason for a while. I printed the shape of the fbank, which is [128, 1024]. It seems that the spectrograms haven't been converted to RGB channels. Would you please tell me how to fix it?
I cannot tell the reason either, but there's no RGB concept for an audio spectrogram; it is single-channel information. [128, 1024] means 128 frequency bins and 1024 time frames, which looks correct. The problem seems to happen in the spectrogram masking.
Alright. I checked line 191 in ./dataloader; it seems the problem happens in TimeMasking.
What's more, the average clip length in my dataset is only 2.5 s. Do you think such short audio will cause errors in the TimeMasking and affect the final performance of the model?
The length depends on `input_tdim`; for your case, you should modify run.py to set `input_tdim=250`. `timem` should be smaller than `input_tdim`. Again, I suggest starting from either the speechcommands or esc50 recipe to get more familiar with the code.
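For reference, a quick back-of-the-envelope way to pick `input_tdim`: the fbank features use a 10 ms frame shift by default, so a 2.5 s clip gives roughly 250 time frames. A minimal sketch, assuming that default frame shift (check your own dataloader settings); `estimate_input_tdim` is a hypothetical helper, not part of the repo:

```python
def estimate_input_tdim(clip_seconds: float, frame_shift_ms: float = 10.0) -> int:
    """Rough number of fbank time frames for a clip of the given length,
    assuming a fixed frame shift (10 ms is the common default)."""
    return int(clip_seconds * 1000.0 / frame_shift_ms)

print(estimate_input_tdim(2.5))  # -> 250
```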
Hi Dr. Gong, I commented out the timem line and the program ran successfully. For my own dataset, I split it into train and test sets: 20 percent of each class for test and the rest for train. After training, I got a strange result, as shown:
0.908529570359724 0.965062499999999 0.685776370522133 1 2.56357337530442
These are the values from wa_result.csv. I'm very confused about this. Besides, the RECALL value is 1 every epoch. I also tried using inference.py to predict on an audio file, and it seems to work well. I would appreciate it if you could share your line of thinking for solving this problem.
Yours
I also encountered the same problem. Could you please tell me how to solve it in detail? Thank you very much.
> I also encountered the same problem, could you please tell me how to solve it in detail, Thank you very much

What exactly is your problem?
About start training: IndexError: tuple index out of range?
> About start training: IndexError: tuple index out of range?
I commented out https://github.com/YuanGongND/ast/blob/7b2fe7084b622e540643b0d7d7ab736b5eb7683b/src/dataloader.py#L192 and https://github.com/YuanGongND/ast/blob/7b2fe7084b622e540643b0d7d7ab736b5eb7683b/src/dataloader.py#L193 directly. Then I performed data augmentation before training and fed the augmented data in as input.
Okay, thank you very much. I'm trying it now.
OK, I finally found the reason.

This is due to a torchaudio issue. We use torchaudio 0.8.1, in which the input of the masking can be [freq, time], while newer torchaudio versions only accept [1, freq, time].

I have fixed it with a workaround (works for both old and new torchaudio) at https://github.com/YuanGongND/ast/blob/b7086755ebfd9f2ab018c0a40722b8418d9d41fe/src/dataloader.py#L190-L197
Your workaround (commenting out the time masking) might cause an inaccurate masking span and lead to a performance drop (though it might be small). I would suggest using our fixed code; it is very simple.
You can use this Colab script to reproduce the bug: https://colab.research.google.com/github/YuanGongND/ast/blob/master/colab/torchaudio_SpecMasking_1_1.ipynb
Hi Dr. Gong, I am using AST on my own dataset. I created the .json file and .csv file according to the guide. However, when I run run.sh, an error occurs and I haven't been able to fix it. The error is as shown below; I don't know how to solve it. I would appreciate it if you could tell me the reason.
Yours