YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

Question about json file and label index #56

Closed TungyuYoung closed 2 years ago

TungyuYoung commented 2 years ago

Good day!

It is my pleasure to read your paper "AST: Audio Spectrogram Transformer". You have done an excellent job.

While trying to use your model on my own dataset, I ran into a problem: how do I create the json file and the label index? I don't quite understand the specific format and parameter configuration. My dataset consists of five categories with 1300 audio clips in each category. In your example in /egs/audioset/data/class_labels_indices.csv, I don't understand what 'mid' means. I'm just a newbie, so I apologize if this is a stupid question.

I would appreciate it if you could answer my questions patiently.

Yours sincerely.

YuanGongND commented 2 years ago

Hi there,

mid is just the term defined by AudioSet. FYI, it is not trivial to define the labels for audio events, so Google used something like a graph to decide which audio event labels are interesting; mid might just be the name of the label node in that graph.

mid is used as the label in our training pipeline. It can be any unique string: if you have five categories, it can be just cat001, cat002, ..., cat005, or anything you want. The third column, display_name, is just for display purposes and can be any string.
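As a rough illustration of the above, here is a minimal sketch that writes a label index for five categories. The header row (index, mid, display_name) is assumed from the column names mentioned in this thread, and the category names are hypothetical placeholders; please compare against /egs/audioset/data/class_labels_indices.csv before relying on it.

```python
import csv
import io

def write_label_csv(display_names, out):
    """Write one (index, mid, display_name) row per category to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    # Assumed header, based on the columns discussed in this thread.
    writer.writerow(["index", "mid", "display_name"])
    for index, name in enumerate(display_names):
        # The mid only has to be a unique string; cat001, cat002, ... is one arbitrary choice.
        writer.writerow([index, f"cat{index + 1:03d}", name])

# Hypothetical category names, for illustration only.
categories = ["dog", "rain", "siren", "speech", "engine"]

buffer = io.StringIO()
write_label_csv(categories, buffer)
label_csv_text = buffer.getvalue()
print(label_csv_text)
```

To write a real file instead of an in-memory buffer, pass a file opened with open("class_labels_indices.csv", "w", newline="") in place of the buffer.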

I think the simplest way to build your json file is to look at https://github.com/YuanGongND/ast/blob/7b2fe7084b622e540643b0d7d7ab736b5eb7683b/egs/speechcommands/prep_sc.py#L95-L117 or https://github.com/YuanGongND/ast/blob/master/egs/esc50/prep_esc50.py.

Please also check the readme file for audio length and input normalization. These are also important when using AST.
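For orientation, a minimal sketch of building such a json file follows. The layout (a top-level "data" list whose entries have "wav" and "labels" keys) is my reading of the linked prep scripts and should be treated as an assumption; the file paths and the build_datafile helper are hypothetical.

```python
import json

def build_datafile(samples):
    """Turn (wav_path, mid) pairs into the assumed datafile layout."""
    # Assumed layout: {"data": [{"wav": <path>, "labels": <mid>}, ...]}
    return {"data": [{"wav": wav, "labels": mid} for wav, mid in samples]}

# Hypothetical paths; the mid values must match the ones in the label index csv.
samples = [
    ("/data/my_dataset/cat001/clip_0001.wav", "cat001"),
    ("/data/my_dataset/cat002/clip_0001.wav", "cat002"),
]

datafile = build_datafile(samples)
datafile_json = json.dumps(datafile, indent=1)
print(datafile_json)
```

You would typically write one such file for the training set and one for the evaluation set, and check the result against the output of the linked scripts.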
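Before choosing hyperparameters, it can help to check the sample rate and duration of your own clips. The sketch below uses only Python's standard wave module and assumes uncompressed PCM WAV files; wav_info is a hypothetical helper, and the demo file is generated on the fly so the example is self-contained.

```python
import os
import tempfile
import wave

def wav_info(path):
    """Return (sample_rate_hz, duration_seconds) for a PCM WAV file."""
    with wave.open(path, "rb") as w:
        rate = w.getframerate()
        return rate, w.getnframes() / rate

# Demo input: one second of 16-bit mono silence at 16 kHz.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.wav")
with wave.open(demo_path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)       # 2 bytes per sample = 16-bit audio
    w.setframerate(16000)
    w.writeframes(b"\x00\x00" * 16000)

sample_rate, duration = wav_info(demo_path)
print(sample_rate, duration)
```

Running wav_info over every file in a dataset shows whether the clips share one sample rate and roughly one length, which is what the readme notes on audio length are about.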

-Yuan

YuanGongND commented 2 years ago

Also, I'd recommend running the ESC-50 recipe first if you are new to audio classification. That recipe is one click (just ./run_esc.sh) and should finish in a few hours with 4 GPUs. Check https://github.com/YuanGongND/ast/issues/54#issuecomment-1073422922 if you only have one GPU.

Once you can reproduce the ESC-50 results, you can start from the ESC-50 recipe to adapt to your task.

-Yuan

TungyuYoung commented 2 years ago

Thanks for your enthusiastic answer. Your timely answer is very helpful to me, especially for a novice like me. I will keep studying hard. Have a good day :)

TungyuYoung commented 2 years ago

Thanks again. I have 4 RTX 3090 GPUs for my project. I will try ESC-50 first. Would there be any problem if I just swap in my own dataset? My task is just a simple 5-class classification project. I have noticed that Transformers perform better than CNNs on large amounts of data. Do I need to perform data augmentation before training?

YuanGongND commented 2 years ago

> Thanks again. I have 4 GPU RTX3090 for my project.

That is sufficient to run the original recipe; nothing needs to change.

> I will try the ESC-50 first. Is there any problem that if I just replace my own dataset? My task is just a simple 5-classification project.

There are some things you need to take care of, e.g., the audio length. ESC-50 clips are 5 s at 16 kHz; if yours are similar, you can reuse the hyperparameters of the ESC-50 recipe, otherwise you need to search for new ones.

ESC-50 uses 5-fold cross-validation; you can just use a training/test split for your new task.

> I have noticed that Transformer performs better than the CNN on a large amount of data. Should I need to perform data augmentation before training?

Our training recipe handles the data augmentation, and 1,300 samples are sufficient. The recipe uses an AudioSet-pretrained model.
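For the single training/test split mentioned above, one possible sketch is a per-class (stratified) split. The 80/20 ratio, the fixed seed, the stratified_split helper, and the file names are all assumptions chosen for illustration and are not part of the recipe.

```python
import random
from collections import defaultdict

def stratified_split(samples, test_fraction=0.2, seed=0):
    """Split (wav_path, label) pairs so each class keeps the same test fraction."""
    by_label = defaultdict(list)
    for wav, label in samples:
        by_label[label].append((wav, label))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        items = by_label[label]
        rng.shuffle(items)
        n_test = int(len(items) * test_fraction)
        test.extend(items[:n_test])
        train.extend(items[n_test:])
    return train, test

# Hypothetical dataset: 5 classes with 1300 clips each, as described in this thread.
samples = [
    (f"cat{c:03d}/clip_{i:04d}.wav", f"cat{c:03d}")
    for c in range(1, 6)
    for i in range(1300)
]
train_set, test_set = stratified_split(samples)
print(len(train_set), len(test_set))
```

Each of the two lists can then be written to its own json datafile.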

-Yuan

TungyuYoung commented 2 years ago

I really appreciate your patient answer! Your answer has been very helpful to me!

TungyuYoung commented 2 years ago

Dear Gong, I ran into a new problem while trying ESC-50. I ran ./run_esc.sh and an error occurred:

fold 1: 1600 training samples, 400 test samples
fold 2: 1600 training samples, 400 test samples
fold 3: 1600 training samples, 400 test samples
fold 4: 1600 training samples, 400 test samples
fold 5: 1600 training samples, 400 test samples
Finished ESC-50 Preparation

I downloaded and unzipped the dataset and put it into the folder ./egs/egs50/data myself, because it seems it cannot be downloaded automatically.

[image: pr1]

If I delete the folder ./exp, training seems to run successfully until this problem occurs. What's more, the same problem was reported in another issue. And I changed the base_exp_dir to “

YuanGongND commented 2 years ago

You should follow the recipe and minimize changes. I don't know why ESC-50 cannot be downloaded, but if you download it yourself, the code will assume it has already been processed and start training, which causes the issue.