YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

How to configure the dataset or modify the code if I want to do the one class binary classification #91

Open nanyyyyyy opened 1 year ago

nanyyyyyy commented 1 year ago

thanks

YuanGongND commented 1 year ago

Please follow the ESC-50 recipe (50-class classification with an AudioSet-pretrained model) and just change https://github.com/YuanGongND/ast/blob/9e3bd9942210680b833b08c39d09f2284ddc4d1d/egs/esc50/run_esc.sh#L69

to --n_class 2 for binary classification. I recommend first running the ESC-50 recipe without any modification; it is one-click and automatically generates the JSON datafiles. Once you can run it successfully (which means there are no environment issues or other problems), you can start modifying from there.

-Yuan

nanyyyyyy commented 1 year ago

I am constructing the dataset JSON file with the dictionary {'/m/spcmd00': 0, '/m/spcmd01': 1} and setting n_class to 2, where 0 and 1 are the two states of the single class. Does this make sense? Thank you so much! Great work, by the way.
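
A minimal sketch of building such a datafile. It assumes the JSON layout generated by the ESC-50 recipe, i.e. a top-level "data" list whose entries carry a "wav" path and a "labels" string; the file paths and the build_datafile helper are hypothetical and only for illustration.

```python
import json

# Label-to-index mapping quoted in the comment above.
LABEL_TO_INDEX = {"/m/spcmd00": 0, "/m/spcmd01": 1}


def build_datafile(samples):
    """Turn (wav_path, label_mid) pairs into a dict in the assumed recipe layout."""
    entries = []
    for wav_path, mid in samples:
        if mid not in LABEL_TO_INDEX:
            raise ValueError(f"unknown label: {mid}")
        # Assumption: each entry stores the label identifier, not the index;
        # the index is looked up separately through the label CSV.
        entries.append({"wav": wav_path, "labels": mid})
    return {"data": entries}


# Hypothetical file paths, for illustration only.
datafile = build_datafile([
    ("/data/binary/clip_0001.wav", "/m/spcmd00"),
    ("/data/binary/clip_0002.wav", "/m/spcmd01"),
])
serialized = json.dumps(datafile, indent=1)
```

Comparing the result with a datafile generated by the unmodified ESC-50 recipe is a quick way to check that the layout matches.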

YuanGongND commented 1 year ago

This makes sense. But you also need to take care of the hyper-parameters. In particular, audio_length should be the maximum length, in frames, of the audio in your dataset (e.g., 100 for 1s audio), and timem should be about 20% of your average audio length, e.g., 25 for 1s audio. You also need to tune the learning rate, etc.
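
The rule of thumb above can be sketched as a small helper. It assumes 100 frames per second (a 10 ms frame shift, consistent with "100 for 1s audio"); the function name and the exact 20% ratio are illustrative assumptions, and the result is a starting point for tuning rather than a prescribed value.

```python
FRAMES_PER_SECOND = 100  # assumption: 10 ms frame shift


def suggest_hparams(durations_sec, time_mask_ratio=0.2):
    """Suggest audio_length and timem from a list of clip durations in seconds."""
    if not durations_sec:
        raise ValueError("need at least one clip duration")
    # audio_length: the longest clip, expressed in frames.
    audio_length = round(max(durations_sec) * FRAMES_PER_SECOND)
    # timem: roughly 20% of the average clip length in frames.
    avg_frames = sum(durations_sec) / len(durations_sec) * FRAMES_PER_SECOND
    timem = round(avg_frames * time_mask_ratio)
    return {"audio_length": audio_length, "timem": timem}


# Example: a dataset in which every clip lasts 10 seconds.
hparams = suggest_hparams([10.0, 10.0, 10.0])
```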

1244547821 commented 1 year ago

I want to test ESC-50 on Windows. Is it possible?

YuanGongND commented 1 year ago

It might be possible if you have a torch environment set up on Windows, though many things would need to be changed in https://github.com/YuanGongND/ast/blob/master/egs/esc50/prep_esc50.py and https://github.com/YuanGongND/ast/blob/master/egs/esc50/run_esc.sh, and maybe elsewhere. An easier way might be to use the Google Colab environment; I think it is OK for ESC-50 because the dataset is small.

-Yuan

1244547821 commented 1 year ago

ok, thanks for your answer.

1244547821 commented 1 year ago

Please follow the ESC-50 recipe (50-class classification with an AudioSet-pretrained model) and just change https://github.com/YuanGongND/ast/blob/9e3bd9942210680b833b08c39d09f2284ddc4d1d/egs/esc50/run_esc.sh#L69

to --n_class 2 for binary classification. I recommend first running the ESC-50 recipe without any modification; it is one-click and automatically generates the JSON datafiles. Once you can run it successfully (which means there are no environment issues or other problems), you can start modifying from there. -Yuan

I am constructing the dataset JSON file with the dictionary {'/m/spcmd00': 0, '/m/spcmd01': 1} and setting n_class to 2, where 0 and 1 are the two states of the single class. Does this make sense? Thank you so much! Great work, by the way.

Have you completed the binary classification? I have some questions about modifying the parameters. How did you modify --freqm, --timem, --tstride, --fstride, and --audio_length?

YuanGongND commented 1 year ago

I usually suggest first reproducing the ESC-50 recipe and then modifying the hyper-parameters; this helps you rule out other factors that could impact the performance.

The hyper-parameters you listed are not related to the number of classes:

audio_length should be your input audio length in frames, e.g., 1000 for 10-second audio. timem is the maximum mask augmentation in the time domain and should be around 20% of audio_length, e.g., 200. freqm is the maximum mask in the frequency domain; you can keep it the same as in the ESC-50 recipe. tstride and fstride are the patch split strides; you should keep them the same as in the ESC-50 recipe.

For a binary classification problem, you do need to modify --label-csv ./data/esc_class_labels_indices.csv --n_class 50. label-csv should point to a CSV that contains only 2 labels, and n_class should be 2.

-Yuan
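
A minimal sketch of writing a two-label CSV for --label-csv. It assumes the same three columns as the ESC-50 recipe's label file (index, mid, display_name); the display names "negative" and "positive" and the write_label_csv helper are hypothetical, and the label identifiers are the ones quoted earlier in this thread.

```python
import csv
import io


def write_label_csv(fileobj, labels):
    """Write (mid, display_name) pairs as an index,mid,display_name table."""
    writer = csv.writer(fileobj, lineterminator="\n")
    writer.writerow(["index", "mid", "display_name"])
    for index, (mid, name) in enumerate(labels):
        writer.writerow([index, mid, name])


# Written to an in-memory buffer here; pass an open file object to save to disk.
buffer = io.StringIO()
write_label_csv(buffer, [("/m/spcmd00", "negative"), ("/m/spcmd01", "positive")])
label_csv_text = buffer.getvalue()
```

With exactly two data rows in this file, --n_class 2 matches the number of labels.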

1244547821 commented 1 year ago

Thank you very much, I have modified it.

1244547821 commented 1 year ago

The mean and std I calculated for ESC-50 are different from the dataset_mean=-6.6268077 and dataset_std=5.358466 in run_esc.sh. I don't know where I went wrong. Could you please help?

YuanGongND commented 1 year ago

What's your mean and std?

1244547821 commented 1 year ago

mean=0.000238, std=0.000841. I feel that this is wrong. The help text in run.py describes it as the dataset spectrogram mean, so I calculated it from an FFT. How should it be calculated?

YuanGongND commented 1 year ago

Is this your own dataset? This is certainly not correct, as the std should not be close to 0. You can check the other issues to find out how to calculate the mean and std.
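
A rough sketch of one way to compute such statistics, under stated assumptions: the values are taken over the same log-mel filterbank features the model consumes (not raw FFT magnitudes), and the per-clip mean and standard deviation are averaged across clips. Feature extraction is left out, the helper name is hypothetical, and the toy numbers below are made up purely to exercise the function.

```python
import statistics


def spectrogram_norm_stats(spectrograms):
    """Average the per-clip mean and std over 2-D feature arrays (frames x mel bins)."""
    means, stds = [], []
    for spec in spectrograms:
        # Flatten the clip so the statistics cover every time-frequency bin.
        values = [value for frame in spec for value in frame]
        means.append(statistics.fmean(values))
        stds.append(statistics.pstdev(values))
    return statistics.fmean(means), statistics.fmean(stds)


# Toy log-mel-like features (2 frames x 2 bins per clip), not real data.
toy_clips = [
    [[-8.0, -4.0], [-8.0, -4.0]],
    [[-6.0, -6.0], [-6.0, -6.0]],
]
dataset_mean, dataset_std = spectrogram_norm_stats(toy_clips)
```

Features on a log scale typically give a negative mean and a standard deviation of a few units, which is the scale of the values in run_esc.sh; values near zero suggest the statistics were taken on a different representation.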

1244547821 commented 1 year ago

Hi Yuan, when training on my own two-class dataset, the Avg Precision in every epoch is 0.5 and the Avg Recall is always 1. Could you help me understand why? The following is my result:

start validation
acc: 0.934211
AUC: 0.981920
Avg Precision: 0.500000
Avg Recall: 1.000000
d_prime: 2.962967
train_loss: 0.260971
valid_loss: 0.466797
validation finished
Epoch-2 lr: 1e-05