YuanGongND / ast

Code for the Interspeech 2021 paper "AST: Audio Spectrogram Transformer".
BSD 3-Clause "New" or "Revised" License

The problem of reproducing the AST result in full dataset #85

Open MichaelLynn1996 opened 1 year ago

MichaelLynn1996 commented 1 year ago

Hi! Yuan Gong, nice job! My research is also based on your pipeline, but I found that I can't reproduce the AST results from the paper on the full dataset. My dataset is downloaded from qiuqiangkong/audioset_tagging_cnn (github.com) (the sizes of the unbalanced, balanced, and eval sets are 1912024, 20550, and 18886 respectively), and I pre-processed the dataset following your instructions, including resampling to 16kHz, generating the weight file, and re-calculating the mean and std (our mean and std are -3.539583 and 3.4221482, which differ from the -4.2677393 and 4.5689974 in your code).

The results of the 5 epochs are: 0.405, 0.421, 0.434, 0.433, 0.433, compared with the results given in your source code: 0.4153, 0.439, 0.448, 0.449, 0.449.

The final ensemble result is 0.445, which is quite different from the 0.459 result in your paper. We checked that the hyperparameters of the experiment are the same as in your code. Can you give some advice on what might cause such a problem? Looking forward to your reply.

Thanks


MichaelLynn1996 commented 1 year ago

Please feel free to reply in either English or Chinese.

YuanGongND commented 1 year ago

Hi Michael,

Thanks for reporting this. Without checking the actual code, it's hard for me to give specific suggestions, and I apologize that I don't have time to do the debugging myself.

My dataset is downloaded from qiuqiangkong/audioset_tagging_cnn (github.com) (the sizes of the unbalanced, balanced, and eval sets are 1912024, 20550, and 18886 respectively)

I don't know about this release, but it seems to be slightly smaller than the version we have, which was downloaded directly from YouTube; please check https://arxiv.org/abs/2102.01243 for the exact size of our data. I don't think such a data-size difference would cause a large performance gap, especially on the full set. FYI, I also tested our model on a version collected independently with only about 90% of the samples, and the performance difference was very small.

pre-processed the dataset following your instructions, including resampling to 16kHz, generating the weight file, and re-calculating the mean and std (our mean and std are -3.539583 and 3.4221482, which differ from the -4.2677393 and 4.5689974 in your code)

The fact that your mean and std are different from ours suggests your data format and/or processing is different from ours. This might be the problem. How did you resample the audio? How did you generate the weight file?
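
For reference, the dataset stats can be recomputed with a short loop. Below is a minimal sketch, assuming the same torchaudio Kaldi fbank settings as the AST dataloader (128 mel bins, hanning window, 10 ms shift, no dither); note that the official script computes the stats over the padded dataloader output, so the numbers may differ slightly:

import json
import torchaudio

def fbank_stats(datafile, max_files=5000):
    # running sum and sum-of-squares over all frames and mel bins
    total, total_sq, count = 0.0, 0.0, 0
    with open(datafile) as f:
        samples = json.load(f)['data'][:max_files]  # subsample for speed
    for s in samples:
        wav, sr = torchaudio.load(s['wav'])
        wav = wav - wav.mean()
        fb = torchaudio.compliance.kaldi.fbank(
            wav, htk_compat=True, sample_frequency=sr, use_energy=False,
            window_type='hanning', num_mel_bins=128, dither=0.0, frame_shift=10)
        total += fb.sum().item()
        total_sq += (fb ** 2).sum().item()
        count += fb.numel()
    mean = total / count
    std = (total_sq / count - mean ** 2) ** 0.5
    return mean, std

print(fbank_stats('./datafiles/balanced_train_data.json'))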

The results of the 5 epochs are: 0.405, 0.421, 0.434, 0.433, 0.433, compared with the results given in your source code: 0.4153, 0.439, 0.448, 0.449, 0.449

This is a very large difference on full AudioSet.

Are you using the exact same pipeline as ours, or did you change anything? Anything in the pipeline could lead to a performance change; please see our PSLA paper and this paper. You could also run the ESC-50 recipe and see if you get performance similar to ours, so you can disentangle whether it is a data problem or a model/pipeline problem. Finally, there are a few repos (e.g., https://github.com/kkoutini/PaSST) with models similar to AST; you can run their code and see if you can reproduce their results.

Hope this helps.

-Yuan

MichaelLynn1996 commented 1 year ago

Hi Yuan, thanks for your quick reply and the more detailed information about your experiment.

I don't know about this release, but it seems to be slightly smaller than the version we have, which is directly downloaded from YouTube.

This release was shared by Kong for reproducing PANNs. The sample rate and format of the audio are 32kHz and wav. I used this release because it is difficult to access YouTube from mainland China, and this version can be downloaded via Baidu Netdisk. But I will try to complete the AudioSet download soon in order to get closer to your data.

How did you resample the audio? How did you generate the weight file?

The code I use to resample the audio and generate the training json files was modified from your PSLA code psla/prep_fsd.py at main · YuanGongND/psla (github.com):

import pandas as pd  
import json  
import os  
from tqdm import tqdm  

# dataset downloaded from https://zenodo.org/record/4060432#.YXXR0tnMLfs
# please change it to your AudioSet dataset path  

data_path = '/workspace/datasets/AudioSet/'  

def get_immediate_files(a_dir):  
    return [name for name in os.listdir(a_dir) if os.path.isfile(os.path.join(a_dir, name))]  

# ['unbalanced_train_segments', 'balanced_train_segments', 'eval_segments'] for data_list  
data_list = ['unbalanced_train_segments', 'balanced_train_segments', 'eval_segments']  
resample = False  

if resample:  
    # convert all samples to 16kHz
    # Don't forget to install sox: sudo apt install sox
    print('Now converting all AudioSet audio to 16kHz, this may take dozens of minutes.')

    resample_cnt = 0  
    for data in data_list:  
        print('now processing', data)  
        base_path = data_path + data + '/'  
        target_path = data_path + data + '_16k/'  
        if not os.path.exists(target_path):  
            os.mkdir(target_path)  
        if data == 'unbalanced_train_segments':  
            unbalanced_fold = os.listdir(data_path + data + '/')  
            for fold_name in unbalanced_fold:  
                files = get_immediate_files(base_path + fold_name + '/')  
                for audiofile in files:  
                    if not os.path.exists(target_path + audiofile):  
                        os.system(  
                            'sox ' + base_path + fold_name + '/' + audiofile + ' -r 16000 ' + target_path + audiofile + '> /dev/null 2>&1')  
                    resample_cnt += 1  
                    if resample_cnt % 1000 == 0:  
                        print('Resampled {:d} samples.'.format(resample_cnt))  
        else:  
            files = get_immediate_files(base_path)  
            for audiofile in files:  
                if not os.path.exists(target_path + audiofile):  
                    os.system(  
                        'sox ' + base_path + audiofile + ' -r 16000 ' + target_path + audiofile + '> /dev/null 2>&1')  
                    resample_cnt += 1  
                    if resample_cnt % 1000 == 0:
                        print('Resampled {:d} samples.'.format(resample_cnt))

    print('Resampling finished.')  
    print('--------------------------------------------')  

# create json datafiles for training, validation, and evaluation set  

if not os.path.exists('datafiles'):  
    os.mkdir('datafiles')  

if 'unbalanced_train_segments' in data_list:  
    # use the official training and validation set split.  
    unbalanced_path = data_path + 'unbalanced_train_segments.csv'  
    unbalanced_df = pd.read_csv(unbalanced_path, sep=', ', engine='python', skiprows=1, header=None)
    unbalanced_df.columns = ['YTID', 'start_seconds', 'end_seconds', 'positive_labelsm']  
    # unbalanced_dirs = os.listdir(data_path + 'unbalanced_train_segments')  
    # print(unbalanced_df.head())  
    unbalanced_data = []  
    for row in tqdm(unbalanced_df.itertuples()):  
        # print(row)  
        # for i in range(len(unbalanced_dirs)):
        #     wav_path = data_path + 'unbalanced_train_segments/' + unbalanced_dirs[i] + '/Y' + getattr(row, 'YTID') + '.wav'
        #     # print(wav_path)
        #     if os.path.exists(wav_path):
        #         cur_dict = {"wav": wav_path, "labels": getattr(row, 'positive_labelsm').strip('"')}
        #         # print(cur_dict)
        #         unbalanced_data.append(cur_dict)
        #         continue
        wav_path = data_path + 'unbalanced_train_segments_16k/Y' + getattr(row, 'YTID') + '.wav'
        if os.path.exists(wav_path):  
            cur_dict = {"wav": wav_path, "labels": getattr(row, 'positive_labelsm').strip('"')}  
            # print(cur_dict)  
            unbalanced_data.append(cur_dict)  

    with open('./datafiles/unbalanced_train_data.json', 'w') as f:  
        json.dump({'data': unbalanced_data}, f, indent=1)  
    print('Processed {:d} samples for the AudioSet unbalanced training set.'.format(len(unbalanced_data)))  

if 'balanced_train_segments' in data_list:  
    balanced_path = data_path + 'balanced_train_segments.csv'  
    balanced_df = pd.read_csv(balanced_path, sep=', ', engine='python', skiprows=1, header=None)
    balanced_df.columns = ['YTID', 'start_seconds', 'end_seconds', 'positive_labelsm']  

    balanced_data = []  
    for row in tqdm(balanced_df.itertuples()):  
        # print(row)  
        wav_path = data_path + 'balanced_train_segments_16k/Y' + getattr(row, 'YTID') + '.wav'  
        if os.path.exists(wav_path):  
            cur_dict = {"wav": wav_path, "labels": getattr(row, 'positive_labelsm').strip('"')}  
            # print(cur_dict)  
            balanced_data.append(cur_dict)  

    with open('./datafiles/balanced_train_data.json', 'w') as f:  
        json.dump({'data': balanced_data}, f, indent=1)  
    print('Processed {:d} samples for the AudioSet balanced training set.'.format(len(balanced_data)))  

if 'eval_segments' in data_list:  
    eval_path = data_path + 'eval_segments.csv'  
    eval_df = pd.read_csv(eval_path, sep=', ', engine='python', skiprows=1, header=None)
    eval_df.columns = ['YTID', 'start_seconds', 'end_seconds', 'positive_labelsm']  

    eval_data = []  
    for row in tqdm(eval_df.itertuples()):  
        # print(row)  
        wav_path = data_path + 'eval_segments_16k/Y' + getattr(row, 'YTID') + '.wav'  
        if os.path.exists(wav_path):  
            cur_dict = {"wav": wav_path, "labels": getattr(row, 'positive_labelsm').strip('"')}  
            # print(cur_dict)  
            eval_data.append(cur_dict)  

    with open('./datafiles/eval_data.json', 'w') as f:  
        json.dump({'data': eval_data}, f, indent=1)  
    print('Processed {:d} samples for the AudioSet eval set.'.format(len(eval_data)))

# (optional) create label enhanced set.  
# Go to /src/label_enhancement/
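
As a quick sanity check on the output of the script above, one can confirm that the resampled files really are 16kHz mono before recomputing the stats; a minimal sketch (the file name below is just a placeholder):

import torchaudio

# hypothetical file name, replace with one of your own resampled files
meta = torchaudio.info('/workspace/datasets/AudioSet/eval_segments_16k/Y-0DdlOuIRUI.wav')
print(meta.sample_rate, meta.num_channels)
assert meta.sample_rate == 16000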

The code I use to generate the weight file is exactly the same as your code ast/gen_weight_file.py at master · YuanGongND/ast (github.com); I just changed the data_path to my path.
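
For context, the idea of the weight file is inverse label frequency: rare classes get a larger sampling weight so that the weighted sampler rebalances the long-tailed unbalanced set. A minimal sketch of that logic (the exact smoothing constants in gen_weight_file.py may differ, so treat this as an illustration rather than a replacement):

import json
import numpy as np

with open('./datafiles/unbalanced_train_data.json') as f:
    data = json.load(f)['data']

# count how often each label string (e.g. '/m/09x0r') occurs in the set
label_count = {}
for sample in data:
    for label in sample['labels'].split(','):
        label_count[label] = label_count.get(label, 0) + 1

# a sample's weight is the sum of its labels' inverse frequencies
weights = np.array([
    sum(1000.0 / (label_count[l] + 0.01) for l in s['labels'].split(','))
    for s in data])
np.savetxt('./datafiles/unbalanced_train_weight.csv', weights, delimiter=',')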

Are you using the exact same pipeline as ours, or did you change anything? Anything in the pipeline could lead to a performance change; please see our PSLA paper and this paper. You could also run the ESC-50 recipe and see if you get performance similar to ours, so you can disentangle whether it is a data problem or a model/pipeline problem.

I can reproduce the results of the ESC-50 and Speech Commands V2 recipes, and even of the balanced AudioSet recipe. I have also checked every hyperparameter against your GitHub code, though I cannot guarantee there was no oversight, given how many earlier experiments I have run in your pipeline.

So I will do several jobs and report back once they are done:

  1. complete the dataset as described above
  2. re-download the ast code
  3. re-calculate the mean and std (is it necessary?)
  4. conduct the experiment

It might take a few days. Thank you for your warm-hearted help!

YuanGongND commented 1 year ago

This sounds like a reasonable plan.

re-calculate the mean and std (is it necessary?)

I don't mean you need to recalculate them, but the fact that your mean and std are different from ours indicates your data is different from ours.

-Yuan

wisekimm commented 11 months ago

Hi MichaelLynn1996,

Thank you for asking Yuan Gong this good question; it helped me a lot.

By the way, I downloaded the AudioSet data from the same place as you (qiuqiangkong/audioset_tagging_cnn (github.com)) and tried to reproduce AST. However, for the full set, my results are similar to the values you listed above (mean: -3.3834944, std: 3.8869045, mAP: 0.408, 0.425, 0.434, 0.433, 0.433).

Have you solved the problem?

Thanks :)

gevmin94 commented 9 months ago

@YuanGongND I would like to know if it is possible to share your downloaded version, similar to what qiuqiangkong/audioset_tagging_cnn did, in order to reproduce and compare the results.

YuanGongND commented 9 months ago

hi @gevmin94,

Unfortunately we cannot; this wouldn't pass our institute's review.

However, if you download the PANNs version, you can reproduce the result. Check this:

https://github.com/YuanGongND/ast/issues/108

-Yuan

gevmin94 commented 9 months ago

Thanks for your prompt reply @YuanGongND.

I'm facing difficulties in downloading the PANNs version from outside China. Obtaining a Baidu account, which is necessary for access, has proven challenging; despite my efforts, I couldn't create one from outside China. Additionally, I attempted to use yt-dlp to crawl the data, but currently approximately 25% of the files cannot be downloaded because the videos are no longer available. I'm wondering if you might have any suggestions or assistance for gaining access to the PANNs version.
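
For reference, my crawl looks roughly like the following (a sketch in the style of the prep script above; the YTID is a placeholder, and the flags assume a recent yt-dlp with ffmpeg installed, so treat it as an illustration):

import os

def download_segment(ytid, start, end, out_dir):
    # -x extracts audio; --download-sections fetches only the labeled 10s clip
    cmd = ('yt-dlp -x --audio-format wav '
           '--download-sections "*{:.1f}-{:.1f}" '
           '-o "{}/Y{}.%(ext)s" '
           'https://www.youtube.com/watch?v={}').format(start, end, out_dir, ytid, ytid)
    return os.system(cmd)  # non-zero return usually means the video is gone

download_segment('-0DdlOuIRUI', 30.0, 40.0, './eval_segments')  # hypothetical YTID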

YuanGongND commented 9 months ago

@gevmin94

For the first question, I cannot help as I don't use Baidu. Maybe you can ask the authors.

For the second, missing 25% seems like a lot. But we also don't have the full version; I guess we downloaded something around 90%.

https://www.dropbox.com/s/18hoeq92juqsg2g/audioset_2m_cleaned.json?dl=1 — this contains the ids (no audio) we used in another project; it is slightly smaller than the AST set but should be close.
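
If you want to estimate how close your copy is, you could compute the overlap with this id list, e.g. (a sketch; I'm assuming the json follows the same {'data': [{'wav': ...}]} layout as the AST datafiles, so please inspect the actual structure first):

import json
import os

with open('audioset_2m_cleaned.json') as f:
    ref_ids = {os.path.basename(e['wav']) for e in json.load(f)['data']}

with open('./datafiles/unbalanced_train_data.json') as f:
    local_ids = {os.path.basename(e['wav']) for e in json.load(f)['data']}

print('coverage: {:.1%}'.format(len(ref_ids & local_ids) / len(ref_ids)))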

-Yuan