Closed keunwoochoi closed 3 years ago
@keunwoochoi
I'm confident the default code reproduces the results with the uploaded checkpoint and the validation and test samples. I ran a test with the default source code and it reproduced completely. Also, there is no conv2d code in my repository.
But the remaining problem is AudioSet. A few days ago, I found out that the audioset_augmentor source stopped working at some point after I downloaded AudioSet. Making many requests in a short time gets blocked for the youtube-dl package. I think YouTube added protection logic to prevent crawling of their videos, or some blocking code was added to the package itself. So if I cannot find another way to download YouTube audio, audioset_augmentor will be deprecated.
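A common workaround for this kind of rate limiting is to space out requests and retry with backoff. Below is a minimal sketch of that idea; `download_fn` is a hypothetical stand-in for whatever actually fetches one audio file (e.g. a youtube-dl invocation), not a function from this repository:

```python
import time

def throttled_download(video_ids, download_fn, delay_s=5.0, max_retries=3,
                       sleep_fn=time.sleep):
    """Download each id with a delay between requests, retrying on failure.

    download_fn(video_id) -> bool is a stand-in for the real fetch step;
    sleep_fn is injectable so the throttling can be tested without waiting.
    """
    succeeded, failed = [], []
    for i, vid in enumerate(video_ids):
        if i > 0:
            sleep_fn(delay_s)  # throttle: pause between consecutive requests
        for attempt in range(max_retries):
            if download_fn(vid):
                succeeded.append(vid)
                break
            sleep_fn(delay_s * (attempt + 1))  # linear backoff before retrying
        else:
            failed.append(vid)
    return succeeded, failed
```

For what it's worth, youtube-dl itself also has a `--sleep-interval` option that serves a similar purpose, which may be worth trying before wrapping it like this.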
To summarize the issue: I can only share the checkpoint file and the validation and test samples, which were trained with AudioSet files for the general sound case. But it is still possible to reproduce singing voice separation.
By the way, at the time you fetched the AudioSet files, it was still possible to download the audio.
Let's check one thing: after downloading and pre-processing the audio, is the number of human-class audio files you have about 18k? (The count can shrink due to license issues; a download can be rejected when a video's license changes.)
> your the number of audio files are about 18k
Yes, I also had the issue with youtube-dl but managed to work around it on my laptop, and I got about 18k files.
And thanks for the answer, but the training data is still unclear to me. When you trained the uploaded checkpoint, was the training data Voicebank, AudioSet, or both?
This code line is the augmentation point that uses AudioSet.
OK, then please check the snippet below. Does it print out the correct number of audio files?
from audioset_augmentor.augmentor import AUDIO_LIST
print(len(AUDIO_LIST))
Hm, I understand the code. But my question is about which dataset exactly was used for the released model: AudioSet? Voicebank? I'm confused because the readme reads like AudioSet was used, but previously when we were chatting you mentioned AudioSet actually didn't seem to help.
To answer your question, I don't actually know, because I didn't run your code as-is. I ran the code below to get 12147 non-speech audio files from the balanced set. Not sure if it helps, but anyway..
def gnu_main(meta_path: str = '../assets/unbalanced_train_segments.csv'):
    # load config
    meta_info = get_audio_info(meta_path)
    # make args
    args_list = []
    human_filenames, nonhuman_filenames = [], []
    noise_human_filenames = []
    noise_nonhuman_filenames = []
    human_ids = collect_all_ids()
    noise_ids = collect_all_ids('/m/096m7z')
    for item in meta_info.iterrows():
        renamed_yt_id = item[1]['YTID'].replace('-', 'XX')
        tag_id = item[1]['positive_labels'].split(',')
        is_human_sample = any([i in human_ids for i in tag_id])
        is_noise_sample = any([i in noise_ids for i in tag_id])
        if is_human_sample:
            human_filenames.append(renamed_yt_id)
        else:
            nonhuman_filenames.append(renamed_yt_id)
        if is_noise_sample:
            if is_human_sample:
                noise_human_filenames.append(renamed_yt_id)
            else:
                noise_nonhuman_filenames.append(renamed_yt_id)
    list_to_txt(human_filenames, 'unbalanced_audioset_speech_ids.txt')
    list_to_txt(nonhuman_filenames, 'unbalanced_audioset_nonspeech_ids.txt')
    list_to_txt(noise_human_filenames, 'unbalanced_audioset_noise_human_ids.txt')
    list_to_txt(noise_nonhuman_filenames, 'unbalanced_audioset_noise_nonhuman_ids.txt')
    print(len(human_filenames), len(nonhuman_filenames), len(noise_human_filenames), len(noise_nonhuman_filenames))
    # balanced set:   2351   19809     32   96
    # unbalanced set: 128997 1912792  732 1252

def gnu_further():
    my_filenames = txt_to_list('filenames_all.txt')
    nonspeech_filenames = set(txt_to_list('balanced_audioset_nonspeech_ids.txt'))
    # import ipdb; ipdb.set_trace()
    print(len(my_filenames))
    my_filenames = [f for f in my_filenames
                    if f.split('/')[1].replace('.wav', '') in nonspeech_filenames]
    print(len(my_filenames))
    list_to_txt(my_filenames, 'filenames.txt')
    list_to_txt(my_filenames[:-500], 'filenames_train.txt')
    list_to_txt(my_filenames[-500:], 'filenames_test.txt')
    # 2019-10-24. ugh, out of 13549, now I only use 12147 non-speech ones as
    # noise signals. Meaning, ~10% of them were speech.
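The helpers above (get_audio_info, collect_all_ids, list_to_txt, txt_to_list) are project-specific, but the core partitioning logic can be sketched self-containedly over plain dicts. The label ids in the test below are made up for illustration, except /m/096m7z, which the code above uses for the noise class:

```python
def partition_by_labels(rows, human_ids, noise_ids):
    """Split AudioSet metadata rows into human/non-human and noise buckets.

    rows: iterable of dicts with 'YTID' and 'positive_labels' (comma-separated
    label ids). Mirrors the logic above, including rewriting '-' to 'XX' so the
    YouTube id is safe to use as a filename.
    """
    human, nonhuman, noise_human, noise_nonhuman = [], [], [], []
    for row in rows:
        renamed = row['YTID'].replace('-', 'XX')
        tags = row['positive_labels'].split(',')
        is_human = any(t in human_ids for t in tags)
        is_noise = any(t in noise_ids for t in tags)
        (human if is_human else nonhuman).append(renamed)
        if is_noise:
            (noise_human if is_human else noise_nonhuman).append(renamed)
    return human, nonhuman, noise_human, noise_nonhuman
```

Note that every clip lands in exactly one of human/nonhuman, while the noise buckets are a second, overlapping partition, which is why the counts in the comments above don't sum across all four lists.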
The number of downloaded AudioSet files can be reduced by license issues (youtube-dl can only download videos without license restrictions) or by network issues.
After running the experiments again: AudioSet definitely helps inference quality. And the model is trained with Voicebank.
Hmm.. From another angle: can the inference code generate similar results with the pre-trained checkpoint? If it can generate samples correctly, that would confirm the Voicebank processing code works.
OK, Voicebank only. Thanks! About 1-2 months ago I tried to load your checkpoint, but something went wrong. I think it was a layer-name mismatch, but I don't remember exactly. I'll try it soon and let you know.
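Layer-name mismatches when loading a checkpoint can often be fixed by rewriting key prefixes in the state dict before loading. A framework-agnostic sketch over a plain dict of keys (the `module.` prefix below is just the typical example left behind by DataParallel-style wrappers, not something confirmed about this checkpoint):

```python
def remap_state_dict_keys(state_dict, rename):
    """Return a copy of state_dict with key prefixes rewritten.

    rename: dict mapping old prefix -> new prefix, applied to each key;
    the first matching prefix wins, unmatched keys pass through unchanged.
    """
    out = {}
    for key, value in state_dict.items():
        for old, new in rename.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break
        out[key] = value
    return out
```

With a real model you would then pass the remapped dict to the framework's loader; diffing the key sets of the checkpoint and the freshly built model is usually the quickest way to see what the mismatch actually is.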
Ah, the model is not only trained on Voicebank; the AudioSet augmentation is also included!
Hmm.. I think the pre-trained checkpoint should be tested in another environment. If you cannot generate similar results with the pre-trained checkpoint, please tell me! Thanks for checking this out in detail.
Ok thanks! I'll try some stuff (loading the model, etc) and get back to you.
It seems difficult to reproduce the experimental results. Once my desktop is idle, I will refactor this repository to make the model easier to use and train. I'll close this issue and post an update here at that time.
Hi, after asking a few questions I'm still not 100% sure how the released model was trained (the one used for these samples: https://drive.google.com/open?id=1CafFnqWn_QvVPu2feNLn6pnjRYIa_rbP).
Was it, in the end, AudioSet or not? And it was conv1d, not conv2d, right? I'd really love to reproduce it but can't quite figure it out. Can you kindly provide more details? Thanks :)