Closed keunwoochoi closed 3 years ago
@keunwoochoi
I'm confident the default code reproduces the results with the uploaded checkpoint and the validation and test samples. I ran a test with the default source code and it reproduced completely. Also, there is no conv2d code in my repository.
But the remaining problem is AudioSet. A few days ago, I found out that the audioset_augmentor source stopped working at some point after I downloaded AudioSet. Making many requests in a short time gets blocked for the youtube-dl package. I think YouTube added protection logic to prevent crawling of their videos, or some blocking code was added to the package itself. So if I cannot find another way to download YouTube audio, audioset_augmentor will be deprecated.
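A common workaround for this kind of rate limiting is to space out requests and retry with backoff. Below is a minimal sketch of that idea; `download_fn` is a hypothetical stand-in for whatever actually fetches one audio file (e.g. a youtube-dl invocation), not a function from this repository:

```python
import time

def throttled_download(video_ids, download_fn, delay_s=5.0, max_retries=3,
                       sleep_fn=time.sleep):
    """Download each id with a delay between requests, retrying on failure.

    download_fn(video_id) -> bool is a stand-in for the real fetch step;
    sleep_fn is injectable so the throttling can be tested without waiting.
    """
    succeeded, failed = [], []
    for i, vid in enumerate(video_ids):
        if i > 0:
            sleep_fn(delay_s)  # throttle: pause between consecutive requests
        for attempt in range(max_retries):
            if download_fn(vid):
                succeeded.append(vid)
                break
            sleep_fn(delay_s * (attempt + 1))  # linear backoff before retrying
        else:
            failed.append(vid)
    return succeeded, failed
```

For what it's worth, youtube-dl itself also has a `--sleep-interval` option that serves a similar purpose, which may be worth trying before wrapping it like this.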
To summarize the issue: I can only share the checkpoint file and the validation and test samples, which were trained with AudioSet files for the general sound case. But it is still possible to reproduce singing voice separation.
By the way, at the time you fetched the AudioSet files, it was still possible to download the audio.
Let's check one thing: after downloading and pre-processing the audio, is the number of human-class audio files you have about 18k? (The count can shrink due to license issues; a download can be rejected when a video's license changes.)
> your the number of audio files are about 18k
Yes, I also had the issue with youtube-dl but managed to work around it on my laptop, and I got about 18k files.
And thanks for the answer, but the training data is still unclear to me. When you trained the uploaded checkpoint, was the training data Voicebank, AudioSet, or both?
This code line is the augmentation point that uses AudioSet.
OK, then please check the snippet below. Does it print out the correct number of audio files?
from audioset_augmentor.augmentor import AUDIO_LIST
print(len(AUDIO_LIST))
Hm, I understand the code. But my question is about which dataset exactly was used for the released model: AudioSet? Voicebank? I'm confused because the readme reads like AudioSet was used, but previously when we were chatting you mentioned AudioSet actually didn't seem to help.
To answer your question, I don't actually know, because I didn't run your code as-is. I ran the code below to get 12147 non-speech audio files from the balanced set. Not sure if it helps, but anyway..
def gnu_main(meta_path: str = '../assets/unbalanced_train_segments.csv'):
    # load config
    meta_info = get_audio_info(meta_path)
    # make args
    args_list = []
    human_filenames, nonhuman_filenames = [], []
    noise_human_filenames = []
    noise_nonhuman_filenames = []
    human_ids = collect_all_ids()
    noise_ids = collect_all_ids('/m/096m7z')
    for item in meta_info.iterrows():
        renamed_yt_id = item[1]['YTID'].replace('-', 'XX')
        tag_id = item[1]['positive_labels'].split(',')
        is_human_sample = any([i in human_ids for i in tag_id])
        is_noise_sample = any([i in noise_ids for i in tag_id])
        if is_human_sample:
            human_filenames.append(renamed_yt_id)
        else:
            nonhuman_filenames.append(renamed_yt_id)
        if is_noise_sample:
            if is_human_sample:
                noise_human_filenames.append(renamed_yt_id)
            else:
                noise_nonhuman_filenames.append(renamed_yt_id)
    list_to_txt(human_filenames, 'unbalanced_audioset_speech_ids.txt')
    list_to_txt(nonhuman_filenames, 'unbalanced_audioset_nonspeech_ids.txt')
    list_to_txt(noise_human_filenames, 'unbalanced_audioset_noise_human_ids.txt')
    list_to_txt(noise_nonhuman_filenames, 'unbalanced_audioset_noise_nonhuman_ids.txt')
    print(len(human_filenames), len(nonhuman_filenames), len(noise_human_filenames), len(noise_nonhuman_filenames))
    # balanced set:   2351   19809     32   96
    # unbalanced set: 128997 1912792  732 1252

def gnu_further():
    my_filenames = txt_to_list('filenames_all.txt')
    nonspeech_filenames = set(txt_to_list('balanced_audioset_nonspeech_ids.txt'))
    # import ipdb; ipdb.set_trace()
    print(len(my_filenames))
    my_filenames = [f for f in my_filenames
                    if f.split('/')[1].replace('.wav', '') in nonspeech_filenames]
    print(len(my_filenames))
    list_to_txt(my_filenames, 'filenames.txt')
    list_to_txt(my_filenames[:-500], 'filenames_train.txt')
    list_to_txt(my_filenames[-500:], 'filenames_test.txt')
    # 2019-10-24. ugh, out of 13549, now I only use 12147 non-speech ones as
    # noise signals. Meaning, ~10% of them were speech.
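The helpers above (get_audio_info, collect_all_ids, list_to_txt, txt_to_list) are project-specific, but the core partitioning logic can be sketched self-containedly over plain dicts. The label ids in the test below are made up for illustration, except /m/096m7z, which the code above uses for the noise class:

```python
def partition_by_labels(rows, human_ids, noise_ids):
    """Split AudioSet metadata rows into human/non-human and noise buckets.

    rows: iterable of dicts with 'YTID' and 'positive_labels' (comma-separated
    label ids). Mirrors the logic above, including rewriting '-' to 'XX' so the
    YouTube id is safe to use as a filename.
    """
    human, nonhuman, noise_human, noise_nonhuman = [], [], [], []
    for row in rows:
        renamed = row['YTID'].replace('-', 'XX')
        tags = row['positive_labels'].split(',')
        is_human = any(t in human_ids for t in tags)
        is_noise = any(t in noise_ids for t in tags)
        (human if is_human else nonhuman).append(renamed)
        if is_noise:
            (noise_human if is_human else noise_nonhuman).append(renamed)
    return human, nonhuman, noise_human, noise_nonhuman
```

Note that every clip lands in exactly one of human/nonhuman, while the noise buckets are a second, overlapping partition, which is why the counts in the comments above don't sum across all four lists.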
The number of downloaded AudioSet files can be reduced by license issues (youtube-dl can only download videos without license restrictions) or by network issues.
After running the experiments again: AudioSet definitely helps inference quality. And the model is trained with Voicebank.
Hmm.. From another angle: can the inference code generate similar results with the pre-trained checkpoint? If it can generate samples correctly, that would confirm the Voicebank processing code works.
OK, Voicebank only. Thanks! About 1-2 months ago I tried to load your checkpoint, but something went wrong. I think it was a layer-name mismatch, but I don't remember exactly. I'll try it soon and let you know.
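Layer-name mismatches when loading a checkpoint can often be fixed by rewriting key prefixes in the state dict before loading. A framework-agnostic sketch over a plain dict of keys (the `module.` prefix below is just the typical example left behind by DataParallel-style wrappers, not something confirmed about this checkpoint):

```python
def remap_state_dict_keys(state_dict, rename):
    """Return a copy of state_dict with key prefixes rewritten.

    rename: dict mapping old prefix -> new prefix, applied to each key;
    the first matching prefix wins, unmatched keys pass through unchanged.
    """
    out = {}
    for key, value in state_dict.items():
        for old, new in rename.items():
            if key.startswith(old):
                key = new + key[len(old):]
                break
        out[key] = value
    return out
```

With a real model you would then pass the remapped dict to the framework's loader; diffing the key sets of the checkpoint and the freshly built model is usually the quickest way to see what the mismatch actually is.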
Ah, the model is not only trained on Voicebank; the AudioSet augmentation is also included!
Hmm.. I think the pre-trained checkpoint should be tested in another environment. If you cannot generate similar results with the pre-trained checkpoint, please tell me! Thanks for checking this out in detail.
Ok thanks! I'll try some stuff (loading the model, etc) and get back to you.
It seems difficult to reproduce the experimental results. Once my desktop is idle, I will refactor this repository to make the model easier to use and train. I'll close this issue and post an update here at that time.
Hi, after asking a few questions I'm still not 100% sure how the released model was trained (the one used for these samples: https://drive.google.com/open?id=1CafFnqWn_QvVPu2feNLn6pnjRYIa_rbP).
Was it, in the end, AudioSet or not? And it was conv1d, not conv2d, right? I'd really love to reproduce it but can't quite figure it out. Can you kindly provide more details? Thanks :)