The training set contains the validation set

tae-jun commented 5 years ago

The dataset split file meta/voxlb2_train.txt contains audio files that are also listed in meta/voxlb2_val.txt. Removing the validation examples reduces the number of training examples from 1,198,728 to 985,290.

I suspect people using this repository are suffering from overfitting because of this split error. Please remove the duplicated examples and re-upload the two split files!

Below is the code I used to remove the duplicates with pandas:

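import pandas as pd

# Each split file has one "path label" pair per line, separated by a space.
df_valid = pd.read_csv('meta/voxlb2_val.txt', sep=' ', names=['path', 'label'])
df_train = pd.read_csv('meta/voxlb2_train.txt', sep=' ', names=['path', 'label'])

# Keep only the training rows whose path does not appear in the validation set.
df_train = df_train[~df_train.path.isin(df_valid.path)]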
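For anyone who wants to check the overlap without pandas, here is a minimal standard-library sketch of the same filtering. It assumes each line of a split file is a path followed by a space and a label, as the read_csv call above implies. The helper names (split_path, remove_overlap) and the toy data are made up for illustration and are not part of the repository.

```python
from typing import Iterable, List, Tuple


def split_path(line: str) -> str:
    """Return the audio path, i.e. the first space-separated field of a split line."""
    return line.split(" ", 1)[0].strip()


def remove_overlap(
    train_lines: Iterable[str], val_lines: Iterable[str]
) -> Tuple[List[str], int]:
    """Drop training lines whose path also appears in the validation split.

    Returns the kept training lines and the number of lines removed.
    """
    # Collect validation paths in a set for fast membership tests.
    val_paths = {split_path(line) for line in val_lines if line.strip()}
    # Ignore blank lines so they do not distort the counts.
    train = [line for line in train_lines if line.strip()]
    kept = [line for line in train if split_path(line) not in val_paths]
    return kept, len(train) - len(kept)


# Toy data standing in for the real split files (paths and labels are made up).
toy_train = ["id001/a.wav 0", "id001/b.wav 0", "id002/c.wav 1"]
toy_val = ["id001/b.wav 0"]

kept, removed = remove_overlap(toy_train, toy_val)
print(removed, kept)
```

To run it on the real files, pass open file objects for the two split files instead of the toy lists and write the kept lines back out. The removed count should then equal the difference between the two totals reported above.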

WeidiXie commented 5 years ago

Thanks for this. Actually, as long as you are not using the VoxCeleb2 test set, it's OK to use any split.

Fan0fan commented 2 years ago

Could you send me meta/voxlb2_val.txt? I couldn't find it; maybe the author deleted it. Looking forward to your reply.

WeidiXie / VGG-Speaker-Recognition

The training set contains the validation set #38