VoxCeleb is a good one; could you be more specific about what issues you are having? From what I know, you need mono wav files, for one, and the shape needs to be 2-dimensional.
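For example, something like this will downmix a file and sanity-check its shape (a minimal sketch, assuming scipy; the filename is a placeholder):

```python
# Read a wav, downmix stereo to mono, and check the shape.
from scipy.io import wavfile

rate, signal = wavfile.read("some_utterance.wav")  # placeholder path
if signal.ndim == 2:
    # Stereo: average the two channels to get mono.
    signal = signal.mean(axis=1).astype(signal.dtype)

print(rate, signal.shape)  # mono signal is 1-D here; the extracted features are 2-D
```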
Does it work for you? I'm checking the files, and some of the audio is noisy even though the interviewee talks most of the time (the interviewer talking, a guitar playing, etc.). I made it run with a batch size of 3 and it gives me a train accuracy of 100 or 0, no middle values. With larger batches I don't have much more luck. I know the dataset needs Voice Activity Detection to remove silence in order to be effective; maybe that's it. What algorithm did you use? I'd also like to know whether there are constraints on the 'quality' of the audio files.
@loregagliard
I made it run with a batch size of 3 and it gives me a train accuracy of 100 or 0, no middle values.
I have the same issue, and it comes from the feature map values. Did you use input_feature.py published with the project as is? If you did, then the problem is in the features coming out of input_feature.py. I think (correct me if I'm wrong) that's because it uses the log-energy, which most likely produces negative values.
Please let me know once you solve this issue since I'm stuck with it.
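In case it helps, here is a minimal sketch for inspecting the feature range and normalizing it, assuming speechpy (which, if I remember correctly, input_feature.py is built on); the wav path and parameters are illustrative:

```python
# Extract log Mel filterbank energies and inspect their range;
# negative values are normal for log-energies. Mean-variance
# normalization rescales them per utterance.
import speechpy
from scipy.io import wavfile

rate, signal = wavfile.read("some_utterance.wav")  # placeholder path
logenergy = speechpy.feature.lmfe(signal, sampling_frequency=rate,
                                  frame_length=0.025, frame_stride=0.01,
                                  num_filters=40)
print(logenergy.min(), logenergy.max())

# Per-utterance cepstral mean-variance normalization.
normalized = speechpy.processing.cmvn(logenergy, variance_normalization=True)
```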
Hi guys, if you read the paper, the author does apply VAD; however, if I'm not mistaken, he stated it was done in Matlab. You will be able to find some VAD solutions in Python, but they do not produce good results. My advice is not to worry about the VAD; the models will work without it. Please try out the Keras implementation here -> https://github.com/imranparuk/speaker-recognition-3d-cnn and see if that works for you. It's a work in progress; if it works for you, it will help you understand what is being accomplished in this repository.
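If you do still want a quick Python VAD, a minimal sketch with the webrtcvad package (pip install webrtcvad) could look like this; it expects 16-bit mono PCM at 8/16/32/48 kHz, and the path and frame length here are illustrative:

```python
# Keep only the 30 ms frames that webrtcvad classifies as speech.
import numpy as np
import webrtcvad
from scipy.io import wavfile

rate, signal = wavfile.read("some_utterance.wav")  # 16-bit mono, e.g. 16 kHz
vad = webrtcvad.Vad(2)                             # aggressiveness 0..3

frame_len = int(rate * 0.03)                       # 30 ms per frame
voiced = [signal[i:i + frame_len]
          for i in range(0, len(signal) - frame_len + 1, frame_len)
          if vad.is_speech(signal[i:i + frame_len].tobytes(), rate)]
signal_vad = np.concatenate(voiced) if voiced else signal
```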
@MSAlghamdi
Did you use input_feature.py published with the project as is?
Yes, the input_feature.py file is the same as the project's. I just added code to generate the hdf5 files, 'development_sample_dataset_speaker.hdf5' and 'enrollment-evaluation_sample_dataset.hdf5'; I took that code from other issue discussions here.

Just to be completely clear: VoxCeleb appears to me as a directory containing one directory per identity. Each of those speaker directories contains sub-directories which hold the wav files (finally!). So I generated the dataset by copying the audio files and prepending the names of the speaker directory and the sub-directory to each file name. The speaker labels were generated by mapping the speaker-directory names through the ASCII table and then reindexing them to 0, 1, 2, 3, ... . The audio files range in duration from a few seconds to minutes; should I maybe merge the audio of each speaker? Anyway, I chose the even-numbered files as my training set and the odd-numbered ones as my testing set, so that each speaker has a sufficient number of audio files. Is there a way to feed just one feature to the network and see what the outcome is?
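Roughly, the labeling step I described looks like this (a simplified sketch; the hdf5 layout here is illustrative, not necessarily the exact format the repo expects):

```python
# Walk the VoxCeleb tree (speaker_dir/session_dir/*.wav), reindex the
# speakers to 0,1,2,..., and store file paths and labels in an hdf5 file.
import os
import h5py

root = "voxceleb_wav"  # placeholder root directory
speakers = sorted(os.listdir(root))
label_of = {spk: i for i, spk in enumerate(speakers)}  # reindex to 0,1,2,...

paths, labels = [], []
for spk in speakers:
    for session in sorted(os.listdir(os.path.join(root, spk))):
        for wav in sorted(os.listdir(os.path.join(root, spk, session))):
            paths.append(os.path.join(root, spk, session, wav))
            labels.append(label_of[spk])

with h5py.File("development_sample_dataset_speaker.hdf5", "w") as f:
    f.create_dataset("label", data=labels)
    f.create_dataset("file_path", data=paths,
                     dtype=h5py.special_dtype(vlen=str))
```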
Thank you! (And happy new year!!)
Dear all,
Please refer to the PyTorch implementation, which uses the VoxCeleb dataset.
Hi guys, I have a question regarding the input wav files used for training: what are the audio format specifications? I used VoxCeleb ( http://www.robots.ox.ac.uk/~vgg/data/voxceleb/ ) as the dataset, but it is giving me some trouble. Do you know of any other usable dataset?
Thank you ;)