HarryVolek / PyTorch_Speaker_Verification

PyTorch implementation of "Generalized End-to-End Loss for Speaker Verification" by Wan, Li et al.
BSD 3-Clause "New" or "Revised" License

feature extraction #3

Closed: ramesh720 closed this issue 5 years ago

ramesh720 commented 5 years ago

Sir, can you explain how you extracted the features for every utterance? And for one speaker the features come out as (12, 40, 180); can you explain the dimensions?

HarryVolek commented 5 years ago

The WAV files in the TIMIT dataset are organized by speaker.

The preprocessing script loads a WAV file and segments the resulting array into periods of non-silence. If the resulting segment length is less than a threshold defined in the script, it is discarded.
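
A minimal sketch of that load-and-split step, assuming librosa is used for loading and silence detection; the file name, top_db value, and minimum-length threshold below are illustrative, not the script's actual values:

    import librosa

    sr = 16000                   # TIMIT sample rate
    min_len = int(1.8 * sr)      # assumed minimum segment length, in samples

    # Path is a placeholder for one TIMIT utterance.
    wav, _ = librosa.load('SA1.WAV', sr=sr)

    # Split the waveform into intervals of non-silence.
    intervals = librosa.effects.split(wav, top_db=30)

    # Discard segments shorter than the threshold.
    segments = [wav[start:end] for start, end in intervals
                if end - start >= min_len]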

The remaining segments are transformed into mel spectrograms of shape (40 (mel energies), N (time steps)). The first and last 180 time steps of each spectrogram are saved.
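
A rough sketch of that transform, assuming librosa's mel features; the STFT parameters (n_fft, hop_length) are assumptions, not the script's actual values:

    import numpy as np
    import librosa

    nmels, tisv_frame = 40, 180      # match the config.yaml entries quoted below

    spectrograms = []
    for seg in segments:             # `segments` from the sketch above
        # (40, N) log mel spectrogram: 40 mel energies per time step.
        S = librosa.feature.melspectrogram(y=seg, sr=sr, n_fft=400,
                                           hop_length=160, n_mels=nmels)
        S = np.log10(S + 1e-6)
        if S.shape[1] >= tisv_frame:
            spectrograms.append(S[:, :tisv_frame])    # first 180 time steps
            spectrograms.append(S[:, -tisv_frame:])   # last 180 time steps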

The script goes through all WAV files belonging to a single speaker and concatenates the mel spectrograms into a single numpy ndarray. The ndarray is saved as a .npy file.
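
Continuing the sketch, the aggregation and save step might look like this (the output file name is an assumption):

    import numpy as np

    # Stand-ins for the (40, 180) segments collected above; a real run would
    # use the `spectrograms` list from the previous sketch.
    spectrograms = [np.zeros((40, 180), dtype=np.float32) for _ in range(12)]

    speaker_array = np.stack(spectrograms)    # shape: (12, 40, 180)
    np.save('speaker0.npy', speaker_array)    # one .npy file per speaker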

For the first speaker, the saved ndarray has dimensions (12, 40, 180): 12 spectrogram segments (obtained from every TIMIT WAV file associated with the speaker), each with 40 mel energies per time step and 180 time steps in length.
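
Loading the file back makes that layout easy to check (file name as assumed in the sketch above):

    import numpy as np

    feats = np.load('speaker0.npy')
    print(feats.shape)   # (12, 40, 180): segments x mel energies x time steps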

The number of mel energies and spectrogram time steps can be modified in config.yaml by changing the following entries:

    nmels: 40 # Number of mel energies
    tisv_frame: 180 # Max number of time steps in input after preprocessing
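
For example, those entries could be read back with PyYAML; this sketch assumes they sit at the top level of config.yaml, which may not match the repo's actual layout:

    import yaml

    with open('config.yaml') as f:
        cfg = yaml.safe_load(f)

    print(cfg['nmels'], cfg['tisv_frame'])   # 40 180 with the defaults above
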
ramesh720 commented 5 years ago

Thank you, sir.

ramesh720 commented 5 years ago

I am trying to apply this code to NIST data. After data preprocessing I am getting (0,), having set tisv_frame: 15000. Each utterance is 5 minutes long, about 30,000 frames.