galv / lingvo-copy


Gender and Accent Recognition Brainstorm #13

Open galv opened 3 years ago

galv commented 3 years ago

After a performant forced alignment pipeline is done, my next thought goes to how to add gender and accent recognition.

First of all, I will assume that each segment output by the forced aligner contains the voice of only one speaker. The forced alignment system depends upon voice activity detection to find silent regions of the audio, so this is a decent assumption, except in the case of interruptions (where one speaker talks over another without waiting for the first speaker to stop).

It would be straightforward to treat these as supervised learning tasks, but I have other ideas.

Librivox and Common Voice both contain gender labels. As I understand it, Common Voice also contains either region or accent metadata for speakers. It may also be possible to detect, e.g., Spanish-accented English by using Spanish speech from native speakers, even if the region or accent metadata is poor. Finally, Common Voice and Librivox both contain "clean" audio, so it is probably worthwhile to use SpecAugment in whatever training process we adopt to help generalization on archive.org data (I've seen recordings where it is raining, etc.).
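To make the SpecAugment idea concrete, here is a minimal masking-only sketch in NumPy (the mask counts and widths are arbitrary placeholders, not tuned values, and the time-warping part of SpecAugment is omitted):

```python
import numpy as np

def spec_augment(spectrogram, num_freq_masks=2, max_freq_width=10,
                 num_time_masks=2, max_time_width=20, rng=None):
    """Zero out random frequency bands and time spans of a (time, freq) spectrogram."""
    rng = rng or np.random.default_rng()
    spec = spectrogram.copy()
    num_frames, num_bins = spec.shape

    # Frequency masking: blank out a few random bands of frequency bins.
    for _ in range(num_freq_masks):
        width = int(rng.integers(0, max_freq_width + 1))
        start = int(rng.integers(0, max(1, num_bins - width)))
        spec[:, start:start + width] = 0.0

    # Time masking: blank out a few random spans of frames.
    for _ in range(num_time_masks):
        width = int(rng.integers(0, max_time_width + 1))
        start = int(rng.integers(0, max(1, num_frames - width)))
        spec[start:start + width, :] = 0.0

    return spec
```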

We can use locality sensitive hashing on the "ivector" of each training-data audio segment (ignoring the text transcript). An ivector is a fixed-length vector summarizing an arbitrary-length piece of audio; it is essentially the mean of a Gaussian distribution. We would use locality sensitive hashing to put the ivectors of the gender- and accent-labeled data into hash buckets.
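As a rough sketch of the bucketing step (random-hyperplane LSH is just one choice of hash family; the dimensionality, number of planes, and the placeholder labeled data below are all made up):

```python
import numpy as np
from collections import defaultdict

ivector_dim = 400   # hypothetical ivector dimensionality
num_planes = 16     # 2**16 possible buckets

rng = np.random.default_rng(0)
planes = rng.standard_normal((num_planes, ivector_dim))

def bucket_key(ivector):
    """Random-hyperplane LSH: the bucket key is the sign pattern of the projections."""
    return ((planes @ ivector) > 0).tobytes()

# Placeholder for real (ivector, label) pairs from Librivox / Common Voice.
labeled_data = [(rng.standard_normal(ivector_dim), ("female", "us-english"))
                for _ in range(100)]

# buckets maps a hash key to the gender/accent labels that landed there.
buckets = defaultdict(list)
for ivector, label in labeled_data:
    buckets[bucket_key(ivector)].append(label)
```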

You can use Euclidean distance or Mahalanobis distance as the distance metric. For these metrics to make sense, all ivectors (which are the means of multivariate Gaussians) need to share the same covariance. I'm not 100% certain, but it seems like you could get this by simply normalizing each spectrogram to zero mean and identity covariance; worth double checking. Otherwise, the Bhattacharyya distance (https://en.wikipedia.org/wiki/Bhattacharyya_distance) may work.
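For reference, these are just the standard definitions for two Gaussians (means mu_1, mu_2; a shared covariance Sigma for Mahalanobis, separate covariances Sigma_1, Sigma_2 for Bhattacharyya):

```latex
d_M(\mu_1, \mu_2) = \sqrt{(\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 - \mu_2)}

D_B = \frac{1}{8} (\mu_1 - \mu_2)^\top \Sigma^{-1} (\mu_1 - \mu_2)
      + \frac{1}{2} \ln \frac{\det \Sigma}{\sqrt{\det \Sigma_1 \det \Sigma_2}},
      \qquad \Sigma = \frac{\Sigma_1 + \Sigma_2}{2}
```

With identity covariance, the Mahalanobis distance reduces to the Euclidean distance, which is why the normalization above matters.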

For unlabeled data, we would also hash the ivectors into the same hash buckets. We would classify an unlabeled ivector by assigning it to the majority accent or gender in that bucket.
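Continuing the sketch above (reusing the hypothetical `bucket_key` and `buckets`; the `min_votes` threshold is an arbitrary choice):

```python
from collections import Counter

def classify(ivector, min_votes=5):
    """Assign the majority gender/accent label of the ivector's bucket,
    or None if the bucket has too few labeled examples to trust."""
    labels = buckets.get(bucket_key(ivector), [])
    if len(labels) < min_votes:
        return None  # ties into the "open world" point below
    return Counter(labels).most_common(1)[0][0]
```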

What is distinct about this from a discriminative or supervised method is that it has an "open world" assumption. For example, we could have a large number of buckets. If there are no or few labeled ivectors in a particular bucket, that suggests this may be an unusual accent or way of speaking compared to the rest of the dataset. For example, it could be someone with a speech impediment or children's speech, which are probably worth capturing in their own right.
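One cheap way to act on that: flag segments that hash into sparsely labeled buckets for manual listening (the threshold here is arbitrary):

```python
def is_unusual(ivector, max_labeled=2):
    """Heuristic: a bucket containing almost no labeled ivectors suggests speech
    unlike anything in the labeled data (unusual accent, children's speech, etc.)."""
    return len(buckets.get(bucket_key(ivector), [])) <= max_labeled
```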

Finally, once you have ivectors, you could conceivably visualize them using t-SNE, or whatever is newer and cooler nowadays.
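A minimal scikit-learn sketch, assuming `ivectors` is a (num_segments, ivector_dim) array and `colors` is a per-segment label code for plotting:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# ivectors: (num_segments, ivector_dim) array; colors: integer label per segment.
embedding = TSNE(n_components=2, init="pca", perplexity=30).fit_transform(ivectors)
plt.scatter(embedding[:, 0], embedding[:, 1], c=colors, s=4)
plt.title("ivectors projected with t-SNE")
plt.show()
```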

galv commented 3 years ago

Using this model https://github.com/tensorflow/models/tree/master/research/audioset for inference on our dataset is a good first pass. It has women's speech, men's speech, and children's speech as labels.

No accent detection, but that's not really the focus of the research group behind AudioSet and YouTube-8M.
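For concreteness, one way to run a first pass is the YAMNet model from that repo, loaded through TensorFlow Hub. This sketch follows YAMNet's published usage (16 kHz mono float32 input, per-frame scores averaged over the clip), but which checkpoint we would actually standardize on is an open question:

```python
import csv
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

model = hub.load("https://tfhub.dev/google/yamnet/1")

# The model ships a CSV mapping class indices to AudioSet display names,
# including "Male speech, man speaking", "Female speech, woman speaking",
# and "Child speech, kid speaking".
with tf.io.gfile.GFile(model.class_map_path().numpy()) as f:
    class_names = [row["display_name"] for row in csv.DictReader(f)]

# waveform: mono float32 at 16 kHz in [-1, 1]; one second of silence as a placeholder.
waveform = np.zeros(16000, dtype=np.float32)
scores, embeddings, spectrogram = model(waveform)

# Average the per-frame scores over the clip and print the top classes.
mean_scores = scores.numpy().mean(axis=0)
for i in mean_scores.argsort()[::-1][:5]:
    print(f"{class_names[i]}: {mean_scores[i]:.3f}")
```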

One thing I would like to emphasize is that this can quickly get out of hand. Even "running a pretrained model" can be complicated if we need to port it to an accelerator.