Thank you very much for this, Jose! This is looking great. As you suggest, there is a bit of room for improvement on the metrics, and the question is: why the discrepancy in performance? It could be that the authors did something more elaborate on the training side, or trained for longer, in which case that would be all good. But it could just as well be that they reshaped the training data in some way different from what we are doing (this could impact speech2vec training as well), or that they applied some preprocessing to the embeddings after training to make them more amenable to the evaluation (the latter is quite unlikely).
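One cheap way to test that last (unlikely) hypothesis would be to post-process our own vectors, e.g. mean-center and L2-normalize them, and re-run the evaluation to see if the scores move. A minimal sketch, assuming our vectors are saved in plain word2vec text format; the file names here are made up:

```python
import numpy as np
from gensim.models import KeyedVectors

# Hypothetical path to our trained vectors, in word2vec text format
kv = KeyedVectors.load_word2vec_format("our_speech2vec_50d.txt", binary=False)

vecs = kv.vectors.astype(np.float64)
vecs -= vecs.mean(axis=0)                            # mean-center the space
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # scale each vector to unit L2 norm
kv.vectors = vecs.astype(np.float32)

# Save the post-processed vectors so the evaluation can be re-run on them
kv.save_word2vec_format("our_speech2vec_50d_postprocessed.txt")
```

If the benchmark scores are unchanged after this, we can probably rule post-processing out and focus on the training side.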
Big questions that can potentially help us with speech2vec training! Extremely grateful for the work you have done 🙏
The goal of this exercise is to further understand where our implementation differs from the speech2vec authors' implementation.
In the paper, they mention training text-based word embeddings with the fasttext implementation, to compare against their audio-based embeddings (see Sections 3.3 and 3.4 of the paper). They also make both the audio-based and text-based vectors available at https://github.com/iamyuanchung/speech2vec-pretrained-vectors; a sketch of loading them for comparison follows below.
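For a quick side-by-side sanity check, their pretrained files can be loaded with gensim next to ours, assuming both are in plain word2vec text format (which the repo's files appear to be). The file paths and probe words below are illustrative guesses:

```python
from gensim.models import KeyedVectors

# Hypothetical paths: the authors' pretrained 50-d vectors and our own
theirs = KeyedVectors.load_word2vec_format("speech2vec/50.vec", binary=False)
ours = KeyedVectors.load_word2vec_format("our_speech2vec_50d.txt", binary=False)

# Compare nearest neighbors for a few probe words to eyeball the gap
for word in ["night", "water", "family"]:
    print(word)
    print("  theirs:", [w for w, _ in theirs.most_similar(word, topn=5)])
    print("  ours:  ", [w for w, _ in ours.most_similar(word, topn=5)])
```

Large differences in neighbor quality here would point at training rather than evaluation as the source of the discrepancy.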
This code does the following: