JoseMMontoro opened this issue 4 years ago
Thank you very much @JoseMMontoro for this, really appreciate your note 🙂
1. This is my most general question - I'd like to understand if, at this point, our most immediate goal is to purely reproduce the model that the authors of the paper trained. Is that the case? If so, where are the complications that you have pointed out coming from? Is it because their model architecture, the training steps, and the specific training data are not completely defined in the paper? I.e., are there aspects of the training process that we're not sure how the authors implemented?
Our immediate goal is to purely reproduce the results. The difficulties can be due either to a bug somewhere in our code or to the paper not being thorough in describing the architecture and how the model was trained.
Based on literature review, this is the only paper that discusses (and succeeds at) creating semantic embeddings from audio. The results are extremely good and so I would like to do my best to reproduce the results and be able to abstract the method, so that we can start applying it to other datasets.
2. In particular, do we know what the specific 500 hours of audio they use for training are? Which subset of data (if any) did you choose from all Librispeech audio data for training?
From email communication with one of the authors, we learned that the train-clean-100 and train-clean-360 subsets of Librispeech were used.
3. How is the representation of the audio features created? I feel like that's a crucial part of the process, and there's very little mention of it in the paper. They mentioned they use MFCCs but don't go too deep into how it was implemented, how they evaluated the resulting representation (is it vectors?), etc. What does the sequence of acoustic features used for training look like? Can you expand on this aspect?
I agree this is one of the components of the pipeline that warrants a deeper look. We extract the MFCC features using this off-the-shelf package. From reading papers in this space, we observed that the authors consistently use the same parameters for generating the representation. Interestingly, this method of generating MFCC features is the default one supported by the referenced package.
Below I am including an example of a representation (first 20 temporal steps). On the vertical axis, we have the 13 MFCC features. On the horizontal axis, we have the first 20 temporal steps (the examples I am training on now have at most 69 temporal steps; if an utterance is shorter, it gets zero-padded).
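For concreteness, here is a minimal sketch of how such a 13×T MFCC representation can be extracted and zero-padded, assuming the python_speech_features package (the package and parameters actually used in the repo may differ, and the file name is just a placeholder):

```python
# Minimal sketch of MFCC extraction, assuming the python_speech_features
# package; the actual package/parameters in the repo may differ.
import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc

samplerate, signal = wavfile.read("utterance.wav")  # hypothetical file

# Defaults: 25 ms window, 10 ms step, 13 cepstral coefficients per frame.
features = mfcc(signal, samplerate=samplerate, numcep=13)
# features has shape (num_frames, 13) -- one row per temporal step

# Zero-pad (or truncate) every utterance to a fixed length, e.g. 69 steps.
max_steps = 69
padded = np.zeros((max_steps, 13))
n = min(len(features), max_steps)
padded[:n] = features[:n]
```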
4. How are the different "acoustic realizations" of each word (meaning: different ways of pronouncing a word) averaged? Where in the code does this happen?
The code in this section does the averaging. We first collect the unique utterances in the dataset (any word can have multiple utterances). We then run the encoder on each utterance to produce embeddings, group the embeddings by their corresponding word, and take the mean.
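As a rough illustration of that grouping and averaging step (the names below are hypothetical stand-ins, not the repo's actual objects):

```python
# Sketch of averaging embeddings over the acoustic realizations of a word.
from collections import defaultdict
import numpy as np

def average_embeddings(utterances, encoder):
    """utterances: iterable of (word, mfcc_features) pairs, one per realization.
    encoder: callable mapping mfcc_features -> 1-D embedding vector."""
    per_word = defaultdict(list)
    for word, features in utterances:
        per_word[word].append(encoder(features))
    # the final embedding of a word is the mean over all of its realizations
    return {w: np.mean(vs, axis=0) for w, vs in per_word.items()}
```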
5. Are we able to measure the variance between each realization of the same word present in the dataset, same way the authors did in 3.5?
We don't have it implemented yet. I agree this would be super interesting to do - I haven't worked on it because we rely on other ways to verify the quality of the trained embeddings: essentially, we evaluate pretrained speech2vec embeddings (that the authors of the paper share on GitHub) alongside ours and compare the results.
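If we do want it, one simple way to measure it (not in the repo, and not necessarily the exact metric from section 3.5 of the paper) would be the mean per-dimension variance across each word's realizations:

```python
# Idea (not in the repo yet): variance across the acoustic realizations of
# each word, computed from per-word lists of realization embeddings.
import numpy as np

def realization_variance(embeddings_per_word):
    """embeddings_per_word: dict mapping word -> list of embedding vectors,
    one vector per acoustic realization of that word."""
    return {
        word: float(np.var(np.stack(vectors), axis=0).mean())
        for word, vectors in embeddings_per_word.items()
        if len(vectors) > 1
    }
```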
6. I'm not sure what you refer to by "word pairs" on item 6 in #6. My suggestion was to train embeddings from Librispeech text data, using the word2vec architecture, and compare to the embeddings trained on audio. The authors of the paper did exactly that to compare performance (I was very surprised to see that the embeddings trained on audio perform better on the intrinsic benchmarks! That's very exciting), and they don't mention word pairs as far as I can tell. Can you tell me what you meant by word pairs? Also, could we try to train a vanilla word2vec model just with the full transcriptions of our training data?
I believe the word pairs are implied when they say 'train using the skipgram method'. It is also something that was explained to me via email by one of the authors. Here is a fragment of the communication that speaks to this:
The seq2seq model always takes one segment as input and outputs one segment. Here's an example utterance: "how are you", and assume we are training speech2vec with skipgram with a window size of 2, we first need to generate all (input, output) pairs, which are (how, are), (how, you), (are, how), (are, you), (you, how), and (you, are). In this case the utterance would produce 6 training examples.
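In code, that pair generation looks roughly like this (a small sketch reproducing the example from the email):

```python
# Generate skipgram (input, output) word pairs with a window of 2,
# reproducing the "how are you" example from the email above.
def skipgram_pairs(words, window=2):
    pairs = []
    for i, source in enumerate(words):
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((source, words[j]))
    return pairs

print(skipgram_pairs(["how", "are", "you"]))
# [('how', 'are'), ('how', 'you'), ('are', 'how'), ('are', 'you'),
#  ('you', 'how'), ('you', 'are')]
```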
I haven't attempted training text embeddings on the Librispeech corpus. It seems the authors used the fastText implementation. I trained very simple text embeddings here. The thought was to verify that the word pairs we create are at least somewhat reasonable and that the verification method is working. The results of the experiment support both of these notions.
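Something along these lines would be one way to train such a simple text baseline (a gensim sketch, not the exact code from the notebook; `transcripts.txt` is a placeholder for a file of Librispeech transcripts, one per line):

```python
# Sketch of a simple text-embedding baseline with gensim (>= 4.0); not the
# exact code used in the notebook.
from gensim.models import Word2Vec

with open("transcripts.txt") as f:                # placeholder transcript file
    sentences = [line.lower().split() for line in f]

model = Word2Vec(
    sentences,
    vector_size=50,   # same dimensionality as the speech2vec embeddings
    window=2,         # same window size as the skipgram pairs above
    sg=1,             # skipgram
    min_count=1,
)
print(model.wv["you"].shape)  # (50,)
```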
7. Since the authors made the embeddings they trained available here, it could be fun to compare them to some other pre-trained models (trained on text), like the ones by SpaCy. Would this be useful?
Absolutely! The more we understand about embeddings and the LibriSpeech dataset, the better positioned we will be to address training semantic embeddings on this and other datasets.
8. I don't think I saw Teacher Forcing mentioned in the paper, but you mentioned in #6 that we know it was used - did you get that info from the authors? How do you think it affects training?
This was something that was revealed to me via email communication with one of the authors. I believe it to be of fundamental importance. I played a little bit with the architecture from the paper as an autoencoder, and with an embedding size this small (50, for instance) I suspect the model would lack the capacity to do well even on the reconstruction task. Plus, here we get to bombard the model with an essentially randomly chosen target word for a given source word. For instance, if the training set contains the word pairs ('you', 'awesome'), ('you', 'are'), ('you', 'fast'), the model has no way of telling which word it needs to predict when presented with 'you' as the source word. Teacher forcing makes a big difference here, and I suspect it is the interplay between teacher forcing and the architecture that leads to training semantic embeddings.
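To make the mechanism concrete, here is a minimal, purely illustrative decoder loop with teacher forcing (PyTorch; the names and shapes are mine, not the repo's actual classes):

```python
# Illustrative teacher-forcing decoder step; not the repo's implementation.
import torch
import torch.nn as nn

decoder_rnn = nn.LSTM(input_size=13, hidden_size=50, batch_first=True)
out_proj = nn.Linear(50, 13)

def decode(encoder_state, target_frames, teacher_forcing=True):
    # encoder_state: (h, c), each of shape (1, batch, 50), from the encoder
    # target_frames: (batch, steps, 13) MFCC frames of the target word
    batch, steps, _ = target_frames.shape
    inp = torch.zeros(batch, 1, 13)        # start-of-sequence frame
    state, outputs = encoder_state, []
    for t in range(steps):
        out, state = decoder_rnn(inp, state)
        pred = out_proj(out)               # predicted next MFCC frame
        outputs.append(pred)
        # with teacher forcing, the next input is the ground-truth frame
        # instead of the decoder's own (possibly wrong) prediction
        inp = target_frames[:, t : t + 1] if teacher_forcing else pred
    return torch.cat(outputs, dim=1)       # (batch, steps, 13)
```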
9. What's the final word embeddings' dimension? The authors talk about 50-dim being enough, is that what you chose?
Yes, we use embeddings of dimensionality 50.
10. I was recently reading about the Transformer and how attention can outperform RNNs and LSTMs. Do you think it would be worth implementing an architecture similar to that of the Transformer? On the other hand, I think this would be something to explore after we've been able to emulate the results in the speech2vec paper with the same architecture.
Absolutely! 🙂 We also envision the speech2vec architecture to be a starting point. Removing the RNN from the encoder and decoder does sound very tempting.
Ok, I think that's everything! 😅 Thanks for your patience!!
Appreciate the opportunity to discuss this 🙏
Hi @radekosmulski , thanks a lot for your answers! It clarifies a lot.
One follow-up question I have is whether the authors confirmed that they used a window size of 2 for skipgram training. Would it make sense to try a different N instead of two? Not sure how it would affect the results.
I'm still very curious about training a word2vec model on the same Librispeech text as we have audio for, and then comparing it to the text-based pretrained embeddings the speech2vec authors provided. Hopefully I'll have some time to dive into that soon!
Thanks again for your time!
One follow-up question I have is whether the authors confirmed that they used a window size of 2 for skipgram training. Would it make sense to try a different N instead of two? Not sure how it would affect the results.
Yes, I believe they used the same formulation as we did, that is, they took the two preceding and the two subsequent words as targets. This is definitely an important area to focus on - if there is something not right with the data preprocessing, that could certainly affect the results.
Really appreciate the chance to have this conversation, thank you @JoseMMontoro 🙏
Hi @radekosmulski , I opened #8 with a bit of an experiment. As mentioned in the PR, I was trying to reproduce the text embeddings they trained for the paper (they call them 'word2vec' in the paper), which are available on the repo linked in the PR. I wanted to train our own text embeddings and see if they end up being the same as the ones they shared.
I made a quick comparison using the benchmarks you implemented on notebook #10. Can you help me interpret those results? I'm not sure if that information is enough to tell if the two sets of embeddings are close in their representation of the training data (which is my ultimate goal).
Additionally, it would be great to implement the same benchmarks the authors of speech2vec use in the 3.3 (Evaluation) section of their paper. Have those benchmarks been reproduced in this repo? If not, I can go ahead and try to implement them. I'm referring to this section:
We used 13 benchmarks [30] to measure word similarity, including WS-353 [31], WS-353-REL [32], WS-353-SIM, MC-30 [33], RG-65 [34], Rare-Word [35], MEN [36], MTurk-287 [37], MTurk-771 [38], YP-130 [31], SimLex999 [39], Verb-143 [40], and SimVerb-3500 [41].
Please let me know if this is helpful or you think there's another area to investigate that could be more useful. I got started with this because text embeddings are what I'm most familiar with :)
Hey @JoseMMontoro. Evaluating the speech2vec word embeddings from the paper was one of the first things I worked on here. I think I just ended up using this repo, `web`, and their `evaluate_on_all` function. I just went over the benchmarks you highlighted that were done in the paper and compared them to the ones done using the `web` repo to see what we had covered.
There is an MTurk dataset in there, but I'm not sure which one or whether it's a combination of both. Not sure if we want to get benchmarks on the speech2vec embeddings for those outstanding datasets?
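For reference, the `web` usage looks roughly like this (the vector file path is a placeholder, and the loading options here may not match the notebook exactly):

```python
# Rough sketch of evaluating a set of word vectors with the `web` repo;
# the file path is a placeholder and the options may differ from the notebook.
from web.embeddings import load_embedding
from web.evaluate import evaluate_on_all

embeddings = load_embedding(
    "speech2vec_50d.txt",   # placeholder: pretrained vectors in word2vec text format
    format="word2vec",
    normalize=True,
    lower=True,
    clean_words=False,
)
results = evaluate_on_all(embeddings)  # similarity/analogy/categorization scores
print(results)
```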
Also, any ideas on the best way to look at a notebook that's in a pull request, like yours? All that JSON hurts my brain.
I unfortunately haven't found a good way of looking at PRs apart from following this rather elaborate process of checking them out locally outlined here.
On these semantic tasks, the results are correlation coefficients ranging from -1 to 1. Unfortunately, our audio embeddings never went past 0.1, while text embeddings trained in a relatively naive way achieve 0.15 on the MEN task and 0.12 on WS-353, with no progress on SimLex-999.
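For intuition on what those numbers mean: each similarity benchmark is typically scored as the Spearman rank correlation between the human ratings and the cosine similarities the embeddings assign to the same word pairs, e.g. (made-up numbers):

```python
# Tiny illustration of how similarity benchmarks are scored: Spearman rank
# correlation between human ratings and the model's cosine similarities.
from scipy.stats import spearmanr

human_ratings = [9.0, 7.5, 1.2]   # e.g. WS-353-style judgments for 3 word pairs
model_sims = [0.80, 0.55, 0.10]   # cosine similarities from our embeddings

rho, _ = spearmanr(human_ratings, model_sims)
print(rho)  # 1.0 here, because the two rankings agree perfectly
```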
@ShaunSpinelli Awesome, thanks for working on the benchmarks! I can get started with the ones covered in the `web` repo, get the speech2vec text embedding results, make sure they match the ones in the paper, and then compare them to the ones I just trained.
So, if I understand correctly, the benchmark results can be interpreted as the correlation coefficient achieved by the trained embeddings - the higher, the better. Is that right?
Is there an existing implementation that allows to retrieve correlation coefficients between two sets of embeddings, i.e. how "similar" they are to each other?
Is there an existing implementation that allows to retrieve correlation coefficients between two sets of embeddings, i.e. how "similar" they are to each other?
I am unfortunately not aware of a standalone implementation of this. The library I use to run the three evaluation tasks must implement something like it internally, but I have not looked closely at whether they make it easily accessible.
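One simple thing we could roll ourselves (just an idea, not something that exists in the repo): sample word pairs from the shared vocabulary, compute the cosine similarity each embedding set assigns to every pair, and correlate the two lists of similarities:

```python
# Idea (not in the repo): measure agreement between two embedding sets by
# correlating the cosine similarities they assign to the same word pairs.
import random
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def embedding_agreement(emb_a, emb_b, n_pairs=1000, seed=0):
    """emb_a, emb_b: dicts mapping word -> 1-D numpy array."""
    shared = sorted(set(emb_a) & set(emb_b))
    rng = random.Random(seed)
    pairs = [(rng.choice(shared), rng.choice(shared)) for _ in range(n_pairs)]
    sims_a = [cosine(emb_a[a], emb_a[b]) for a, b in pairs]
    sims_b = [cosine(emb_b[a], emb_b[b]) for a, b in pairs]
    rho, _ = spearmanr(sims_a, sims_b)
    return rho
```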
Hi @radekosmulski , I figured I'd open a new issue to discuss the paper itself so we can keep using #6 for your updates only.
Forgive me in advance if some of my questions have already been discussed/explained elsewhere. I hope this discussion is still productive, even if only to clarify some concepts in the paper for future people contributing to the project.
1. This is my most general question - I'd like to understand if, at this point, our most immediate goal is to purely reproduce the model that the authors of the paper trained. Is that the case? If so, where are the complications that you have pointed out coming from? Is it because their model architecture, the training steps, and the specific training data are not completely defined in the paper? I.e., are there aspects of the training process that we're not sure how the authors implemented?
2. In particular, do we know what the specific 500 hours of audio they use for training are? Which subset of data (if any) did you choose from all Librispeech audio data for training?
3. How is the representation of the audio features created? I feel like that's a crucial part of the process, and there's very little mention of it in the paper. They mentioned they use MFCCs but don't go too deep into how it was implemented, how they evaluated the resulting representation (is it vectors?), etc. What does the sequence of acoustic features used for training look like? Can you expand on this aspect?
4. How are the different "acoustic realizations" of each word (meaning: different ways of pronouncing a word) averaged? Where in the code does this happen?
5. Are we able to measure the variance between each realization of the same word present in the dataset, same way the authors did in 3.5?
6. I'm not sure what you refer to by "word pairs" on item 6 in #6. My suggestion was to train embeddings from Librispeech text data, using the word2vec architecture, and compare to the embeddings trained on audio. The authors of the paper did exactly that to compare performance (I was very surprised to see that the embeddings trained on audio perform better on the intrinsic benchmarks! That's very exciting), and they don't mention word pairs as far as I can tell. Can you tell me what you meant by word pairs? Also, could we try to train a vanilla word2vec model just with the full transcriptions of our training data?
7. Since the authors made the embeddings they trained available here, it could be fun to compare them to some other pre-trained models (trained on text), like the ones by SpaCy. Would this be useful?
8. I don't think I saw Teacher Forcing mentioned in the paper, but you mentioned in #6 that we know it was used - did you get that info from the authors? How do you think it affects training?
9. What's the final word embeddings' dimension? The authors talk about 50-dim being enough, is that what you chose?
10. I was recently reading about the Transformer and how attention can outperform RNNs and LSTMs. Do you think it would be worth implementing an architecture similar to that of the Transformer? On the other hand, I think this would be something to explore after we've been able to emulate the results in the speech2vec paper with the same architecture.
Ok, I think that's everything! 😅 Thanks for your patience!!