earthspecies / audio-embeddings


our model learns acoustic and not semantic embeddings - how can we address this? #6

Closed radekosmulski closed 2 years ago

radekosmulski commented 3 years ago

Per suggestion from @bs, I am adding a summary of the experiments performed while exploring why our model might not be learning semantic embeddings. The list is not exhaustive, but I will do my best to cover the more major experiments and to do a better job of checking them into the repository.

  1. Now that we know that speech2vec was trained with 100% teacher forcing, will our model train using hyperparameters most likely used during training of speech2vec? (batch size of 32, SGD, learning rate of 1e-3).
    1. The model didn't learn semantic embeddings. I also attempted to train using hyperparameters that seemed better suited for quick experimentation - batch size of 2048, Adam, learning rate of 1e-3 - but got the same result: the loss decreases, yet the model fails to capture semantics (it seems to do well at capturing acoustic features).
  2. Could the model fail to train because it cannot focus on the part of the word that is most useful for discriminating between words that sound similar but have different meanings?
    1. I added attention over the outputs of the encoder - no difference in results.
  3. Maybe the model fails to train because the encoder lacks capacity? Maybe it cannot discern between similarly sounding words and thus clusters them together and learns acoustic features because that is all it can do?
    1. I added layers to the encoder in an attempt to address this. Loss decreases, model learns, but still no indication that the model goes beyond learning acoustic embeddings.
  4. Maybe the task is too easy for the decoder now that we are training with 100% teacher forcing? The decoder learns to predict the next set of mfcc features from the earlier temporal steps and doesn't really need to use the embeddings.
    1. In an attempt to follow on this idea, I trained a model with a much simpler decoder. The hope was that by keeping the decoder simple we might force more information to be encoded in the embeddings. Unfortunately no progress on the task of learning semantic embeddings.
  5. Can we come up with a simpler task to verify that the model can discern between similarly sounding words based on their semantics?
    1. @pcbermant suggested a task that I really liked - I think it is great on several levels. The idea is to see whether our model can discern between the plural and singular forms of nouns. We take the encoder as is and pass the resultant 'embeddings' (the last hidden state of the encoder) to a linear classifier. It turns out that our encoder can deliver on this simplified task! This result suggests that our encoder can discern between similarly sounding words, and it also indicates that our data preprocessing might be okay: the segmentation of the audio is precise enough that the s at the end of the plural form of a noun can be identified.
  6. Are the word pairs that we generate enough to train any embeddings at all? Let's try training text embeddings on the skipgram pairs we generate and evaluate the results.
    1. I performed this experiment based on a suggestion from @JoseMMontoro, and I am very glad I did! I looked at fasttext for training the embeddings, but I couldn't find a way to pass our word pairs to it exactly (we could pass them as one long line of words, but that would not preserve the pairings between words). I hacked together a super simple way of training text embeddings... and it worked! The results are not great, but that is not the point. The finding is that the word pairs we generate for training should be fine. Another piece of the pipeline this validates is the one for running the evaluation on embeddings.
  7. Could it be that our architecture is sufficient to learn semantic embeddings, but it requires a more elaborate training schedule? Maybe it is lacking some minor optimizations that are needed to train on this challenging task? To see whether the architecture can train at all towards learning semantic embeddings, what will happen if we jumpstart it with some pretraining?
    1. This feels like a big result - for the first time our model, trained on audio, moves the needle on our evaluation tasks! The results are not great, but I didn't train it for very long. To arrive at this outcome, I pretrained the encoder to classify utterances (the classes being the corresponding words). The pretraining makes this a supervised task, but that is of little consequence - what matters is what this tells us about our model and about the situation! It seems feasible to learn semantic audio embeddings, and we possibly are on the right track. There is also a way to extend this approach back to the fully unsupervised regime while still doing pretraining (instead of training a classifier, train an autoencoder!).
  8. The decoder RNN is a complex piece of machinery sitting between our embeddings and our loss. Can we replace it with something simpler to help the training and be closer in spirit to how word2vec was trained (an extremely simple architecture)?
    1. I may have overdone it on simplicity, but I believe this is still worth exploring. Fundamentally, I do not see a reason why we should be processing the data in sequence. I am extremely curious what would happen if we simplified both the encoder and the decoder - this could be a path worth exploring to speed up training and possibly reach very good results. A single FC layer like here might not be the best choice - it is too simple - but something tailored to work with spatial data (CNNs) could be great to try here.
  9. Not all examples are equal - maybe limiting examples to a higher quality subset, reducing noise, can help with training?
    1. I took a deeper look at the dataset, queried it and listened to a couple of examples. There are extremely short utterances which don't seem to contain a lot of information. There are also very long utterances - very few of them - which generally correspond to very infrequent words or names of things / places. I suspect that we might get better results by being more selective with the examples we train on. It also seems that the pretrained embeddings from speech2vec, published alongside the paper, don't contain the full vocabulary. I added length information for each example. This gives me another idea - let me grab the vocab from the pretrained embeddings and do some sleuthing πŸ•΅οΈβ€β™‚οΈ on what approach the authors took.
  10. If I use subjectively better examples, while still balancing epochs, will the RNN encoder - RNN decoder model train?
    1. Seems like a dead end - the answer is no.
  11. Maybe balancing of epochs is counterproductive? Maybe it makes sense for more common words, the ones that have richer context, to appear more frequently during training?
    1. This doesn't seem to be the case. I have further verified this by training with a pretrained encoder - balancing of epochs gives better results.
  12. Maybe it is better for more frequent words to appear more frequently in the train set - after all, they have a richer context to be learned. Maybe balancing of epochs was the wrong idea? Can I verify this?
    1. As it turns out, this was a dead end - balancing of epochs helps with training.
  13. I am trying to discover what preprocessing the authors of the speech2vec paper applied to the dataset - maybe there is something we are missing that helps with training. I can use the pretrained embeddings that they share to reconstruct the vocabulary that they used.
    1. It turns out that they removed some uncommon words, reducing the possible vocab from 55k+ to 33k+. That will help with running experiments faster and can lead to improved results.
  14. Pulling everything together, training with pretrained encoder (in a supervised manner), will our model train better than before?
    1. The model seems to train better, as evaluated by one of the tasks. There is no big improvement. Could it be that training with SGD is essential here? Or maybe there is something else that we are missing?
  15. We observed that the model started to learn with supervised pretraining, but this is not something we want - the idea is to learn in a completely unsupervised way. In this notebook we attempt to pretrain in an unsupervised way (using an autoencoder architecture) and follow this by attempting to train using the pretrained encoder weights as a starting point.
    1. Unfortunately, the model doesn't train when we pretrain in an unsupervised way πŸ™. I presume it is because the model needs a better signal to discern between similarly sounding words that have different meanings, and unsupervised pretraining as we carried it out does not provide this.
  16. I do not fundamentally understand why we need an RNN in the encoder or the decoder. Attempting to train with an architecture that has a TCN encoder and decoder with a receptive field spanning the entire example (see the sketch after this list).
    1. Again - no go. The model does not learn semantic embeddings.
  17. This problem seems like a great fit for a siamese network.
    1. Unfortunately, again, we are not learning semantic embeddings.
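
A minimal sketch of the kind of TCN encoder point 16 refers to - dilated 1d convolutions whose receptive field spans the whole example, pooled into a single fixed-size embedding. The layer sizes, kernel width and the assumption of 13 mfcc coefficients are mine, not the exact configuration that was trained:

import torch
import torch.nn as nn

class TCNEncoder(nn.Module):
    """Stack of dilated 1d convolutions whose combined receptive field
    (1 + 4 * (1+2+4+8+16+32+64) = 509 frames) covers the ~291-frame examples,
    followed by mean pooling into a fixed-size embedding."""
    def __init__(self, n_mfcc=13, hidden=128, emb_dim=100, kernel_size=5):
        super().__init__()
        layers, in_channels = [], n_mfcc
        for dilation in (1, 2, 4, 8, 16, 32, 64):
            layers += [nn.Conv1d(in_channels, hidden, kernel_size,
                                 padding=dilation * (kernel_size - 1) // 2,
                                 dilation=dilation),
                       nn.ReLU()]
            in_channels = hidden
        self.tcn = nn.Sequential(*layers)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, x):                     # x: (batch, time, n_mfcc)
        h = self.tcn(x.transpose(1, 2))       # (batch, hidden, time)
        return self.proj(h.mean(dim=2))       # (batch, emb_dim)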

We have two model implementations:

Unfortunately, neither model learns embeddings that capture semantic similarities. The more advanced model learns embeddings that capture acoustic similarities (please see an example of this at the bottom of the notebook - for instance, here are the words mapped closest in cosine distance to the word 'slow': ['slow', 'low', 'hollow', 'follow', 'fellow']).
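
For reference, the nearest-neighbour check above amounts to a cosine-similarity lookup over the embedding matrix; a minimal sketch, assuming `vocab` is the list of words and `embeddings` the corresponding matrix produced by the model:

import numpy as np

def nearest_words(query, vocab, embeddings, k=5):
    # normalize rows so the dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[vocab.index(query)]
    return [vocab[i] for i in np.argsort(-sims)[:k]]

# nearest_words('slow', vocab, embeddings) -> ['slow', 'low', 'hollow', 'follow', 'fellow']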

I have performed quite extensive troubleshooting of the entire pipeline, especially for the advanced model. As a check, I validated that it can train on a simpler task, specifically as an AutoEncoder. With increased capacity it was able to bring the loss nearly to zero, which is a strong indication that the architecture is okay.

I tried training with a small batch size of 32, SGD and lr of 1e-3 as suggested by the author of the Speech2vec paper. I attempted training with a large batch of 2048, Adam, various learning rates, balancing of epochs by source word, modifying the architecture (various ways of combining hidden states from the encoder to pass to the decoder, increasing the number of layers, introducing nonlinearities), all to no avail.

Putting all this in perspective, and assuming the architecture is at least roughly okay, according to the speech2vec paper the model should learn semantic embeddings as a function of the data that we show it. What could be going on then? There is some chance we are not training in the correct way, not leveraging teacher forcing as we should. But I am not convinced this is the case - none of the experiment results would suggest it.

Maybe our encoder is not able to discern between similarly sounding words? That could be the culprit. But it doesn't seem that increasing the capacity of the encoder makes a difference. Also, as stated earlier, the architecture seems to provably work, though on a slightly different task. Could it be that we have a problem with our data? Maybe we are not processing it in the correct way? Maybe the alignment information that we are using is not as good as it could be?

I would really appreciate any thoughts on this πŸ™‚. Thank you very much for reading and for your consideration.

radekosmulski commented 3 years ago

Big news - just got a reply from one of the authors of the speech2vec paper! πŸ₯³

As it turns out, speech2vec was trained with 100% teacher forcing! That is a very important hyperparameter. The next steps for me will be to implement something like the advanced model, but using the cudnn optimized rnn loop with effectively 100% teacher forcing.
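
To illustrate what "effectively 100% teacher forcing with the cudnn optimized rnn loop" could look like: since the decoder always sees the ground-truth previous frame, the whole (shifted) target sequence can be pushed through nn.LSTM in a single call instead of a Python loop over time steps. A rough sketch - the dimensions, layer names and the zero start-of-sequence frame are assumptions:

import torch
import torch.nn as nn

class TeacherForcedDecoder(nn.Module):
    """Always condition on the ground-truth previous frame, so the entire
    target sequence goes through the (cuDNN-backed) LSTM in one call."""
    def __init__(self, n_mfcc=13, hidden=100):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mfcc)

    def forward(self, target, h0c0):          # target: (batch, time, n_mfcc)
        # shift the ground truth right by one step, starting from a zero frame
        shifted = torch.cat([torch.zeros_like(target[:, :1]), target[:, :-1]], dim=1)
        output, _ = self.lstm(shifted, h0c0)  # h0c0: (h0, c0) derived from the encoder's final state
        return self.out(output)               # predicted mfcc frames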

This could possibly help the model discern finer details between the input words, I would imagine. Still, one line of inquiry remains - is the LibriSpeech data aligned with words as well as it can be? I should take another look at the code for data creation and possibly search for more information on the Montreal Forced Aligner.

ShaunSpinelli commented 3 years ago

Awesome news @radekosmulski ! Thanks for the detailed update. I have been spending some time brushing up on things I need to understand to be able to contribute a bit more and have been tracking your updates to the notebooks. I will go over the data processing, check the data alignment and see if I can find anything that could be causing issues.

radekosmulski commented 3 years ago

Thank you @ShaunSpinelli! πŸ™‚ Really appreciate another set of eyes looking at this!

radekosmulski commented 3 years ago

Unfortunately, training with 100% teacher forcing didn't result in the model learning semantic embeddings πŸ™ I trained for over a day, both with a large BS and Adam and with a small BS and SGD (to resemble the training conditions in the paper). Both approaches learn acoustic embeddings. At this point I am starting to believe that there is no substantial difference between training with Adam and SGD for this phase of the exploration - the major difference being that one, as expected, trains much more slowly than the other. Switching to Adam and a large BS to be able to run experiments faster.

Based on analysis of results, and conversations with @aza and @pcbermant, I think the best lead we have at the moment is that our encoder lacks the discriminative power to tell similarly sounding words apart. As suggested by @pcbermant, I am attempting to address this by adding capacity to the encoder. Second, I created a model with attention over the hidden states of the encoder. This should give the model a better ability to focus on the part of the word that is significant for discerning it from other similarly sounding words - assuming, of course, that our targets carry enough signal to suggest to the model that learning to discern words better could lower the loss.
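
For reference, the attention mechanism over encoder hidden states can be as small as a learned scoring vector that weights the per-frame outputs before they are summed into the embedding; a minimal sketch (the sizes and names are assumptions, not the exact model):

import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Collapse encoder outputs (batch, time, hidden) into one embedding
    using learned attention weights over the time dimension."""
    def __init__(self, hidden=100):
        super().__init__()
        self.score = nn.Linear(hidden, 1)

    def forward(self, encoder_outputs):
        weights = torch.softmax(self.score(encoder_outputs), dim=1)  # (batch, time, 1)
        return (weights * encoder_outputs).sum(dim=1)                # (batch, hidden)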

Running both experiments now - there is space to funnel even more compute to these two new mechanisms. Thinking about the model, it seems to me that increasing the capacity of the encoder while keeping the decoder simple is the way to go (we want as much information to make its way into the embeddings and keep the decoder relatively simple, but not too simple so that it cannot act on the information provided by the encoder).

Should have results of these training runs soon. If I don't see an improvement, I might either combine these two mechanisms or add even more capacity to these enhancements to our model, to get a more definite answer whether we are on the right track here or not.

The last bit of information from Yu-An Chung was very helpful - we no longer have to concern ourselves with tailoring the schedule of teacher forcing, which could have been quite a time sink. It's great that we can set teacher forcing to always be on, utilize the optimized cudnn lstm loop and focus our attention on exploring the aspects that are still not clear to us.

radekosmulski commented 3 years ago

The results of the last two experiments (adding layers to encoder, adding an attention mechanism to constructing the embeddings) are in, no changes - the model fails to learn semantic embeddings.

Had a great call with @pcbermant yesterday. He made a great suggestion - it might be beneficial to train on a simpler task to better understand where our model falls short. He also suggested a very interesting task - have the model classify nouns by their plural vs singular forms. This is useful on multiple levels - first of all, it will tell us whether our encoder, as is, can distinguish between similarly sounding words. Knowing this will be very useful. It will also to some extent validate whether our data preprocessing (or at least the part where we go from audio to mfcc features) works, i.e. whether it allows similarly sounding words to be distinguished. This will be the next thing I plan on implementing.
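
A sketch of what that probing task could look like - keep the encoder frozen, treat its last hidden state as the embedding, and fit a linear singular-vs-plural classifier on top. The `encoder`, `probe_loader` and `hidden_size` names are placeholders standing in for pieces of the existing pipeline:

import torch
import torch.nn as nn

hidden_size = 100                              # illustrative embedding size
probe = nn.Linear(hidden_size, 2)              # two classes: singular / plural
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for mfcc_batch, is_plural in probe_loader:     # is_plural: 0/1 label per utterance
    with torch.no_grad():                      # the encoder stays frozen
        embedding = encoder(mfcc_batch)        # last hidden state of the encoder
    loss = loss_fn(probe(embedding), is_plural)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()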

I have given more thought to what could be leading to the model learning acoustic rather than semantic embeddings; here are some possibilities worth exploring:

JoseMMontoro commented 3 years ago

Hi @radekosmulski , I've been following your updates here. The recommendations above seem very promising. Knowing how the model does on a simpler task will be very interesting. Audio is incredibly complex and diverse and creating semantic embeddings directly from audio would be very cool, but definitely challenging. All the ASR models that I know of use some type of Language Model to support the Acoustic Model! I still have to dive deep into the Speech2Vec paper because it seems like it did work for them, though.

Something else I was thinking could be helpful is to train embeddings on the text only. The resulting embeddings would be our "ceiling" for the acoustic embeddings, right? In the sense that the ones trained on acoustic features couldn't get any better than the text embeddings.

By having a benchmark of how well these text embeddings work, we would know the delta compared to the acoustic ones. It would also help us understand, more generally, how well suited the librispeech data is for training embeddings in the first place.

Has this already been done?

radekosmulski commented 3 years ago

Thank you for sharing your insights @JoseMMontoro! Really appreciate your thoughts.

I completely agree that the task of learning embeddings from audio seems very hard. That's what makes the result in the paper so remarkable to me. I tried searching the literature but I can't seem to find any other good example of learning embeddings from audio that capture semantics... There is, for instance, the Unsupervised Learning of Semantic Audio Representations paper, but as I understand it, that is more about learning audio features that can be useful for downstream tasks. Then there is Audio Word2vec by one of the authors of Speech2vec, where embeddings capturing acoustic features are learned using an autoencoder - this one is much easier for me to wrap my head around. The jump that occurs in speech2vec from learning acoustic features to semantic features does seem like quite a considerable leap to me!

What is also quite surprising is that in the speech2vec paper the authors report that embeddings learned from audio perform better than embeddings learned from the Librispeech text representation! They used the fasttext implementation for training the word embeddings. I have not yet attempted training text embeddings, but you are right - that is a very useful path to explore. In the data preprocessing notebook I generate word pairs for training, around 18 million examples. It would be interesting to see what results we could get on these word pairs - maybe there is something not right with how I generate the data, and that leads to us making no progress on learning semantic audio embeddings?
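
If it helps, a "hacked together" text-embedding training on these exact pairs can be as simple as an embedding table plus a linear layer trained with cross-entropy, one (source word, target word) pair at a time; a sketch under those assumptions, where `vocab` and `pair_loader` (yielding the pairs mapped to integer ids) stand in for the existing data pipeline:

import torch
import torch.nn as nn

class SkipgramText(nn.Module):
    """Plain skip-gram over the exact (source word, target word) pairs:
    embed the source word and predict the target word over the vocabulary."""
    def __init__(self, vocab_size, emb_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.out = nn.Linear(emb_dim, vocab_size)

    def forward(self, source_ids):
        return self.out(self.emb(source_ids))  # logits over the vocabulary

model = SkipgramText(vocab_size=len(vocab))    # vocab: list of words in the pairs
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for source_ids, target_ids in pair_loader:     # word pairs mapped to integer ids
    loss = loss_fn(model(source_ids), target_ids)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()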

This definitely sounds like a super valuable area to explore further - thanks for suggesting this approach πŸ™‚

radekosmulski commented 3 years ago

Just wanted to give a quick update πŸ™‚. I have an exciting new result, following on @JoseMMontoro's suggestion above.

Also, I am now adding a list of experiments to the first comment for this issue. This should make it easier to stay up to date with the work. I am not sure if notifications go out based on updating comments so taking the liberty to ping the folks who joined the discussion in this issue @ShaunSpinelli @JoseMMontoro @kzacarian @aza @bs @pcbermant - for a summary of experiments with findings please check the first comment here!

I have several thoughts on what to try next. I am thinking of using pretrained weights for the encoder (this would mean this is no longer a fully unsupervised task, but that is okay - if our model can learn semantic embeddings once we provide it with a crutch, that might indicate we just need to be more creative with our training schedule). If the model fails to learn even with pretrained weights, that would be a strong indication that there might be something more fundamentally not right with our approach.
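
A sketch of that pretraining-as-a-crutch idea - train the encoder with a linear head to classify which word an utterance is, then transfer the encoder weights into the encoder-decoder model. The class and variable names here are hypothetical:

import torch
import torch.nn as nn

class EncoderClassifier(nn.Module):
    """Supervised pretraining: encoder + linear head predicting the word id."""
    def __init__(self, encoder, hidden_size, vocab_size):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, mfcc):
        return self.head(self.encoder(mfcc))   # logits over the words

# after pretraining with cross-entropy on (utterance, word) pairs,
# reuse the weights as the starting point for the seq2seq encoder
seq2seq.encoder.load_state_dict(classifier.encoder.state_dict())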

The second thought revolves around taking the encoder that we know works at least to some extent and putting it to the test of learning embeddings with an even simpler decoder, one that does not use an RNN. I think this can be even more true to how the original method of learning word embeddings from text works. It also removes one complex piece of the model sitting between the embeddings and the loss - it would be very interesting to see what happens once we shorten that path, or at least replace something as complex as an RNN cell with something simpler. Especially since we don't necessarily need to be generating the output one temporal slice at a time! An LSTM layer is so complex, and can do so many different things over the 291 steps we ask it to perform, that intuitively it doesn't strike me as an ideal component to have sitting between our loss and our embeddings.
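
To make this concrete, the simplest non-RNN decoder would be a single FC layer mapping the embedding to the whole flattened target in one shot, with no temporal loop at all. A sketch - the 291 frames come from the discussion above, while the 13 mfcc coefficients and the sizes are assumptions:

import torch.nn as nn

class FlatDecoder(nn.Module):
    """Predict the whole target mfcc block from the embedding with one
    linear layer - no recurrence, no step-by-step generation."""
    def __init__(self, emb_dim=100, n_frames=291, n_mfcc=13):
        super().__init__()
        self.fc = nn.Linear(emb_dim, n_frames * n_mfcc)
        self.n_frames, self.n_mfcc = n_frames, n_mfcc

    def forward(self, embedding):
        out = self.fc(embedding)                         # (batch, n_frames * n_mfcc)
        return out.view(-1, self.n_frames, self.n_mfcc)  # (batch, n_frames, n_mfcc)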

Any thoughts or suggestions are always very welcome πŸ™‚ Super excited to be on this journey together πŸ™

radekosmulski commented 3 years ago

I just added points 7 and 8. Super excited about the result in point 7. I also think that the approach from point 8 is extremely promising and worth pursuing.

I think my plan would be to explore both of these further. I suspect following these two lines of inquiry might lead us to training semantic embeddings on this dataset, and learning a lot about training semantic embeddings in general.

aza commented 3 years ago

Here's a thought (not sure if good or bad): what if we do one level of recursion? That is, we know our encoder is learning reasonable acoustic representations (as the noun+plural test showed)... what if we fed those higher-level acoustic representations as input into a language model to produce an even higher-level representation... perhaps semantic?

radekosmulski commented 3 years ago

Thank you @aza for this suggestion - I think it's worth looking into what results the approach that you outline might bring. Will start looking at how we can work towards this.

JoseMMontoro commented 3 years ago

Hi @radekosmulski . I finally was able to read the Speech2Vec paper and have some thoughts/questions I could share with you if that’s ok. But I wanted to hear how your efforts are going first. I’ve read your updates on the first comment in this thread, but how are things going, more generally? What are you exploring at the moment?

radekosmulski commented 3 years ago

Hi @JoseMMontoro, great to hear from you! Would love to hear your thoughts. I updated the list of experiments with a couple new ones. I am now exploring pretraining in an unsupervised manner and also training with SGD. I feel these are our best bets with the current architecture.

There is a decent chance that either of these will work... πŸ€žπŸ™‚ If these fail, I am really tempted to try something completely new. I would like to explore the dataset using something simpler, such as a temporal convolution NN or maybe even a more standard conv2d AE based arch.

These are all loose thoughts at this point - would love to learn what you think πŸ™‚

radekosmulski commented 3 years ago

Pretty big news 😁 Just found a major bug affecting recent experiments:

class Dataset():
    def __init__(self, n):
        self.vocab = vocab * n
    def __len__(self):
        return len(self.vocab)
    def __getitem__(self, idx):
        row_idx = np.random.randint(len(word2row_idxs[self.vocab[idx]]))  # <------------ πŸ›
        source_fn = df.source_fn[row_idx]
        target_fn = df.target_fn[row_idx]
        x = normalize_data(prepare_features(source_fn)).transpose(1, 0)
        y = normalize_data(prepare_features(target_fn)).transpose(1, 0)
        return x, y

The code here doesn't make sense: row_idx is drawn as a random position within the list of rows for the given word, but is then used to index the full dataframe directly. It potentially made sense at an earlier point, but I probably made some changes or copied some functionality over while working on this dataset, and it doesn't make sense now. The line should read: row_idx = np.random.choice(word2row_idxs[self.vocab[idx]])
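
For clarity, the corrected __getitem__ with the fix applied (everything else unchanged):

    def __getitem__(self, idx):
        # pick one of the rows that actually belong to this word
        row_idx = np.random.choice(word2row_idxs[self.vocab[idx]])
        source_fn = df.source_fn[row_idx]
        target_fn = df.target_fn[row_idx]
        x = normalize_data(prepare_features(source_fn)).transpose(1, 0)
        y = normalize_data(prepare_features(target_fn)).transpose(1, 0)
        return x, y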

Experimenting with a new arch (siamese network), also working on speeding up data loading and will rerun experiments πŸ™‚
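
A rough sketch of the siamese direction - one shared encoder embeds both utterances of a skip-gram pair, co-occurring pairs are pulled together and shuffled pairs pushed apart. The loss, the in-batch shuffling trick and the sizes are assumptions, not necessarily what will actually be run:

import torch
import torch.nn as nn

encoder = nn.LSTM(13, 100, batch_first=True)   # shared weights for both branches; 13 = assumed mfcc dim
loss_fn = nn.CosineEmbeddingLoss(margin=0.5)
optimizer = torch.optim.Adam(encoder.parameters(), lr=1e-3)

def embed(mfcc):
    _, (h, _) = encoder(mfcc)
    return h[-1]                               # last hidden state as the embedding

for source, target in pair_loader:             # co-occurring utterance pairs
    ones = torch.ones(source.size(0))
    pos = loss_fn(embed(source), embed(target), ones)              # pull true pairs together
    neg = loss_fn(embed(source), embed(target).roll(1, 0), -ones)  # push mismatched pairs apart
    loss = pos + neg
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()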

radekosmulski commented 3 years ago

I reran some experiments with the bug fix, ran some new experiments, added a new notebook to the first issue in this thread.

At this point, I have to conclude that I will not be able to make this work as is. We might be missing some crucial detail, maybe this approach doesn't work, or maybe there is a bug somewhere that I am unable to find.

Below I list 3 ways to move forward on this:

  1. Add training wheels - stop treating this as an unsupervised problem. Use the encoder on audio representations but introduce a linear layer (instead of the decoder) and a softmax. Or flip the problem around: train regular embeddings (a pytorch embedding layer) and attempt to reconstruct the target audio representation from these embeddings. Essentially, we would keep the structure of the training data as is but would stop framing this as an unsupervised problem. This can give us greater insight into which parts of the architecture work, and could lead to a better understanding of the domain that translates into getting this to work in a fully unsupervised regime.
  2. Reshape the data, change the architecture - instead of using the skipgram formulation, reshape the data to _ _ x _ _, where we use the context words to predict x. Or _ _ _ _ x, use the context to predict the subsequent word (see the sketch after this list).
  3. Pick an arbitrary, modern architecture and attempt training on this data. For instance, attempt training a transformer model. This would introduce a disconnect between the current and the next phase of working on this.
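
A tiny sketch of the data reshaping in option 2 - going from skip-gram pairs to (context window, centre word) examples; pure data munging, with the window size as an assumption:

def make_cbow_examples(words, window=2):
    """Turn a sequence of aligned words into (context, centre) training examples:
    ['the', 'cat', 'sat', 'on', 'the', 'mat'] ->
    (['the', 'cat', 'on', 'the'], 'sat'), ..."""
    examples = []
    for i, centre in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        if context:
            examples.append((context, centre))
    return examples
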
JoseMMontoro commented 3 years ago

Hi @radekosmulski , sorry to hear that. I'm still working on trying to reproduce the embeddings based on text that were trained in the paper. That might be useful also in identifying parts of the architecture that need to be revised.

The options you mentioned above sound promising. I don't fully understand the first one but training on cbow instead of skipgram and using a transformer model are things that I've thought of too, and would be interesting. On the transformer model, why does it introduce a disconnect between this and the next phases?

radekosmulski commented 3 years ago

You are right, it would be a continuation of the work - maybe I didn't use the best wording there, saying 'disconnect' when describing option nr 3.

In an ideal world, it would be great to have an end-to-end pipeline (from reading in and transforming the data, all the way through training, to evaluating the results) working before we move on to the next stage of using more complex models. Having something working, even if not great, would be a great source of confidence that all the pieces fit together and that we are ready for the next iteration.

Unfortunate wording, agreed, the above is what I had in mind πŸ™‚

radekosmulski commented 3 years ago

I set up an experiment that falls under point 1, adding training wheels: I retain the encoder, it runs on the representation of audio, but I got rid of the decoder entirely. Instead, the encoder outputs embeddings that go straight into a linear layer and a softmax.

I think we have arrived at the crux of the matter. Even with this simpler setup, we don't seem to be learning very good embeddings. Yes, there is a tiny uptick on the WS353 task to 0.1, but that is still a very low score. There are no positive movements on the other two tasks. We probably need to be able to succeed in training on a simplified scenario such as this one before we can get the entire architecture to train. It is also interesting and useful to think about what this experiment tells us about our setup and where we might have a weak point. There is always the chance that there is a bug somewhere, but I am also thinking that maybe we are not doing something right in the way we present data to our model.

radekosmulski commented 3 years ago

On a suggestion from @aza, I added a way of visualizing how the embeddings for a pair of words relate to each other. Each point is the embedding of a single utterance, projected to 2d using umap. This can serve as a diagnostic tool allowing us to learn more about what our model is doing.
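
The plotting itself is a short umap-learn + matplotlib routine over the per-utterance embeddings of the two words; a sketch, where `embed_utterances` is an assumed stand-in for running the encoder over all utterances of a word:

import matplotlib.pyplot as plt
import numpy as np
import umap

def plot_word_pair(word_a, word_b):
    emb_a = embed_utterances(word_a)   # (n_utterances_a, emb_dim), from the encoder
    emb_b = embed_utterances(word_b)
    points = umap.UMAP(n_components=2).fit_transform(np.vstack([emb_a, emb_b]))
    plt.scatter(points[:len(emb_a), 0], points[:len(emb_a), 1], label=word_a)
    plt.scatter(points[len(emb_a):, 0], points[len(emb_a):, 1], label=word_b)
    plt.legend()
    plt.show()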

Below an example of the plots:

[image: example plot - per-utterance embeddings of two words projected to 2d with umap]

JoseMMontoro commented 3 years ago

Hi @radekosmulski, sorry for being unresponsive for a while - got busy with work before the holidays and also just moved to a new place :) Is the bottleneck you mentioned in your previous comments still there? Have you moved to other tasks for now?

radekosmulski commented 3 years ago

Hi @JoseMMontoro! Congrats on the new place! πŸ™‚ I am still exploring the librispeech dataset, but have to some extent given up on the speech2vec approach. Maybe that is not the right way to put it, though - I am slowly starting to put it on the backburner and transition to using other datasets. There are a couple of things I learned in the process; the one from literally this morning is that when constructing the speech2vec vocab, the words were not lemmatized or tokenized using something more modern like spacy. This is quite interesting and probably has significant implications for the performance of embeddings trained on audio vs text in this manner (in audio, "cat" and "cat's" sound similar, but if you train text embeddings without tokenization they end up as two distinct entities - hopefully with similar embeddings after training, but not on the input side of things). This also has other implications, as there are many words that appear in the test tasks that are very infrequent in the dataset... (the lack of tokenization only aggravates this problem)
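
As a tiny illustration of the tokenization point (a sketch using spacy, mentioned above only as one possible tool; the exact output depends on the model used):

import spacy

nlp = spacy.load("en_core_web_sm")
# without lemmatization, "cat" and "cat's" stay two separate vocabulary entries;
# lemmatizing collapses them onto the same lemma
print([token.lemma_ for token in nlp("the cat's whiskers")])
# e.g. ['the', 'cat', "'s", 'whisker']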

But I could babble on and on πŸ™‚ In summary, I am hoping to continue experiments with data I have greater control over. I am also hoping to start using transformers and training on continuous streams of text.

Apologies on my end as well for being so quiet here on github. Hoping to have more ready soon and will update here!