Big news - just got a reply from one of the authors of the speech2vec paper! :partying_face:
As it turns out, speech2vec was trained with 100% teacher forcing! That is a very important hyperparameter. The next steps for me will be to implement something like the advanced model, but using the cudnn optimized rnn loop with effectively 100% teacher forcing.
This could possibly help the model discern finer details between the input words, I would imagine. Still, one line of inquiry remains - is the LibriSpeech data aligned with words as well as it can be? I should take another look at the code for data creation and possibly search for more information on the Montreal Forced Aligner.
Awesome news @radekosmulski ! Thanks for the detailed update. I have been spending some time brushing up on things I need to understand to be able to contribute a bit more and have been tracking your updates to the notebooks. I will go over the data processing and check the data alignment and see if I can find anything that could be causing issues.
Thank you @ShaunSpinelli! Really appreciate another set of eyes looking at this!
Unfortunately, training with 100% teacher forcing didn't result in the model learning semantic embeddings. I trained for over a day, both with a large batch size and Adam and with a small batch size and SGD (to resemble the training conditions in the paper). Both approaches learn acoustic embeddings. At this point I am starting to believe that there is no substantial difference between training with Adam and SGD for this phase of the exploration - the major difference being that SGD with a small batch size, as expected, trains much slower. Switching to Adam and a large batch size to be able to run experiments faster.
Based on analysis of results, and conversations with @aza and @pcbermant, I think the best lead we have at the moment is that our encoder lacks the discriminative power to tell similar-sounding words apart. As suggested by @pcbermant, I am attempting to address this by adding capacity to the encoder. In addition, I created a model with attention over the hidden states of the encoder. This should give the model a better ability to focus on the part of the word that is significant for telling it apart from other similar-sounding words. That is of course assuming our targets carry enough signal to suggest to the model that learning to discern words better could lead to lowering the loss.
Running both experiments now - there is space to funnel even more compute to these two new mechanisms. Thinking about the model, it seems to me that increasing the capacity of the encoder while keeping the decoder simple is the way to go (we want as much information as possible to make its way into the embeddings, and the decoder to stay relatively simple, but not so simple that it cannot act on the information provided by the encoder).
Should have results of these training runs soon. If I don't see an improvement, I might either combine these two mechanisms or add even more capacity to these enhancements to our model, to get a more definite answer whether we are on the right track here or not.
The last bit of information from Yu-An Chung was very helpful - we no longer have to concern ourselves with tailoring the schedule of teacher forcing, which could be quite a time sink. It's great that we can set teacher forcing to always be on, utilize the optimized cudnn lstm loop and focus our attention on exploring the aspects that are still not clear to us.
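As a rough illustration of why always-on teacher forcing fits the optimized loop: every decoder input is a ground-truth target frame shifted by one step, so nothing depends on the model's own predictions and the whole sequence can be handed to the RNN in one call. The sketch below only builds those shifted inputs; the function name `teacher_forced_inputs` and the all-zero start frame are assumptions for illustration, not code from the notebooks.

```python
from typing import List, Sequence

Frame = List[float]


def teacher_forced_inputs(target: Sequence[Frame], start_frame: Frame) -> List[Frame]:
    """Decoder inputs under 100% teacher forcing.

    Step t receives the ground-truth frame from step t - 1, and step 0
    receives a fixed start frame. No input depends on a model prediction.
    """
    if not target:
        return []
    return [list(start_frame)] + [list(frame) for frame in target[:-1]]


# Toy target sequence: 3 time steps, 2 features per step (made-up values).
target = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
start = [0.0, 0.0]  # assumed start-of-sequence frame

decoder_inputs = teacher_forced_inputs(target, start)
for step, (inp, tgt) in enumerate(zip(decoder_inputs, target)):
    print(f"step {step}: input={inp} -> target={tgt}")
```

With a teacher forcing ratio below 100%, some inputs would have to come from the previous prediction, which forces a step-by-step Python loop instead.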
The results of the last two experiments (adding layers to encoder, adding an attention mechanism to constructing the embeddings) are in, no changes - the model fails to learn semantic embeddings.
Had a great call with @pcbermant yesterday. He made a great suggestion - it might be beneficial to train on a simpler task to better understand where our model falls short. He also suggested a very interesting sounding task - have the model classify nouns, their plural vs singular forms. This is really useful on multiple levels - first of all, it will tell us if our encoder can distinguish as is between similarly sounding words. Knowing this will be very useful. This will also to some extent validate whether our data preprocessing (or at least the part where we are going from audio to mfcc features) works or not, whether it allows for the distinguishing of similarly sounding words. This will be the next thing I plan on implementing.
I have given more thought to what could be leading the model to learn acoustic rather than semantic embeddings. Here are some possibilities worth exploring:
Hi @radekosmulski , I've been following your updates here. The recommendations above seem very promising. Knowing how the model does on a simpler task will be very interesting. Audio is incredibly complex and diverse and creating semantic embeddings directly from audio would be very cool, but definitely challenging. All the ASR models that I know of use some type of Language Model to support the Acoustic Model! I still have to dive deep into the Speech2Vec paper because it seems like it did work for them, though.
Something else I was thinking could be helpful is to train embeddings on the text only. The resulting embeddings would be our "ceiling" for the acoustic embeddings, right? In the sense that the ones trained on acoustic features couldn't get any better than the text embeddings.
By having a benchmark of how well these text embeddings work we would know the delta compared to the acoustic ones. It would also help us understand how good the librispeech data is to train embeddings more generally, in the first place.
Has this already been done?
Thank you for sharing your insights @JoseMMontoro! Really appreciate your thoughts.
I completely agree that the task of learning embeddings from audio seems very hard. That's what makes the result in the paper very remarkable to me. I tried searching the literature but I can't seem to find any other good example of learning embeddings from audio that would capture semantics... There is for instance the Unsupervised Learning of Semantic Audio Representations paper, but as I understand it is more about learning audio features that can be useful for downstream tasks. Then there is the Audio Word2vec by one of the authors of Speech2vec where embeddings capturing acoustic features using an autoencoder are learned - this one is much easier for me to wrap my head around. The jump that occurs in speech2vec from learning acoustic features to semantic features does seem like quite a considerable leap to me!
What is also quite surprising is that in the speech2vec paper the authors report that embeddings learned from audio perform better than embeddings learned from the Librispeech text representation! They used the fasttext implementation for training word embeddings. I have not yet attempted training text embeddings but you are right - that is a very useful path to explore. In the data preprocessing notebook I generate word pairs for training, around 18 million examples. It would be interesting to see what results we could get on these word pairs - maybe there is something not right with how I generate the data and that is why we are making no progress on learning semantic audio embeddings?
This definitely sounds like a super valuable area to explore further - thanks for suggesting this approach!
Just wanted to give a quick update. I have an exciting new result, following on @JoseMMontoro's suggestion above.
Also, I am now adding a list of experiments to the first comment for this issue. This should make it easier to stay up to date with the work. I am not sure if notifications go out based on updating comments so taking the liberty to ping the folks who joined the discussion in this issue @ShaunSpinelli @JoseMMontoro @kzacarian @aza @bs @pcbermant - for a summary of experiments with findings please check the first comment here!
I have several thoughts on what to try next. I am thinking of using pretrained weights for the encoder. This would no longer be a fully unsupervised task, but that is okay: if our model can learn semantic embeddings once we provide it with a crutch, that might indicate we just need to be more creative with our training schedule. If the model fails to learn even with pretrained weights, that would be a strong indication there might be something more fundamentally wrong with our approach.
The second thought revolves around taking the encoder that we know works at least to some extent and putting it to the test of learning embeddings with an even simpler decoder, one that does not utilize an RNN. I think this can be even more true to how the original method of learning word embeddings from text works. It also removes one complex piece from the model that sits between the embeddings and the loss - it would be very interesting to see what happens once we shorten the path, or at least replace something as complex as an RNN cell with something simpler. Especially since we don't necessarily need to be generating the output one temporal slice at a time! An LSTM layer is so complex, and can do so many different things over the 291 steps we ask it to perform, that intuitively it doesn't strike me as an ideal component to have sitting between our loss and our embeddings.
Any thoughts or suggestions are always very welcome. Super excited to be on this journey together!
I just added point 7 and 8. Super excited about the result in point 7. I also think that the approach from point 8 is extremely promising, worth pursuing.
I think my plan would be to explore both of these further. I suspect following these two lines of inquiry might lead us to training semantic embeddings on this dataset, and learning a lot about training semantic embeddings in general.
Here's a thought (not sure if good or bad): what if we do one level of recursion? That is, we know our encoder is learning reasonable acoustic representations (as the noun+plural test showed)... what if we fed those higher-level acoustic representations as input into a language model to produce an even higher-level representation... perhaps semantic?
Thank you @aza for this suggestion - I think it's worth looking into what results the approach that you outline might bring. Will start looking at how we can work towards this.
Hi @radekosmulski . I finally was able to read the Speech2Vec paper and have some thoughts/questions I could share with you if that's ok. But I wanted to hear how your efforts are going first. I've read your updates on the first comment in this thread, but how are things going, more generally? What are you exploring at the moment?
Hi @JoseMMontoro, great to hear from you! Would love to hear your thoughts. I updated the list of experiments with a couple new ones. I am now exploring pretraining in an unsupervised manner and also training with SGD. I feel these are our best bets with the current architecture.
There is a decent chance that either of these will work... If these fail, I am really tempted to try something completely new. I would like to explore the dataset using something simpler, such as a temporal convolution NN or maybe even a more standard conv2d AE based arch.
These are all loose thoughts at this point - would love to learn what you think.
Pretty big news! Just found a major bug affecting recent experiments:

    class Dataset():
        def __init__(self, n):
            self.vocab = vocab * n
        def __len__(self):
            return len(self.vocab)
        def __getitem__(self, idx):
            row_idx = np.random.randint(len(word2row_idxs[self.vocab[idx]]))  # <-- bug is on this line
            source_fn = df.source_fn[row_idx]
            target_fn = df.target_fn[row_idx]
            x = normalize_data(prepare_features(source_fn)).transpose(1, 0)
            y = normalize_data(prepare_features(target_fn)).transpose(1, 0)
            return x, y

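The code here doesn't make sense. It probably made sense earlier on, but I must have made some changes or copied some functionality over while working on this dataset. `np.random.randint(len(...))` returns a position within the list of row indices for the word, not one of the row indices themselves, so the rows that got sampled did not correspond to the requested word. The line should read: `row_idx = np.random.choice(word2row_idxs[self.vocab[idx]])`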
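To make the difference between the two lines concrete, here is a small self-contained sketch. It uses the standard library `random` module as a stand-in for the NumPy calls, and the contents of `word2row_idxs` are made up for illustration.

```python
import random

# Hypothetical stand-in for the real mapping: word -> dataframe row indices
# where that word is the source word.
word2row_idxs = {"cat": [10, 42, 97]}


def buggy_row_idx(word, rng):
    # Mirrors np.random.randint(len(...)): draws a position in the list
    # (0, 1 or 2 here), which is then wrongly used as a dataframe row index.
    return rng.randrange(len(word2row_idxs[word]))


def fixed_row_idx(word, rng):
    # Mirrors np.random.choice(...): draws one of the stored row indices.
    return rng.choice(word2row_idxs[word])


rng = random.Random(0)
buggy_samples = {buggy_row_idx("cat", rng) for _ in range(200)}
fixed_samples = {fixed_row_idx("cat", rng) for _ in range(200)}

print("buggy draws:", sorted(buggy_samples))
print("fixed draws:", sorted(fixed_samples))
```

The buggy version only ever returns small integers, so every word ends up sampling from the first few rows of the dataframe regardless of which word was requested.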
Experimenting with a new arch (siamese network), also working on speeding up data loading, and will rerun experiments.
I reran some experiments with the bug fix, ran some new experiments, added a new notebook to the first issue in this thread.
At this point, I can conclude that I will not be able to make this work as is. We might be missing some crucial detail, maybe this approach doesn't work, or maybe there is a bug somewhere that I am unable to find.
Below I list 3 ways to move forward on this:
`_ _ x _ _`, where we use the context words to predict `x`. Or `_ _ _ _ x`, where we use the context to predict the subsequent word.
Hi @radekosmulski , sorry to hear that. I'm still working on trying to reproduce the embeddings based on text that were trained in the paper. That might be useful also in identifying parts of the architecture that need to be revised.
The options you mentioned above sound promising. I don't fully understand the first one but training on cbow instead of skipgram and using a transformer model are things that I've thought of too, and would be interesting. On the transformer model, why does it introduce a disconnect between this and the next phases?
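For reference, the two context layouts mentioned above (`_ _ x _ _` and `_ _ _ _ x`) can be sketched on plain text as follows. This is only an illustration on a toy sentence; the function names and window sizes are assumptions, and the real pipeline would operate on audio segments rather than strings.

```python
def cbow_examples(words, window=2):
    """Layout `_ _ x _ _`: predict the centre word from `window` words on each side."""
    examples = []
    for i in range(window, len(words) - window):
        context = words[i - window:i] + words[i + 1:i + 1 + window]
        examples.append((context, words[i]))
    return examples


def next_word_examples(words, context_size=4):
    """Layout `_ _ _ _ x`: predict a word from the `context_size` words before it."""
    examples = []
    for i in range(context_size, len(words)):
        examples.append((words[i - context_size:i], words[i]))
    return examples


sentence = "the quick brown fox jumps over the lazy dog".split()

cbow = cbow_examples(sentence)
next_word = next_word_examples(sentence)

print(cbow[0])
print(next_word[0])
```

Skip-gram inverts the first layout: the centre word is the input and each context word is a separate target.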
You are right, it would be a continuation of the work, maybe I didn't use the greatest of wording there saying 'disconnect' when describing option nr 3.
In an ideal world, it would be great if we had an end to end pipeline (from reading in and transforming the data all the way through training to evaluating results) working, before we move onto the next stage of using more complex models. Having something working, even if not great, would be a great source of confidence that all the pieces are working and that we are ready for the next iteration.
Unfortunate wording, agreed, the above is what I had in mind.
I set up an experiment that would fall under section 1 (adding training wheels), where I retain the encoder, which runs on the representation of audio, but get rid of the decoder entirely. Instead, the encoder outputs embeddings that go straight into a linear layer and then into a softmax.
I think we have arrived at the crux of the matter. Even with this simpler setup, we don't seem to be learning very good embeddings. Yes, there is a tiny uptick on the WS353 task to 0.1, but that is still a very low score. No positive movement on the other two tasks. We probably need to be able to succeed in training on a simplified scenario such as this one before we can get the entire architecture to train. It is also interesting and useful to think about what this experiment tells us about our setup and where we might have a weak point. There is always the chance there might be a bug somewhere, but I am also thinking maybe we are not doing something right in the way we present data to our model.
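For context on what a score like 0.1 means, here is a minimal sketch of a word-similarity evaluation, assuming the score is the usual Spearman rank correlation between the model's cosine similarities and human similarity ratings. The embeddings and word pairs below are toy values, not WS353 data, and `similarity_score` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    # Spearman correlation is the Pearson correlation of the ranks.
    return pearson(average_ranks(x), average_ranks(y))


def similarity_score(embeddings, pairs):
    """pairs: (word_a, word_b, human_rating). Pairs with unknown words are skipped."""
    model_sims, human_sims = [], []
    for word_a, word_b, rating in pairs:
        if word_a in embeddings and word_b in embeddings:
            model_sims.append(cosine(embeddings[word_a], embeddings[word_b]))
            human_sims.append(rating)
    return spearman(model_sims, human_sims)


# Toy data, made up for illustration only.
toy_embeddings = {"cat": [1.0, 0.0], "dog": [0.9, 0.2], "car": [0.0, 1.0]}
toy_pairs = [("cat", "dog", 9.0), ("dog", "car", 3.0), ("cat", "car", 1.0)]

score = similarity_score(toy_embeddings, toy_pairs)
print(f"toy score: {score:.2f}")
```

A score near 1 means the model orders word pairs the same way humans do; a score near 0 means the ordering is essentially unrelated to human judgement.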
At @aza's suggestion, I added a way of visualizing how the embeddings for a pair of words relate to each other. Each point is the embedding of a single utterance, projected to 2d using umap. This can serve as a diagnostic tool allowing us to learn more about what our model is doing.
Below an example of the plots:
Hi @radekosmulski, sorry for being unresponsive for a while - got busy with work before the holidays and also just moved to a new place :) Is the bottleneck you mentioned in your previous comments still there? Have you moved to other tasks for now?
Hi @JoseMMontoro! Congrats on the new place! I am still exploring the librispeech dataset, but have to some extent given up on the speech2vec approach. Maybe that is not the right way to put it though: I am slowly putting it on the back burner and transitioning to other datasets. There are a couple of things I learned in the process. The most recent one, from literally this morning, is that when the speech2vec vocab was constructed, the words were not lemmatized or tokenized using something more modern like spacy. This is quite interesting and probably has significant implications for the performance of embeddings trained on audio vs text in this manner (in audio, "cat" and "cat's" sound similar, but if you train text embeddings without tokenization they are two distinct entities on the input side, though hopefully with similar embeddings after training). This also has other implications, as many words that appear in the test tasks are very infrequent in the dataset (the lack of tokenization only aggravates this problem).
But I could babble on and on. In summary, I am hoping to continue experiments with data I have greater control over. I am also hoping to start using transformers and training on continuous streams of text.
Apologies on my end as well for being so quiet here on github. Hoping to have more ready soon and will update here!
Per suggestion from @bs, I am adding a summary of experiments performed exploring why our model might not be learning semantic embeddings. The list is not exhaustive, but I will do my best to list the major experiments and to do a better job of checking them into the repository.
`s` at the end of the plural form of a noun can be identified.
We have two model implementations:
Unfortunately, neither model learns embeddings that would capture semantic similarities. The more advanced model learns embeddings that capture acoustic similarities (please see an example of this at the bottom of the notebook; for instance, here are the words that are mapped closest in cosine distance to the word 'slow': `['slow', 'low', 'hollow', 'follow', 'fellow']`).
I have performed quite extensive troubleshooting of the entire pipeline, especially for the advanced model. As a check, I validated that it can train on a simpler task, specifically as an AutoEncoder. With increased capacity it was able to bring the loss nearly to zero, which is a strong indication that the architecture is okay.
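The nearest-neighbour check mentioned above (the words closest in cosine distance to a query word) can be sketched as follows. This is a minimal version under stated assumptions: `nearest_words` is a hypothetical helper, and the 2-d vectors are toy values chosen so that 'low' sits next to 'slow', not embeddings from the model.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_words(query, embeddings, k=5):
    """Return the k words with the highest cosine similarity to `query`.

    The query word itself is included and comes first, since its similarity
    to itself is 1.
    """
    query_vec = embeddings[query]
    ranked = sorted(
        embeddings,
        key=lambda word: cosine_similarity(query_vec, embeddings[word]),
        reverse=True,
    )
    return ranked[:k]


# Toy embeddings, made up for illustration only.
toy_embeddings = {
    "slow": [1.0, 0.0],
    "low": [0.9, 0.1],
    "fast": [0.0, 1.0],
    "quick": [0.1, 0.9],
}

neighbours = nearest_words("slow", toy_embeddings, k=2)
print(neighbours)
```

With semantic embeddings one would hope to see neighbours related in meaning rather than words that merely rhyme with the query.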
I tried training with a small batch size of 32, SGD and lr of 1e-3 as suggested by the author of the Speech2vec paper. I attempted training with a large batch of 2048, Adam, various learning rates, balancing of epochs by source word, modifying the architecture (various ways of combining hidden states from the encoder to pass to the decoder, increasing the number of layers, introducing nonlinearities), all to no avail.
Taking all this in perspective, assuming the architecture piece is to some extent okay, according to the speech2vec paper the model should learn semantic embeddings as a function of the data that we show it. What could be going on then? There is some chance we are not training in the correct way, not leveraging teacher forcing as we should. But I am not convinced this is the case - no results of the experiments would suggest this.
Maybe our encoder is not able to discern between similar-sounding words? That could be the culprit. But it doesn't seem that increasing the capacity of the encoder makes a difference. Also, as stated earlier, the architecture demonstrably works, though on a slightly different task. Could it be that we have a problem with our data? Maybe we are not processing it in the correct way? Maybe the alignment information that we are using is not as good as it could be?
I would really appreciate any thoughts on this. Thank you very much for reading and for your consideration.