earthspecies / audio-embeddings


Getting started #1

Closed ShaunSpinelli closed 3 years ago

ShaunSpinelli commented 4 years ago

Hey, super keen to get involved with ESP and more specifically this project, but not exactly sure where to start. The projects board has ordered steps and challenges; are those currently being worked on, or is the project overview a better indication of current goals?

bs commented 3 years ago

Hey @ShaunSpinelli! Welcome! We just finished a big road-mapping session. @radekosmulski is working on this repo (unsupervised audio-audio) and @pcbermant is working on the cocktail party problem.

@radekosmulski, could you post an update as a new issue in this repo giving an overview of what we covered Tuesday?

(We'll be posting org wide overviews of what's actively being hacked on in the next weeks.)

radekosmulski commented 3 years ago

Hi @ShaunSpinelli! Great to see you here! 🙂 I'm sorry, I missed the original email notification on your comment posted here, hence the delay in response.

All the resources you pointed to contain useful information. The project overview is probably the best to work off in terms of an action plan.

For step 1, I worked on a bunch of things in the embedding-gym repository. I tried to gain a better understanding of how text embeddings are evaluated and to start hacking on some tools that could be useful to this end. We will want to evaluate our embeddings trained on audio in a similar fashion to embeddings trained on text. Since then I have found a couple of useful repositories that implement some of the more recent benchmarks, for instance the repo here or here. I think the first one might be more promising, specifically using the evaluate_on_all.py script from the scripts directory.
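For intuition, most word-similarity benchmarks in these repos boil down to the same recipe: compute cosine similarity for human-rated word pairs and report the Spearman correlation against the human scores. A minimal sketch of that recipe (the `evaluate_similarity` helper and the toy word pairs are illustrative, not the actual benchmark code):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_similarity(embeddings, pairs):
    """Spearman correlation between model similarities and human ratings.

    embeddings: dict mapping word -> np.ndarray
    pairs: list of (word1, word2, human_score) tuples
    Pairs containing out-of-vocabulary words are skipped.
    """
    model_scores, human_scores = [], []
    for w1, w2, score in pairs:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(score)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

# Toy example: rankings agree perfectly, so rho == 1.0
emb = {
    "king": np.array([1.0, 0.0]),
    "queen": np.array([0.9, 0.1]),
    "apple": np.array([0.0, 1.0]),
}
pairs = [("king", "queen", 9.0), ("king", "apple", 1.0), ("queen", "apple", 1.5)]
print(evaluate_similarity(emb, pairs))  # -> 1.0
```

The real benchmarks (MEN, WS-353, SimLex-999, etc.) are the same idea at scale, plus analogy tasks of the king - man + woman = queen variety.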

With regards to the action plan, I jumped to step #3 - I suspect step #2 can be omitted altogether (though at the same time implementing it could have some merit). I'm working on implementing the paper referenced in step #3, that is this one. I am in communication with the author - he has been super helpful, and there are some details that might not be apparent from the paper.

I am now working on getting the data into proper shape for training - I outline an initial attempt [here], and I am continuing the work with processing running as we speak, hoping to push an update maybe even today.

I would like to reproduce the results from the paper; specifically, I would also like to train text embeddings on the librispeech data and run them through embedding evaluation benchmarks. This can be useful in its own right and could also give us confidence that our implementation is aligned with the one from the paper.

There is an immediate task that I think would be very useful and could be worked on straight away, if it is of interest to you! Something along the lines of the following could work well, I think:

  1. Grab the embeddings trained for the paper, which the authors were super kind to make available here.
  2. Run the benchmarks and see if we receive comparable results. Which implementation is closest to the implementation used in the paper? Are there any other embedding benchmarking frameworks we could use?
  3. See if we can reproduce the results on text embeddings using the fasttext implementation (training on librispeech data ourselves). They used train-clean-100 + train-clean-360, but the challenge might be in how to feed the data to the software, pick the parameters, and get it all to work. This is not necessary to reproduce the main part of the work but would be nice to figure out. It could help immensely with experimenting on other datasets or creating our own one at some point.
  4. Document the results, ideally along with the commands that were run, either adding it to the repository as a jupyter notebook or as a markdown text.
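For step 1, I believe the released vectors follow the common plain-text word2vec layout (an optional "vocab_size dim" header line, then one `word v1 v2 ...` per line), though that's worth verifying against the actual files. A rough loader under that assumption:

```python
import numpy as np

def load_word2vec_text(path):
    """Load embeddings from the plain-text word2vec format.

    Assumes an optional "vocab_size dim" header line, followed by one
    "word v1 v2 ..." entry per line. Returns dict: word -> np.ndarray.
    """
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        first = f.readline().split()
        if len(first) == 2 and all(tok.isdigit() for tok in first):
            pass  # header line ("vocab_size dim"): skip it
        else:
            # no header: the first line is already a word vector
            embeddings[first[0]] = np.array(first[1:], dtype=np.float32)
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return embeddings
```

Once loaded into a plain word -> vector dict like this, the same benchmarking code can be pointed at both the audio-derived and the text-derived embeddings.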

I will be sharing my work regularly as well - I think once I figure out the processing of the data, one will be able to start training models on it and possibly experimenting with different architectures. I have a couple of ideas myself but still need to iron them out - happy to discuss if there is interest.

Apologies again for the delay in response. If you have any questions, please give me a shout here on GitHub - I will make sure to monitor notifications more closely!

Super happy to learn that you are finding this project exciting 🙂 I am also enjoying working on this very much, very much inspired by the approach the authors of Speech2Vec discovered. On one hand it strikes me as very elegant, and at the same time outperforming text embeddings is extremely impressive!

ShaunSpinelli commented 3 years ago

Thanks @bs and thanks @radekosmulski for the detailed reply and no worries about the delayed response :)

I just went over evaluating monolingual embeddings in the embeddings-gym to get a better understanding of how text embeddings are evaluated.

So my understanding is that we want to get some tools to evaluate our embeddings, then establish performance benchmarks on the speech2vec embeddings, which we can later use to evaluate our own embeddings?

I'll start working on steps 1 and 2: looking at word-embeddings-benchmarks, putting together some notebooks around evaluating embeddings, and investigating other possible embedding benchmarking frameworks we could use. Then I'll get some benchmarks up for the speech2vec embeddings.

Not sure about step 3, but I think I just need to get my head into the world of word embeddings again - it's been a while. I should be more helpful around training, experimenting, and other areas once I get a better understanding of the domain and put some time in.

radekosmulski commented 3 years ago

Really excited to be working on this together @ShaunSpinelli! 🙂

bs commented 3 years ago

@radekosmulski, given that you dove in at ground zero, what might be a good point for @ShaunSpinelli to pick up that work?

I imagine looking at more modern implementations of word embeddings in the human text-text realm?

radekosmulski commented 3 years ago

More elaborate text embeddings would be great, but I am thinking the best approach for now might be to evaluate libraries for benchmarking, see what results we get with the published audio embeddings, and attempt training our own embeddings using fasttext. (This might be interesting from the standpoint of reshaping the librispeech data to lend itself to training - I am not exactly sure what the authors of Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech did, so it would be great to have our own results as another point of reference.)
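On the data-reshaping point: LibriSpeech ships transcripts as `*.trans.txt` files where each line is an utterance id followed by the uppercase transcript. A rough sketch of flattening those into a single corpus file that fasttext's skipgram trainer could consume (the function name and the exact cleaning choices here are just a starting point, not what the paper's authors did):

```python
from pathlib import Path

def librispeech_to_corpus(librispeech_root, out_path):
    """Flatten LibriSpeech transcripts into one plain-text corpus.

    Each *.trans.txt line looks like "84-121123-0000 GO DO YOU HEAR":
    an utterance id followed by the uppercase transcript. We drop the
    id, lowercase the words, and write one utterance per line - a
    format fasttext's unsupervised trainer can read directly.
    """
    with open(out_path, "w", encoding="utf-8") as out:
        for trans in sorted(Path(librispeech_root).rglob("*.trans.txt")):
            for line in trans.read_text(encoding="utf-8").splitlines():
                _utt_id, _, text = line.partition(" ")
                if text:
                    out.write(text.lower() + "\n")
```

From there, something along the lines of `./fasttext skipgram -input corpus.txt -output vectors` should train; the dimension, window, and other hyperparameters the authors used would need to be dug out of the paper or confirmed with them.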

JoseMMontoro commented 3 years ago

Hi! I introduced myself here last week and I've been taking this time to go over everything that's going on at ESP - very exciting stuff!

I feel like this project is where I would, hopefully, best be able to bring some value :) I'm not a very experienced ML/DL practitioner by any means, but I feel like my background in linguistics + my work experience tangential to ASR and TTS systems + my personal interest in NLP would be best used here, if anywhere.

@ShaunSpinelli You're working on evaluating monolingual embeddings and on finding benchmarks to use for our models, right? Just to make sure I don't do work you've already done.

I'd actually be curious to try step 3 here. The way I understood it is: train embeddings using the fasttext implementation, as opposed to the skip-gram implementation the authors used here, on the librispeech data, and see how the results compare. Did I get it right @radekosmulski ? 😆 This sounds very interesting to me and I'd like to give it a try.

Let me know if that would work. Thanks!

radekosmulski commented 3 years ago

Hi @JoseMMontoro! Yes, I think you got that right! 🙂 With one tiny correction - skip-gram and cbow are two ways of posing the embedding problem, two methods of presenting the training examples along with the labels. Fasttext is one implementation of the algorithm, supporting both ways, I believe (another popular implementation is gensim, but the authors opted to use fasttext).
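To make the distinction concrete, here is roughly how the two formulations carve a token sequence into training examples (a toy sketch of the idea, not fasttext's actual internals):

```python
def skipgram_pairs(tokens, window=2):
    """Skip-gram: each (center, context) pair is one training example -
    the model predicts a context word from the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def cbow_examples(tokens, window=2):
    """CBOW: the whole context window is the input and the center word
    is the label - the model predicts the center from its context."""
    examples = []
    for i, center in enumerate(tokens):
        context = [tokens[j]
                   for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                   if j != i]
        examples.append((context, center))
    return examples

print(skipgram_pairs(["the", "cat", "sat"], window=1))
# -> [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
print(cbow_examples(["the", "cat", "sat"], window=1))
# -> [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
```

Either formulation can sit behind fasttext's subword machinery; Speech2Vec's clever move was to pose the same prediction problem over audio segments instead of text tokens.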

I am in complete agreement with everything that you wrote. Super excited to be working on this together! Please give me a shout if at any point you'd like to bounce ideas off one another or discuss some aspect of the project!

JoseMMontoro commented 3 years ago

Thanks @radekosmulski ! Sounds great, I'll get to it this week! I'll definitely be reaching out for more specific feedback. Thanks a lot for your guidance.

ShaunSpinelli commented 3 years ago

Hey @JoseMMontoro !

Just a status update, and sorry for the late response - just coming off a busy week and long weekend. I've done some initial benchmarks on the speech2vec embeddings, using @radekosmulski's initial eval in the 02_basic_model_with_teacher_forcing notebook and the web package. It's here but still a WIP.

I will have some time in the next few days, so I will work on the benchmarks a bit more and see if I can lend a hand in debugging #3 and help get it learning something useful :smile:

radekosmulski commented 3 years ago

Great to hear of the progress you're making on the evaluation @ShaunSpinelli! 🙂

I have been rereading the paper and I think I have found a few spots where my implementation differs from the work of the authors. Mainly, I think I am not creating training examples as they are being created in the paper. Working on correcting this now - will push updated notebooks as I have them ready. Hoping to have something ready within the next 12 - 24 hrs.

bs commented 3 years ago

@JoseMMontoro and @ShaunSpinelli, just a heads up that the latest work is detailed in #6. @radekosmulski is working with the paper author to smooth out some of the rough edges, if you want to take a look!