fixie-ai / ultravox

MIT License

model: Add initial `HubertWithKmeans` module and extract tool. #22

Closed shaper closed 1 week ago

shaper commented 2 weeks ago

For use in extracting discrete HuBERT tokens from audio files.

Adapted from https://github.com/lucidrains/audiolm-pytorch
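A minimal sketch of the k-means quantization step at the heart of a `HubertWithKmeans`-style module: continuous per-frame features (toy numpy vectors here, standing in for real HuBERT activations) are mapped to the index of their nearest cluster centroid, yielding discrete tokens. The function name, shapes, and centroids are illustrative assumptions, not the module's actual API.

```python
import numpy as np

def quantize_features(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each frame's continuous feature vector to the index of its
    nearest k-means centroid (squared Euclidean distance)."""
    # (frames, 1, dim) - (1, k, dim) -> (frames, k) pairwise distances
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=-1)

# Toy example: 4 frames of 2-d "features", 3 centroids.
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
features = np.array([[0.1, -0.1], [0.9, 1.2], [2.1, 1.8], [0.0, 0.2]])
tokens = quantize_features(features, centroids)
print(tokens.tolist())  # [0, 1, 2, 0]
```

In the real module the features would come from a pretrained HuBERT checkpoint and the centroids from a fitted k-means model; the nearest-centroid lookup itself is this simple.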

shaper commented 2 weeks ago

Hi there, welcome to the first PR and thank you for your consideration and steering as we explore together! A few notes as I begin to find my way:

juberti commented 2 weeks ago

The HF thing was the first thing that jumped out at me - I tend to think we'll be happier if we have a unified way of hosting and downloading models. If this turns into a bear we can of course revisit.

Minor nits:

farzadab commented 2 weeks ago

Actually, I know Justin said to split the PR up, but I'd really like to see the entire set of changes first. Even just the diff (would it be between your main and ours?) would be good enough.

shaper commented 2 weeks ago

The things that I'd like to know:

  1. What datasets were used, and with what prompts?
  2. How long did training take? It would be nice to see some training curves.
  3. What material changes to training were required, other than adding HuBERT tokens to the output?

Let's follow up soon; happy to have a discussion. We are not yet as far along as you describe above, though we aim to move very quickly -- we are actively working on proper training next. What we've done so far is roughly:

There are various directions we can take in editing or adding training runs/datasets, with the goal of teaching the model to speak HuBERT for responses while preserving existing functionality. I expect it could take time to find the one we like best, so we'll be playing around with things here quite a bit. We're optimistic that we can quickly identify some initial training run/dataset changes that are small in scope and support speech output, and we'd value your input and steering on the approach!

Similar to my other comment: with the above as the planned direction, having a model/tool in the repo to generate the tokens (and iteratively cleaning up and landing some of the bulleted work above, e.g. behind a flag or otherwise generalized/abstracted) seems useful to support a team effort on what's likely to be long-lived exploration.
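As a hypothetical illustration of the "speak HuBERT" direction described above, one option is to extend the text vocabulary with a placeholder token per k-means cluster, so the model can emit speech tokens as ordinary output tokens. The token naming scheme, cluster count, and toy vocabulary below are assumptions for the sketch, not the repo's actual design.

```python
# Hypothetical sketch: extend a text vocabulary with discrete HuBERT
# token placeholders so a language model can emit speech tokens.
NUM_HUBERT_CLUSTERS = 500  # assumption; a common k-means codebook size

def hubert_token_strings(num_clusters: int) -> list[str]:
    """One placeholder string per k-means cluster id."""
    return [f"<hubert_{i}>" for i in range(num_clusters)]

base_vocab = ["<s>", "</s>", "hello", "world"]  # toy stand-in for the real vocab
extended_vocab = base_vocab + hubert_token_strings(NUM_HUBERT_CLUSTERS)
print(len(extended_vocab))  # 504
```

With a real tokenizer this would correspond to adding special tokens and resizing the model's embedding table; the sketch only shows the vocabulary bookkeeping.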

Maybe another way to put it all is: we have several of the likely foundational pieces together or in mind (semantic speech tokens, voice model/vocoder), we're learning the Ultravox internals/tooling and starting to fit them together, and we've done work to de-risk/validate that the end-to-end flow is feasible and roughly operational. Another key upcoming work item re: foundational pieces is to also include our models for inference purposes.

I will put together a PR soon to sketch the done-so-far bullet points above, but I don't expect all of it is yet something we'd want to land. The case for this tool/model was clearer (though I understand your point re: using an existing HF model if it fits).

farzadab commented 2 weeks ago

Thanks for the explanation. I think I understand the disconnect here. I thought you had training working already (which was totally surprising to me).

Repeating to make sure I understand, what you did was some sort of "overfitting" experiment just to make sure it was possible to get the tokens out. Not really a full-fidelity trained model that could respond in voice as it saw fit.

I'm aligned with the direction now.

Production vs experimentation and how to handle them:

Until now I've generally tended to do highly experimental features in a branch and only push them to main after getting more confident, but that becomes less viable as more people want to collaborate on the same path.

In some projects we had an `experimental` or `projects` directory under main for each person/project idea. This directory had looser constraints (e.g. no/less testing, possibly different Python requirements). However, I don't think that's necessary for us yet.

I guess my rule of thumb is: highly experimental features related to training/model need to be proven at least somewhat first before being merged, but in your case you've already done the due diligence so we're good.

And as a rule, I'm open to experimental features as long as the existing paths still remain viable and not highly impacted.

juberti commented 2 weeks ago

Agree with @farzadab. This is definitely part of the roadmap so I don't think we need to worry too much about exactly what directory this lives in, but I do want to be cautious about taking on large dependencies and also align on core concepts like how we load models.

I definitely know that the PR process isn't always the easiest way to communicate, so don't hesitate to post/DM if we're giving unclear/conflicting advice.

shaper commented 2 weeks ago

Still working on this, and also getting familiar with poetry, lol! Removing fairseq would be further helpful, as it depends on hydra-core, which breaks testing for some reason. More tomorrow.

shaper commented 2 weeks ago

> Repeating to make sure I understand, what you did was some sort of "overfitting" experiment just to make sure it was possible to get the tokens out. Not really a full-fidelity trained model that could respond in voice as it saw fit.

Yes, that's right. Working on the trained model as well in parallel.

shaper commented 1 week ago

@sharvil is the contact point for this workstream; passing further work to him.