fixie-ai / ultravox

MIT License

model: Add initial `HubertWithKmeans` module and extract tool. #22

Closed shaper closed 1 week ago

shaper commented 2 weeks ago

For use in extracting discrete HuBERT tokens from audio files.

Adapted from https://github.com/lucidrains/audiolm-pytorch
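A minimal sketch of the k-means quantization step at the heart of a `HubertWithKmeans`-style module: continuous per-frame features (toy numpy vectors here, standing in for real HuBERT activations) are mapped to the index of their nearest cluster centroid, yielding discrete tokens. The function name, shapes, and centroids are illustrative assumptions, not the module's actual API.

```python
import numpy as np

def quantize_features(features: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Map each frame's continuous feature vector to the index of its
    nearest k-means centroid (squared Euclidean distance)."""
    # (frames, 1, dim) - (1, k, dim) -> (frames, k) pairwise distances
    dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=-1)

# Toy example: 4 frames of 2-d "features", 3 centroids.
centroids = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
features = np.array([[0.1, -0.1], [0.9, 1.2], [2.1, 1.8], [0.0, 0.2]])
tokens = quantize_features(features, centroids)
print(tokens.tolist())  # [0, 1, 2, 0]
```

In the real module the features would come from a pretrained HuBERT checkpoint and the centroids from a fitted k-means model; the nearest-centroid lookup itself is this simple.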

shaper commented 2 weeks ago

Hi there, welcome to the first PR and thank you for your consideration and steering as we explore together! A few notes as I begin to find my way:

juberti commented 2 weeks ago

The HF thing was the first thing that jumped out at me - I tend to think we'll be happier if we have a unified way of hosting and downloading models. If this turns into a bear we can of course revisit.

Minor nits:

farzadab commented 2 weeks ago

Actually, I know Justin said to split the PR up, but I'd really like to see the entire set of changes first. Even just the diff (would it be between your main and ours?) would be good enough.

shaper commented 2 weeks ago

The things that I'd like to know:

  1. What datasets were used, and with what prompts?
  2. How long did training take? It would be nice to see some training curves.
  3. What material changes to training were required, other than adding HuBERT tokens to the output?

Let's follow up soon; happy to have a discussion. We are not yet as far along as you describe above, though we aim to move very quickly -- we are actively working on proper training next. What we've done so far is roughly:

There are various directions we can take in editing or adding training runs/datasets, with the goal of teaching the model to speak HuBERT for responses while preserving existing functionality. I expect it could take time to find the one we like best, so we'll be playing around with things here quite a bit. We're optimistic that we can quickly identify some initial training run/dataset changes that are small in scope and support speech output, and we'd value your input and steering on the approach!

Similar to my other comment: with the above as the planned direction, having a model/tool in the repo to generate the tokens (and iteratively cleaning up and landing some of the bulleted work above, e.g. behind a flag or otherwise generalized/abstracted) seems useful to support a team effort on what's likely to be long-lived exploration.
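As a hypothetical illustration of the "speak HuBERT" direction described above, one option is to extend the text vocabulary with a placeholder token per k-means cluster, so the model can emit speech tokens as ordinary output tokens. The token naming scheme, cluster count, and toy vocabulary below are assumptions for the sketch, not the repo's actual design.

```python
# Hypothetical sketch: extend a text vocabulary with discrete HuBERT
# token placeholders so a language model can emit speech tokens.
NUM_HUBERT_CLUSTERS = 500  # assumption; a common k-means codebook size

def hubert_token_strings(num_clusters: int) -> list[str]:
    """One placeholder string per k-means cluster id."""
    return [f"<hubert_{i}>" for i in range(num_clusters)]

base_vocab = ["<s>", "</s>", "hello", "world"]  # toy stand-in for the real vocab
extended_vocab = base_vocab + hubert_token_strings(NUM_HUBERT_CLUSTERS)
print(len(extended_vocab))  # 504
```

With a real tokenizer this would correspond to adding special tokens and resizing the model's embedding table; the sketch only shows the vocabulary bookkeeping.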

Maybe another way to put it all is: we have several of the likely foundational pieces together or in mind (semantic speech tokens, voice model/vocoder), we're learning the Ultravox internals/tooling and starting to fit them together, and we've done work to de-risk/validate that the end-to-end flow is feasible and roughly operational. Another key upcoming work item re: foundational pieces is to also include our models for inference purposes.

I will put together a PR soon to sketch the done-so-far bullet points above, but I don't expect all of it is yet something we'd want to land. The case for this tool/model was clearer (though I understand your point re: using an existing HF model if it fits).

farzadab commented 2 weeks ago

Thanks for the explanation. I think I understand the disconnect here. I thought you had training working already (which was totally surprising to me).

Repeating to make sure I understand, what you did was some sort of "overfitting" experiment just to make sure it was possible to get the tokens out. Not really a full-fidelity trained model that could respond in voice as it saw fit.

I'm aligned with the direction now.

Production vs experimentation and how to handle them:

Until now I've generally tended to do highly experimental features in a branch and only push them to main after getting more confident, but that becomes less viable as more people want to collaborate on the same path.

In some projects we had an `experimental` or `projects` directory under main for each person/project idea. This directory had looser constraints (e.g. no/less testing, possibly different Python requirements). However, I don't think that's necessary for us yet.

I guess my rule of thumb is: highly experimental features related to training/model need to be proven at least somewhat first before being merged, but in your case you've already done the due diligence so we're good.

And as a rule, I'm open to experimental features as long as the existing paths still remain viable and not highly impacted.

juberti commented 2 weeks ago

Agree with @farzadab. This is definitely part of the roadmap so I don't think we need to worry too much about exactly what directory this lives in, but I do want to be cautious about taking on large dependencies and also align on core concepts like how we load models.

I definitely know that the PR process isn't always the easiest way to communicate, so don't hesitate to post/DM if we're giving unclear/conflicting advice.

shaper commented 2 weeks ago

Still working on this, and also getting familiar with poetry, lol! Removing fairseq would be further helpful, as it depends on hydra-core, which breaks testing for some reason. More tomorrow.

shaper commented 2 weeks ago

> Repeating to make sure I understand, what you did was some sort of "overfitting" experiment just to make sure it was possible to get the tokens out. Not really a full-fidelity trained model that could respond in voice as it saw fit.

Yes, that's right. Working on the trained model as well in parallel.

shaper commented 1 week ago

@sharvil is the contact point for this workstream; passing further work to him.