Closed: shaper closed this 1 week ago
Hi there, welcome to the first PR and thank you for your consideration and steering as we explore together! A few notes as I begin to find my way:
`hubert.py`:

- Contains the model. However, it felt bare without the tool to show usage. Open to breaking it apart if you'd prefer.
- The tool logs via `loguru` instead of `logging` (we've been using `loguru` for a while and much prefer it; by default, for example, it logs the source file/method and line number for each log statement). It's only in the tool, so you can try running it to see what it looks like (`loguru` is not yet included in `requirements.txt`, so you'll need to install it).
- New dependencies went into `requirements-dev.txt` for now. In other code we do allow CPU-based token extraction from dataset samples at runtime, which isn't really dev/tools at that point. I'm not sure we'd want to allow that, though, as it would slow the datasets and add complexity. In any case, perhaps these should just go in `requirements.txt`.
- I looked at unit-testing `HubertWithKmeans`, but, lol -- how best to test an `nn.Module`, one might ask. All the unit test really did was mock out `torch`, `fairseq`, etc. and let the ctor pass, which isn't particularly useful beyond validating the ctor signature. More useful would be a trivial integration test that runs the tool on a few files, or at least the model on a fixture audio file. There isn't infra for this today, and given it's experimental it felt like overkill, so I backed out the test work and figured we'd discuss. Similarly, the tool is pretty simple and I'm not sure it warrants tests yet. I'm open to adding whatever you'd like, and can help set up fixtures etc. as needed, now or later.
- I added a `models` dir at the repo root and to `.gitignore`, but I'm open to other approaches. I know one typically uses HF to download pretrained models and lets the HF cache handle it; however, I don't believe the exact files we want here are set up for that on HF today (which doesn't mean we couldn't put them there, so again, open to feedback).

The HF thing was the first thing that jumped out at me - I tend to think we'll be happier if we have a unified way of hosting and downloading models. If this turns into a bear we can of course revisit.
Minor nits:
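As a side note on the `loguru` point above: with stdlib `logging`, source file/function/line only appear if you configure the format explicitly, whereas `loguru` includes them in its default sink. A minimal stdlib sketch for comparison (illustrative only, not code from this PR):

```python
import logging

# With stdlib logging, the source location must be requested explicitly in
# the format string; loguru shows module:function:line out of the box.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(filename)s:%(funcName)s:%(lineno)d - %(message)s",
)

logging.getLogger(__name__).info("extracting HuBERT tokens")
```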
Actually, I know Justin said to separate out the PR, but I'd really like to see the entire set of changes first.
Just the diff (would it be between your `main` and ours?) could be good enough.
The things that I'd like to know:
- what datasets were used and with what prompts
- how long did training take? It would also be nice to see some training curves
- what material changes to training were required, other than the addition of HuBERT tokens for output
Let's follow up soon -- happy to have a discussion. We're not yet as far as you describe above, though we aim to move very quickly; we're actively working on proper training next. What we've done so far is roughly:
- an `audio` tag to populate it with said HuBERT tokens

There are various directions we can go in editing/adding training/datasets, with the goal of teaching the model to speak HuBERT for responses while preserving existing functionality. I expect it could take time to find the one we like best, so we'll be playing around with things here quite a bit. We're optimistic that we can quickly identify some initial training-run/dataset changes that are small in scope and support speech output, and we'd value your input and steering on approach!
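To make the idea above concrete, one hypothetical shape for a training target that interleaves text with HuBERT tokens. The tag name, pseudo-token rendering, and helper function are illustrative assumptions, not the PR's actual format:

```python
def render_target(text: str, hubert_tokens: list[int], audio_tag: str = "<audio>") -> str:
    """Illustrative only: append an audio tag, then one pseudo-token per HuBERT id.

    The idea is that the model learns to emit the tag followed by discrete token
    ids, which a vocoder-side decoder could later turn back into audio.
    """
    rendered = "".join(f"<hu_{t}>" for t in hubert_tokens)
    return f"{text} {audio_tag}{rendered}"

# e.g. "Sure, here you go. <audio><hu_12><hu_5><hu_5><hu_88>"
print(render_target("Sure, here you go.", [12, 5, 5, 88]))
```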
Similar to my other comment: with the above as the planned direction, having a model/tool in the repo to generate the tokens (while iteratively cleaning up and landing some of the bulleted work above, e.g. behind a flag or otherwise generalized/abstracted) seems useful to support a team effort on what's likely to be long-lived exploration.
Maybe another way to put it all is: we have several of the likely foundational pieces together or in mind (semantic speech tokens, voice model/vocoder), we're learning the Ultravox internals/tooling and starting to fit them together, and we've done work to de-risk/validate and make sure the end-to-end flow is feasible and roughly operational. Another key upcoming work item re: foundational pieces is to include our models for inference purposes.
I will put together a PR soon to sketch the above (done-so-far) bullet points, but I don't expect all of it is yet something we'd want to land. This tool/model's case was clearer (though I understand your point re: using an existing HF model if it fits).
Thanks for the explanation. I think I understand the disconnect here. I thought you had training working already (which was totally surprising to me).
Repeating to make sure I understand, what you did was some sort of "overfitting" experiment just to make sure it was possible to get the tokens out. Not really a full-fidelity trained model that could respond in voice as it saw fit.
I'm aligned with the direction now.
Until now I've generally tended to do highly experimental features in a branch and only push them to main after getting more confident, but that becomes less viable as more people want to collaborate on the same path.
In some projects we had an `experimental` or `projects` directory under `main` for each person/project idea. This directory had looser constraints (e.g. no/less testing, possibly different Python requirements). However, I don't think that's necessary for us yet.
I guess my rule of thumb is: highly experimental features related to training/model need to be proven at least somewhat first before being merged, but in your case you've already done the due diligence so we're good.
And as a rule, I'm open to experimental features as long as the existing paths still remain viable and not highly impacted.
Agree with @farzadab. This is definitely part of the roadmap so I don't think we need to worry too much about exactly what directory this lives in, but I do want to be cautious about taking on large dependencies and also align on core concepts like how we load models.
I definitely know that the PR process isn't always the easiest way to communicate, so don't hesitate to post/DM if we're giving unclear/conflicting advice.
Still working on this, also getting familiar with poetry, lol! Removing `fairseq` would be further helpful, as it depends on `hydra-core`, which breaks testing for some reason. More tomorrow.
> Repeating to make sure I understand, what you did was some sort of "overfitting" experiment just to make sure it was possible to get the tokens out. Not really a full-fidelity trained model that could respond in voice as it saw fit.
Yes, that's right. Working on the trained model as well in parallel.
@sharvil is the contact point for this workstream; passing further work to him.
For use in extracting discrete HuBERT tokens from audio files.
Adapted from https://github.com/lucidrains/audiolm-pytorch
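The core discretization the tool relies on -- mapping each continuous HuBERT feature frame to its nearest k-means centroid -- can be sketched in plain Python. This is a toy illustration with made-up helper names, not the actual `HubertWithKmeans` code, which runs the pretrained HuBERT encoder and k-means model:

```python
def quantize_frames(frames, centroids):
    """Map each feature frame (a list of floats) to the index of its nearest
    centroid under squared euclidean distance; those indices are the
    discrete tokens."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(centroids)), key=lambda k: sqdist(frame, centroids[k]))
        for frame in frames
    ]

# Two toy 2-D centroids; the three toy "frames" quantize to token ids 0, 1, 0.
print(quantize_frames([[1.0, 1.0], [9.0, 9.0], [0.0, 1.0]],
                      [[0.0, 0.0], [10.0, 10.0]]))
```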