fixie-ai / ultravox


Experiment with speech/text interleaving #10

Open · juberti opened this issue 3 weeks ago

juberti commented 3 weeks ago

Apply ASR (e.g., Whisper) to an existing speech dataset to establish word-level timings, and add said timings as an additional column for the dataset. With this new column, add an option to data/datasets.py to generate samples with interleaved text and speech (e.g., instead of "Transcribe <|audio|>", the input for the utterance "Rice is often served in round bowls" would be of the form "<|audio|> is often <|audio|> in round <|audio|>").
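Something along these lines could work for the timing step. This is just a sketch: the dataset name is a placeholder, and it assumes the openai-whisper package and a Hugging Face dataset whose `audio` column holds 16 kHz mono audio.

```python
import whisper
from datasets import load_dataset

model = whisper.load_model("small")

def add_word_timings(example):
    # word_timestamps=True makes Whisper emit per-word start/end times.
    result = model.transcribe(
        example["audio"]["array"].astype("float32"),
        word_timestamps=True,
    )
    words = [
        {"word": w["word"].strip(), "start": w["start"], "end": w["end"]}
        for seg in result["segments"]
        for w in seg["words"]
    ]
    return {"word_timings": words}

ds = load_dataset("my/speech-dataset", split="train")  # placeholder dataset
ds = ds.map(add_word_timings)
```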

satel33 commented 2 weeks ago

@juberti This is very interesting. I have been trying to establish word-level timings with the openai/whisper-small model so I can generate samples with interleaved text and speech.

So I am first getting word-level timings with the whisper-small model, then adding an extra column to hold them. That part is working; I now have a dataset with word-level timings.

But I am curious what algorithm would work for producing the interleaved text. I am thinking of computing attention weights for each word in the sentence, selecting random words with high attention scores, and replacing those with <|audio|> special tokens.

What do you think?

juberti commented 2 weeks ago

The goal here would just be to allow the model to see interleaved text during stage 1 training, which should help it learn text-audio invariance. So once we have the new column you mention, during training we could run for multiple epochs, and for each sample, randomly choose for each word in that sample whether to use the text or audio representation.
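A rough sketch of that per-word choice, to make the idea concrete. The <|audio|> token and the `word_timings` format follow the examples above; merging adjacent audio-chosen words into a single clip is an assumption on my part.

```python
import random

AUDIO_TOKEN = "<|audio|>"

def interleave(word_timings, p_audio=0.5, rng=random):
    """Randomly pick text vs. audio per word, merging runs of adjacent
    audio words so each run becomes one <|audio|> placeholder.

    Returns the interleaved prompt and the (start, end) spans to cut
    from the original waveform, one span per placeholder.
    """
    pieces, spans = [], []
    run = None  # (start, end) of an in-progress audio run
    for w in word_timings:
        if rng.random() < p_audio:
            if run is None:
                run = [w["start"], w["end"]]
                pieces.append(AUDIO_TOKEN)
            else:
                run[1] = w["end"]  # extend the current run
        else:
            if run is not None:
                spans.append(tuple(run))
                run = None
            pieces.append(w["word"])
    if run is not None:
        spans.append(tuple(run))
    return " ".join(pieces), spans

# e.g. -> ("<|audio|> is often <|audio|> in round <|audio|>",
#          [(0.0, 0.31), (0.55, 0.98), (1.40, 1.92)])  # illustrative timings
```

With this, data/datasets.py could re-sample the mask each epoch so the model sees a different text/audio mix for the same utterance.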

satel33 commented 2 weeks ago

@juberti Yeah, that makes sense, so we can randomly select words to replace with the special token.