juberti opened 3 weeks ago
@juberti This is very interesting. I have been trying to establish word-level timings using the openai/whisper-small
model in order to generate samples with interleaved text and speech.
So, I am first getting word-level timings with the whisper-small model, and then adding one more column to the dataset for them. I have this working and can successfully build a word-level timing dataset.
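For reference, here is a minimal sketch of the flattening step, assuming the output shape produced by openai-whisper's `model.transcribe(audio, word_timestamps=True)` (a dict with `segments`, each containing a `words` list of `{"word", "start", "end"}` entries); the nested result is collapsed into a flat list of `(word, start, end)` tuples suitable for a dataset column:

```python
def words_from_whisper_result(result):
    """Flatten a whisper transcribe() result (word_timestamps=True)
    into a flat list of (word, start, end) tuples for a dataset column."""
    return [
        (w["word"].strip(), w["start"], w["end"])
        for seg in result["segments"]
        for w in seg.get("words", [])
    ]

# Hand-written example in whisper's output shape (not real model output):
fake_result = {
    "segments": [
        {"words": [
            {"word": " Rice", "start": 0.0, "end": 0.4},
            {"word": " is", "start": 0.4, "end": 0.55},
        ]}
    ]
}
print(words_from_whisper_result(fake_result))
# → [('Rice', 0.0, 0.4), ('is', 0.4, 0.55)]
```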
But I am curious what algorithm works best for producing the interleaved text. I am thinking of computing attention weights for each word in the sentence, picking random words with high attention scores, and replacing those with <|audio|> special tokens.
What do you think?
The goal here would just be to allow the model to see interleaved text during stage 1 training, which should help it learn text-audio invariance. So once we have the new column you mention, during training we could run for multiple epochs, and for each sample, randomly choose for each word in that sample whether to use the text or audio representation.
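The per-sample random choice described above could be sketched like this (a toy illustration operating on word strings; `AUDIO_TOKEN` and the probability parameter are placeholders, and in the real dataset each substituted word would map to its audio slice rather than a literal token string):

```python
import random

AUDIO_TOKEN = "<|audio|>"  # placeholder for the special token in the discussion

def interleave(words, p_audio=0.5, rng=None):
    """For each word, independently choose the text form or the audio token."""
    rng = rng or random.Random()
    return " ".join(AUDIO_TOKEN if rng.random() < p_audio else w for w in words)

# With a fixed seed, so the same sample yields different mixes across epochs
# only when the rng state differs:
print(interleave("Rice is often served in round bowls".split(),
                 rng=random.Random(0)))
# → Rice is <|audio|> <|audio|> in <|audio|> bowls
```

Because the choice is re-drawn each epoch, the model sees many different text/audio mixes of the same utterance, which is what should encourage text-audio invariance.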
@juberti Yeah, makes sense, so we can randomly select words to replace with the special token.
Apply ASR (e.g., Whisper) to an existing speech dataset to establish word-level timings, and add said timings as an additional column for the dataset. With this new column, add an option to data/datasets.py to generate samples with interleaved text and speech (e.g., instead of "Transcribe <|audio|>", the input for the utterance "Rice is often served in round bowls" would be of the form "<|audio|> is often <|audio|> in round <|audio|>").
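A rough sketch of what that option in data/datasets.py might do, assuming the timings column holds `(word, start, end)` tuples (the function name and the index-set argument are hypothetical; it returns the interleaved prompt plus the time spans needed to cut the matching audio slices from the waveform):

```python
def interleave_with_timings(timed_words, audio_idx, token="<|audio|>"):
    """Replace the words at positions in audio_idx with the audio token,
    keeping the (start, end) span of each replaced word so the corresponding
    audio segments can be sliced out of the utterance."""
    text_parts, spans = [], []
    for i, (word, start, end) in enumerate(timed_words):
        if i in audio_idx:
            text_parts.append(token)
            spans.append((start, end))
        else:
            text_parts.append(word)
    return " ".join(text_parts), spans

# The example from above, with made-up timings:
timed = [("Rice", 0.0, 0.4), ("is", 0.4, 0.55), ("often", 0.55, 0.9),
         ("served", 0.9, 1.3), ("in", 1.3, 1.45), ("round", 1.45, 1.8),
         ("bowls", 1.8, 2.3)]
text, spans = interleave_with_timings(timed, audio_idx={0, 3, 6})
print(text)   # → <|audio|> is often <|audio|> in round <|audio|>
print(spans)  # → [(0.0, 0.4), (0.9, 1.3), (1.8, 2.3)]
```

During training, `audio_idx` would be drawn randomly per sample per epoch rather than fixed.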