MusicGen text encoder substitution

facebookresearch / audiocraft

Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable music generation LM with textual and melodic conditioning.

MIT License


Question

How can I use my own text encoder with the MusicGen model?

My Understanding of the MusicGen Model

In case my question is unclear because my understanding of the model is wrong, please correct me.

Feed the text into a T5 model and keep its hidden states.
Pass those hidden states as conditioning to a Transformer, which generates EnCodec tokens.
Feed the EnCodec tokens to the EnCodec decoder to get the music.
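
The three steps above can be sketched as a toy pipeline to make the data flow explicit. This is only an illustration: every function below is a hypothetical stand-in written in plain Python, not audiocraft's API, and the arithmetic inside each one is arbitrary. The only point it makes is the order of the stages and what each one hands to the next.

```python
from typing import List

Vector = List[float]


def toy_text_encoder(text: str, dim: int = 4) -> List[Vector]:
    """Hypothetical stand-in for T5: one feature vector per whitespace token."""
    return [[(len(word) + i) / 10.0 for i in range(dim)] for word in text.split()]


def toy_language_model(cond: List[Vector], steps: int, vocab: int = 16) -> List[int]:
    """Hypothetical stand-in for the Transformer: emits one token per step,
    conditioned on the text features and on the tokens generated so far."""
    summary = sum(sum(vec) for vec in cond)
    tokens: List[int] = []
    for _ in range(steps):
        prev = tokens[-1] if tokens else 0
        tokens.append(int(summary * 10 + prev * 3 + len(tokens)) % vocab)
    return tokens


def toy_decoder(tokens: List[int], vocab: int = 16) -> List[float]:
    """Hypothetical stand-in for the EnCodec decoder: tokens -> samples in [-1, 1]."""
    return [2.0 * t / (vocab - 1) - 1.0 for t in tokens]


features = toy_text_encoder("calm piano melody")   # step 1: text -> hidden states
tokens = toy_language_model(features, steps=8)      # step 2: hidden states -> tokens
audio = toy_decoder(tokens)                         # step 3: tokens -> waveform
```

In this picture, swapping the text encoder only changes step 1; steps 2 and 3 stay as they are, provided the new features look the same to the Transformer.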

What I want to do

Replace the T5 encoder with one I trained myself, assuming that my encoder produces the same features as T5 would for the same target music.
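
One way to state the "same features" assumption more precisely is as an output contract: for each prompt, the replacement encoder would need to return a sequence of vectors of the width the Transformer expects, padded to a common length across the batch, together with a mask marking which positions are real. The sketch below checks and builds that in plain Python. The function name `pad_and_mask` and the `dim` parameter are hypothetical, and the contract itself is an assumption about what the Transformer consumes, not something confirmed by the audiocraft code.

```python
from typing import List, Tuple

Vector = List[float]


def pad_and_mask(
    batch: List[List[Vector]], dim: int
) -> Tuple[List[List[Vector]], List[List[int]]]:
    """Pad variable-length feature sequences and build a 1/0 validity mask.

    `dim` is the feature width the downstream model is assumed to expect.
    """
    for seq in batch:
        for vec in seq:
            if len(vec) != dim:
                raise ValueError(f"expected width {dim}, got {len(vec)}")
    max_len = max(len(seq) for seq in batch)
    padded: List[List[Vector]] = []
    mask: List[List[int]] = []
    for seq in batch:
        pad = max_len - len(seq)
        # Padded positions are zero vectors and are marked 0 in the mask.
        padded.append(seq + [[0.0] * dim for _ in range(pad)])
        mask.append([1] * len(seq) + [0] * pad)
    return padded, mask


# Two prompts whose (hypothetical) custom-encoder outputs differ in length.
custom_features = [[[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]]
padded, mask = pad_and_mask(custom_features, dim=2)
```

If the custom encoder's width differs from what the Transformer was trained on, the check fails. In that case a projection layer, and probably some fine-tuning, would be needed before the features could stand in for T5's.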
