Xue9901 opened this issue 2 years ago
The transformer is trained a posteriori using a trained encoder / quantizer / decoder. The language model is not pretrained, it is only trained on the task of modeling the tokens from the underlying quantizer.
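In code, that a posteriori training would look roughly like the sketch below: the pre-trained encoder/quantizer is frozen and only produces integer code indices, and the language model is trained with a plain next-token cross-entropy on those indices. The names (`encode_to_codes`, `token_lm`) are illustrative, not the actual encodec API.

```python
import torch
import torch.nn.functional as F

# Minimal sketch, assuming a frozen pre-trained encoder/quantizer
# `encode_to_codes(x) -> LongTensor [B, K, T]` (K codebooks, T frames) and a
# causal transformer `token_lm(codes[..., :-1]) -> logits [B, K, T-1, cardinality]`.
# Names are hypothetical; this is not the repository's actual code.

def lm_step(token_lm, encode_to_codes, x, optimizer):
    with torch.no_grad():                       # encoder/quantizer stay frozen
        codes = encode_to_codes(x)              # [B, K, T] integer code indices
    inputs, targets = codes[..., :-1], codes[..., 1:]
    logits = token_lm(inputs)                   # [B, K, T-1, cardinality]
    # Cross-entropy over codebook entries, for every codebook and every frame:
    loss = F.cross_entropy(logits.flatten(0, 2), targets.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```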
Thanks for your reply, but I want to ask: what exactly is this transformer module used for, and how does it reduce bandwidth?
If I may answer your question: it comes from Shannon's source coding theorem. Given an alphabet of symbols, you can assign shorter codes to more likely symbols to reduce the overall length of the transmitted message, and thus the bandwidth. The transformer is used to dynamically estimate the probability of each code index, so that the way the indices are encoded can be adapted accordingly.
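As a toy illustration of that argument (not the actual entropy coder used here, which estimates the distribution per time step with the transformer and feeds it to an adaptive coder), compare a fixed-length code with an ideal entropy code driven by a probability model:

```python
import math
from collections import Counter

# Toy sketch: with a fixed-length code every index costs log2(cardinality) bits;
# with an ideal entropy code, index i costs about -log2(p_i) bits, so a model that
# concentrates probability on the indices that actually occur lowers the average.

def average_bits(indices, probs, cardinality):
    fixed = math.log2(cardinality)                           # bits/index, no model
    modeled = sum(-math.log2(probs[i]) for i in indices) / len(indices)
    return fixed, modeled

indices = [3, 3, 7, 3, 1, 3, 3, 7]                           # toy stream of code indices
probs = {i: c / len(indices) for i, c in Counter(indices).items()}
print(average_bits(indices, probs, cardinality=16))          # ~1.3 bits vs 4 bits
```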
Also, I have another question: is the transformer also trained in streaming mode (just like inference, with as many forward passes as the encoded sequence length), or in classic seq2seq mode (with a single forward pass)?
The transformer is not trained in streaming mode, although we use some tricks to make it more compatible with streaming, e.g. we use a random initial offset in the positional embedding and we limit the receptive field into the past. Then it is used in streaming mode, in particular in the decoder, as you must be able to decode the current token before you can make sense of the following bits.
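Roughly, those two tricks could be sketched as below (an illustrative sketch, not the repository's code): a causal attention mask that also bounds how far into the past each position may look, and a random offset added to the positions used by the positional embedding at training time.

```python
import torch

# Sketch of the streaming-friendly tricks; function names and max_period are
# assumptions for illustration only.

def limited_causal_mask(seq_len, past_context):
    pos = torch.arange(seq_len)
    delta = pos[:, None] - pos[None, :]            # how far the key is in the query's past
    # allow only 0 <= delta <= past_context: causal and bounded receptive field
    return (delta >= 0) & (delta <= past_context)

def positions_with_random_offset(seq_len, max_period=10_000):
    offset = torch.randint(0, max_period, (1,))    # random initial offset at train time
    return offset + torch.arange(seq_len)          # positions fed to the positional embedding

mask = limited_causal_mask(seq_len=8, past_context=3)   # True = attention allowed
```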
❓ Questions
Thanks for the great work and the shared code! I have some questions about the pre-trained transformer language model:
Could you explain in more detail the supervision used for training the transformer (shown as L_l in Fig. 1 of your paper)? My understanding is that you use a pre-trained language model and train some linear layers to model the distribution of codewords for each frame, but is there any other supervision for modeling the distribution, or is the transformer also jointly optimized with the whole encoder and decoder?
Looking forward to your reply!