AnyGPT is trained only with the next-token-prediction task.
Taking text-to-image as an example: is the training input a sequence of speech tokens, text tokens, image tokens, and music tokens?
I would like to know the input formats for training and inference.
training input: \<sos> speech tokens \<eos> text tokens \<soi> image tokens \<eoi> \<som> music tokens
training label: speech tokens \<eos> text tokens \<soi> image tokens \<eoi> \<som> music tokens \<eom>
Is my understanding of the training input and label correct?
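
To make the question concrete, here is a minimal Python sketch of the input/label shift I have in mind. The token strings (`sp_1`, `txt_1`, and so on) are hypothetical placeholders, not AnyGPT's actual vocabulary, and I am assuming the label is simply the full sequence shifted left by one position; the real data pipeline may differ.

```python
# Hypothetical placeholder tokens; the real AnyGPT tokenizers and vocabulary may differ.
sequence = [
    "<sos>", "sp_1", "sp_2", "<eos>",   # speech segment
    "txt_1", "txt_2",                   # text segment
    "<soi>", "img_1", "img_2", "<eoi>", # image segment
    "<som>", "mu_1", "mu_2", "<eom>",   # music segment
]


def shift_for_next_token_prediction(tokens):
    """Return (inputs, labels) so that labels[i] is the token following inputs[i]."""
    return tokens[:-1], tokens[1:]


train_input, train_label = shift_for_next_token_prediction(sequence)

print("input :", " ".join(train_input))
print("label :", " ".join(train_label))
```

Under this assumption the input drops the final \<eom> and the label drops the leading \<sos>, which matches the two lines I wrote above.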