Question: about how the data is fed in for Decoder only models

From what I understand, there seem to be 3 schools of thought on how to feed data to these models. In the first, you build one example per turn, with the context growing each time. I guess this is called the multi-turn way; I believe it is also called next sentence prediction?

x:  <start token> msg1        y: msg2 <end of sentence>
x1: <start token> msg1, msg2  y: msg3 <end of sentence>

... and so on. Then there is a minor variation: assuming the conversation is only 3 messages, you apply EOS only at the end of the conversation.

x:  <start token> msg1        y: msg2
x1: <start token> msg1, msg2  y: msg3 <end of sentence>

Then there is the teacher forcing way, again assuming 3 messages:

x: <start token> msg1, msg2  y: msg3 <end of sentence>

y: msg1, msg2, msg3 <end of sentence>

I am wondering what you do in ParlAI, and which one is better in your opinion? @stephenroller @klshuster

Thanks! From what I understand, the third one allows for longer generations but requires parsing the output, since the model can also end up generating your side of the conversation, while the first one only generates the next sentence.
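To make the three layouts above concrete, here is a rough sketch in plain Python. It is only my reading of the schemes, not how ParlAI does it: the function names are made up for illustration, each message is treated as a single unit instead of a token sequence, and for the third layout I am assuming the target is the whole conversation shifted by one position.

```python
from typing import List, Tuple

# Placeholder special tokens, written the same way as in the layouts above.
BOS = "<start token>"
EOS = "<end of sentence>"

Pair = Tuple[List[str], List[str]]  # (x, y)


def per_turn_pairs(msgs: List[str], eos_every_turn: bool = True) -> List[Pair]:
    """Layouts 1 and 2: one (x, y) example per turn, with a growing context.

    eos_every_turn=True  -> EOS closes every target (layout 1).
    eos_every_turn=False -> EOS closes only the final target (layout 2).
    """
    pairs: List[Pair] = []
    for i in range(1, len(msgs)):
        is_last = i == len(msgs) - 1
        target = [msgs[i]]
        if eos_every_turn or is_last:
            target.append(EOS)
        pairs.append(([BOS] + msgs[:i], target))
    return pairs


def shifted_pair(msgs: List[str]) -> Pair:
    """Layout 3 (assumed reading): the whole conversation is one sequence,
    and the target is the same sequence shifted left by one position."""
    seq = [BOS] + msgs + [EOS]
    return seq[:-1], seq[1:]


if __name__ == "__main__":
    conversation = ["msg1", "msg2", "msg3"]
    for x, y in per_turn_pairs(conversation):
        print("x:", x, " y:", y)
    print(shifted_pair(conversation))
```

In this sketch the first two layouts only ever ask the model to predict the next message, while the third one puts every message in the target, which is why its output would need to be split back into turns afterwards.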

facebookresearch / ParlAI

Question: about how the data is fed in for Decoder only models #4931