While going through the paper again, I got curious about teacher-forcing the ground truth (GT) outside the mask.
My understanding is that tokens for the parts that are not masked are generated as in the provided code. But what about the parts that are masked? Are they supposed to be initialized to 0? I'm not quite sure how the parts within the mask can be generated without class info.
Thanks in advance!