bptt concatenates different input sentences into a single training instance of length bptt,
so one input sentence can be split across two different training instances.
How does this affect training?
What should the monolingual and bilingual masked LM losses be once training has converged?
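To make the question concrete, here is a minimal sketch of how I understand the concatenation. It is not the repository's actual code: the function name `make_stream_chunks`, the separator id `EOS = 1`, and dropping the trailing partial window are all my own assumptions for illustration.

```python
from typing import List

EOS = 1  # hypothetical end-of-sentence token id (assumption)


def make_stream_chunks(sentences: List[List[int]], bptt: int) -> List[List[int]]:
    """Concatenate sentences into one token stream and cut it into bptt-sized windows."""
    stream: List[int] = []
    for sent in sentences:
        stream.extend(sent)
        stream.append(EOS)  # mark the sentence boundary inside the stream

    # Cut the stream into fixed-length windows; the trailing partial window
    # is dropped here (an assumption, it could also be padded).
    n_chunks = len(stream) // bptt
    return [stream[i * bptt:(i + 1) * bptt] for i in range(n_chunks)]


sentences = [[10, 11, 12], [20, 21, 22, 23, 24], [30, 31]]
chunks = make_stream_chunks(sentences, bptt=4)
for chunk in chunks:
    print(chunk)
```

With `bptt=4`, the second sentence `[20, 21, 22, 23, 24]` ends up in two different windows, which is the situation I am asking about.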
thanks