facebookresearch / XLM

Original PyTorch implementation of Cross-lingual Language Model Pretraining.

Language model training data #329

Open sbmaruf opened 3 years ago

sbmaruf commented 3 years ago

As far as I understand, the language model is trained on a stream of text. That means there is no grammatical boundary marking where a sentence starts and ends (i.e., full stop (.), exclamation mark (!)). I was wondering whether any noise is induced by this.

So my question is: if I train a language model with or without sentence boundaries (see the sketch after the list below for what I mean), should I expect to see any difference

  1. In downstream task adaptation
  2. In text generation
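
To make the two setups concrete, here is a minimal sketch of the data layouts I have in mind. This is not XLM's actual data loader (the real pipeline lives in the repo's data code); the token ids, `EOS`/`PAD` values, and block length are made up for illustration.

```python
import torch

# Hypothetical tokenized corpus: each inner list is one sentence's token ids.
sentences = [[5, 8, 3], [7, 2, 9, 4], [6, 1]]
EOS = 0      # assumed end-of-sentence token id
PAD = 1      # assumed padding token id
BPTT = 4     # fixed block length for stream training


def stream_batches(sentences, bptt=BPTT):
    """Concatenate everything into one token stream and cut fixed-size blocks.

    Sentence boundaries only survive as EOS tokens inside the stream; a block
    can start or end in the middle of a sentence.
    """
    stream = []
    for sent in sentences:
        stream.extend(sent + [EOS])
    stream = torch.tensor(stream, dtype=torch.long)
    n_blocks = len(stream) // bptt
    return stream[: n_blocks * bptt].view(n_blocks, bptt)


def sentence_batches(sentences):
    """Keep one sentence per row, padded to the longest sentence.

    Every training example starts at a real sentence start and ends at a
    real sentence end.
    """
    max_len = max(len(s) for s in sentences) + 1  # +1 for EOS
    batch = torch.full((len(sentences), max_len), PAD, dtype=torch.long)
    for i, sent in enumerate(sentences):
        batch[i, : len(sent) + 1] = torch.tensor(sent + [EOS])
    return batch


print(stream_batches(sentences))    # blocks may cross sentence boundaries
print(sentence_batches(sentences))  # each row is exactly one sentence
```

In the first layout the model regularly sees windows that begin or end mid-sentence; in the second it only ever sees complete sentences. My question is whether this difference matters in practice.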

@glample @aconneau