facebookresearch / MobileLLM

MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. In ICML 2024.

Explain data preparation strategies #10

Open Atharva-Phatak opened 1 month ago

Atharva-Phatak commented 1 month ago

I went through your codebase. Could you point me to how you prepared the data with respect to tokenization? Did you add a BOS token? Did you concatenate sequences together, separated by an EOS token?

An example would be amazing.
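
To make the question concrete, here is a minimal sketch of the kind of preprocessing being asked about: each tokenized document optionally gets a BOS token in front and an EOS token behind, and everything is concatenated and cut into fixed-length blocks. This is only an illustration of the question, not MobileLLM's actual pipeline. The function name `pack_sequences`, the toy token IDs, and the choice to drop the trailing partial block are all assumptions.

```python
from typing import Iterable, List, Optional


def pack_sequences(
    docs: Iterable[List[int]],
    block_size: int,
    eos_id: int,
    bos_id: Optional[int] = None,
) -> List[List[int]]:
    """Concatenate tokenized documents and split them into fixed-size blocks.

    Hypothetical helper for illustration only. Each document is followed by
    eos_id; if bos_id is given, it is also placed before each document.
    """
    stream: List[int] = []
    for ids in docs:
        if bos_id is not None:
            stream.append(bos_id)  # optional BOS before every document
        stream.extend(ids)
        stream.append(eos_id)      # EOS marks the document boundary

    # Keep only full blocks; the trailing partial block is dropped (assumption).
    n_full = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(n_full)]


# Toy token IDs standing in for real tokenizer output.
docs = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]
with_bos = pack_sequences(docs, block_size=4, eos_id=2, bos_id=1)
without_bos = pack_sequences(docs, block_size=4, eos_id=2)
```

With BOS enabled, a block can start in the middle of one document and end in the middle of the next, which is exactly why the BOS/EOS convention matters here.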