The LTI LangID corpus is a dataset for language identification. The most recent version, v5, contains training data for 1266 languages and at least some text (in some cases a very small amount) for 1706 languages in total. The corpus defines a train/test split and, for languages with enough data, a dev split as well.
Dataloader name: lti_langid_corpus/lti_langid_corpus.py
DataCatalogue: http://seacrowd.github.io/seacrowd-catalogue/card.html?lti_langid_corpus
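
Because the dev split exists only for languages with enough data, code that consumes the corpus may want to handle its absence explicitly. The following is a minimal sketch, not part of the dataloader: the function name `train_and_dev`, the dict-of-lists input shape, and the 10% fallback holdout are all hypothetical choices made for illustration.

```python
from typing import Dict, List, Tuple


def train_and_dev(
    splits: Dict[str, List[str]], holdout: float = 0.1
) -> Tuple[List[str], List[str]]:
    """Return (train, dev) lines for a single language.

    Uses the dev split when the language has one. Otherwise holds out the
    tail of train. The fallback and the 10% default are assumptions made
    for this sketch, not something the corpus itself defines.
    """
    train = list(splits["train"])
    dev = list(splits.get("dev", []))
    if dev:
        return train, dev

    # No dev split for this language: hold out the last `holdout` share of
    # train, but never take every training line.
    n_dev = min(int(len(train) * holdout), max(len(train) - 1, 0))
    if n_dev == 0:
        # Too little data to hold anything out.
        return train, []
    return train[:-n_dev], train[-n_dev:]
```

For a language with very little text the helper returns an empty dev list rather than shrinking an already small training set, so callers should check for that case before evaluating.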