Forgot to mention that the data there isn't prepared for plain import (we'd rather need to build our own template/model to import things like this directly), but for some of it the conversion can be done automatically, for example (a conversion sketch follows below):
Насколько процентов {word1} имеет отношение к {word2}? ("By how many percent is {word1} related to {word2}?")
or: Синонимы ли эти слова: {word1}, {word2} ("Are these words synonyms: {word1}, {word2}?")
Will work on it in this: https://github.com/LAION-AI/Open-Assistant/issues/3122
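A minimal sketch of what that automatic conversion could look like, assuming the relatedness data is available as (word1, word2, similarity) triples in a TSV file. The file name, the answer wording, and the 0.5 synonym threshold are my own assumptions for illustration, not part of the original data or any agreed pipeline:

```python
# Hedged sketch: filling the templates above from (word1, word2, similarity)
# triples. The TSV layout, the 0.5 threshold and the answer format are
# illustrative assumptions, not the actual import pipeline.
import csv

RELATEDNESS_TEMPLATE = "Насколько процентов {word1} имеет отношение к {word2}?"
SYNONYM_TEMPLATE = "Синонимы ли эти слова: {word1}, {word2}"

def triples_to_pairs(path):
    """Yield (prompt, answer) pairs from a TSV of word1, word2, similarity."""
    with open(path, encoding="utf-8") as f:
        for word1, word2, sim in csv.reader(f, delimiter="\t"):
            sim = float(sim)
            # Relatedness question, answered as a percentage.
            yield (RELATEDNESS_TEMPLATE.format(word1=word1, word2=word2),
                   f"{round(sim * 100)}%")
            # Synonym question, answered yes/no against an assumed threshold.
            yield (SYNONYM_TEMPLATE.format(word1=word1, word2=word2),
                   "Да" if sim >= 0.5 else "Нет")

if __name__ == "__main__":
    for prompt, answer in triples_to_pairs("hj.tsv"):  # hypothetical file name
        print(prompt, "->", answer)
```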
Here is a list of the datasets that were used to train YaLM 100B. They haven't shared the final dataset yet (and I doubt they will), but they list the ones they used on their GitHub page.
[ru] Omnia Russica
Omnia Russica combines major Russian corpus sources within one pipeline.
Can be found here: https://omnia-russica.github.io/
[ru] Human and Machine Judgements about Russian Semantic Relatedness
Contains several open language resources for semantic relatedness in Russian. It presents five semantic relatedness resources for Russian, each being a list of triples (word_i, word_j, similarity_ij). Four of them are designed for evaluation of semantic relatedness, each complementing the others in terms of relation type. These benchmarks were used in a shared task on Russian semantic similarity. One of the best systems was then used to generate the fifth resource, an open distributional thesaurus of Russian. Multiple evaluations of this thesaurus indicate its state-of-the-art quality.
HJ: Human Judgements of Word Pairs
RT: Synonyms and Hypernyms from the Thesaurus RuThes
AE: Cognitive Associations from the Sociation.org Experiment
MJ: Machine Judgements of Word Pairs from the RUSSE Shared Task
RDT: Russian Distributional Thesaurus
Can be found here: https://russe.nlpub.org/downloads/
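For reference, a rough sketch of how such a triple list is typically consumed, i.e. rank-correlating the benchmark similarity scores with a model's scores. The file name, delimiter, and scoring function are assumptions for illustration, not part of the RUSSE distribution:

```python
# Hedged sketch: evaluating a model against a (word_i, word_j, similarity_ij)
# benchmark via Spearman rank correlation. The file name, delimiter and the
# model_score callable are placeholders.
import csv
from scipy.stats import spearmanr

def load_triples(path, delimiter=","):
    """Read (word_i, word_j, similarity_ij) rows from a delimited text file."""
    with open(path, encoding="utf-8") as f:
        return [(w1, w2, float(sim))
                for w1, w2, sim in csv.reader(f, delimiter=delimiter)]

def evaluate(triples, model_score):
    """Spearman correlation between benchmark scores and model scores."""
    gold = [sim for _, _, sim in triples]
    pred = [model_score(w1, w2) for w1, w2, _ in triples]
    return spearmanr(gold, pred).correlation

# Usage with a hypothetical embedding model:
# triples = load_triples("hj.csv")
# print(evaluate(triples, lambda w1, w2: embeddings.similarity(w1, w2)))
```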
[en] The Pile
The Pile is an 825 GiB diverse, open-source language modelling dataset that consists of 22 smaller, high-quality datasets combined together.
Can be found here: https://pile.eleuther.ai/
Dataset used for the training of YaLM-100B is comprised of the following parts (rough percentages are measured in tokens seen by the model):
25% The Pile — open English dataset by Eleuther AI team
75% Texts in Russian collected by our team (percentages of the whole dataset are given)
49% Russian web pages from Yandex Search index filtered from ~100Tb to ~1Tb by the following heuristics (a rough sketch of two of these filters follows after this list):
LSH Deduplication — clusters of similar texts were truncated to just one text each
Length filtration — too short or too long texts, or texts with too few natural sentences, were discarded
Entropy filtration — texts with too high or too low entropy were discarded
Domain filtration — domains with repetitive texts (like online retail) were discarded
Classifier filtration — a dataset of good texts was collected, in a manner similar to WebText, from pages linked in Russian-language tweets that have at least one reply. A classifier was then trained to distinguish those good texts from random pages from the dataset. Texts from the original crawled dataset with low classifier scores were then discarded
12% News from various sources from Yandex Search index
10% Books from the dataset used in Russian Distributional Thesaurus
3% Misc texts from the Taiga Dataset
1.5% Dialogues from social media preprocessed in a manner similar to how Reddit is processed in The Pile
0.5% Russian portion of Wikipedia
Some subsets were traversed up to 3 times during the training.
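To make the filtration heuristics above more concrete, here is a rough sketch of two of them (length filtration and character-entropy filtration). The thresholds are invented for illustration; the actual YaLM pipeline values have not been published:

```python
# Hedged sketch of length and character-entropy filtering, two of the
# heuristics listed above. All thresholds are assumed values for illustration.
import math
from collections import Counter

def char_entropy(text):
    """Shannon entropy (bits) of the character distribution of the text."""
    counts = Counter(text)
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def keep_text(text,
              min_chars=200,        # assumed lower length bound
              max_chars=100_000,    # assumed upper length bound
              min_entropy=3.0,      # assumed entropy bounds
              max_entropy=6.0):
    """Apply length and entropy filters; returns True if the text is kept."""
    if not (min_chars <= len(text) <= max_chars):
        return False
    return min_entropy <= char_entropy(text) <= max_entropy

docs = ["аааааааааа" * 50, "Обычный связный текст достаточной длины. " * 10]
kept = [d for d in docs if keep_text(d)]  # only the second text survives
```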
Everything is open-source.