Forgot to mention that the data there isn't prepared for plain import (we'd rather need to build our own template/model to import things like this directly), but for some of it the conversion can be done automatically, for example (a conversion sketch follows below):
Насколько процентов {word1} имеет отношение к {word2}? ("By how many percent is {word1} related to {word2}?")
or: Синонимы ли эти слова: {word1}, {word2} ("Are these words synonyms: {word1}, {word2}?")
Will work on it in this: https://github.com/LAION-AI/Open-Assistant/issues/3122
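A minimal sketch of what that automatic conversion could look like, assuming the relatedness data is available as (word1, word2, similarity) triples in a TSV file. The file name, the answer wording, and the 0.5 synonym threshold are my own assumptions for illustration, not part of the original data or any agreed pipeline:

```python
# Hedged sketch: filling the templates above from (word1, word2, similarity)
# triples. The TSV layout, the 0.5 threshold and the answer format are
# illustrative assumptions, not the actual import pipeline.
import csv

RELATEDNESS_TEMPLATE = "Насколько процентов {word1} имеет отношение к {word2}?"
SYNONYM_TEMPLATE = "Синонимы ли эти слова: {word1}, {word2}"

def triples_to_pairs(path):
    """Yield (prompt, answer) pairs from a TSV of word1, word2, similarity."""
    with open(path, encoding="utf-8") as f:
        for word1, word2, sim in csv.reader(f, delimiter="\t"):
            sim = float(sim)
            # Relatedness question, answered as a percentage.
            yield (RELATEDNESS_TEMPLATE.format(word1=word1, word2=word2),
                   f"{round(sim * 100)}%")
            # Synonym question, answered yes/no against an assumed threshold.
            yield (SYNONYM_TEMPLATE.format(word1=word1, word2=word2),
                   "Да" if sim >= 0.5 else "Нет")

if __name__ == "__main__":
    for prompt, answer in triples_to_pairs("hj.tsv"):  # hypothetical file name
        print(prompt, "->", answer)
```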
Here is a list of the datasets that were used to train YaLM 100B. They haven't shared the final dataset yet (and I doubt they will), but they list the ones they used on their GitHub page.
[ru] Omnia Russica
Omnia Russica combines major Russian corpus sources within one pipeline.
Can be found here: https://omnia-russica.github.io/
[ru] Human and Machine Judgements about Russian Semantic Relatedness
Contains several open language resources for semantic relatedness in Russian. It presents five semantic relatedness resources for Russian, each being a list of triples (word_i, word_j, similarity_ij). Four of them are designed for evaluation of semantic relatedness, each complementing the others in terms of relation type. These benchmarks were used in a shared task on Russian semantic similarity. One of the best systems was then used to generate the fifth resource, an open distributional thesaurus of Russian. Multiple evaluations of this thesaurus indicate its state-of-the-art quality.
HJ: Human Judgements of Word Pairs
RT: Synonyms and Hypernyms from the Thesaurus RuThes
AE: Cognitive Associations from the Sociation.org Experiment
MJ: Machine Judgements of Word Pairs from the RUSSE Shared Task
RDT: Russian Distributional Thesaurus
Can be found here: https://russe.nlpub.org/downloads/
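For reference, a rough sketch of how such a triple list is typically consumed, i.e. rank-correlating the benchmark similarity scores with a model's scores. The file name, delimiter, and scoring function are assumptions for illustration, not part of the RUSSE distribution:

```python
# Hedged sketch: evaluating a model against a (word_i, word_j, similarity_ij)
# benchmark via Spearman rank correlation. The file name, delimiter and the
# model_score callable are placeholders.
import csv
from scipy.stats import spearmanr

def load_triples(path, delimiter=","):
    """Read (word_i, word_j, similarity_ij) rows from a delimited text file."""
    with open(path, encoding="utf-8") as f:
        return [(w1, w2, float(sim))
                for w1, w2, sim in csv.reader(f, delimiter=delimiter)]

def evaluate(triples, model_score):
    """Spearman correlation between benchmark scores and model scores."""
    gold = [sim for _, _, sim in triples]
    pred = [model_score(w1, w2) for w1, w2, _ in triples]
    return spearmanr(gold, pred).correlation

# Usage with a hypothetical embedding model:
# triples = load_triples("hj.csv")
# print(evaluate(triples, lambda w1, w2: embeddings.similarity(w1, w2)))
```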
[en] The Pile
The Pile is an 825 GiB diverse, open-source language modelling dataset that consists of 22 smaller, high-quality datasets combined together.
Can be found here: https://pile.eleuther.ai/
Dataset used for the training of YaLM-100B is comprised of the following parts (rough percentages are measured in tokens seen by the model):
25% The Pile — open English dataset by Eleuther AI team
75% Texts in Russian collected by our team (percentages of the whole dataset are given)
49% Russian web pages from Yandex Search index filtered from ~100Tb to ~1Tb by the following heuristics (a rough sketch of two of these filters follows after this list):
LSH Deduplication — clusters of similar texts were truncated to just one text each
Length filtration — too short or too long texts, or texts with too few natural sentences, were discarded
Entropy filtration — texts with too high or too low entropy were discarded
Domain filtration — domains with repetitive texts (like online retail) were discarded
Classifier filtration — a dataset of good texts was collected, in a manner similar to WebText, from pages linked in Russian-language tweets that have at least one reply. A classifier was then trained to distinguish those good texts from random pages from the dataset. Texts from the original crawled dataset with low classifier scores were then discarded
12% News from various sources from Yandex Search index
10% Books from the dataset used in Russian Distributional Thesaurus
3% Misc texts from the Taiga Dataset
1.5% Dialogues from social media preprocessed in a manner similar to how Reddit is processed in The Pile
0.5% Russian portion of Wikipedia
Some subsets were traversed up to 3 times during the training.
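To make the filtration heuristics above more concrete, here is a rough sketch of two of them (length filtration and character-entropy filtration). The thresholds are invented for illustration; the actual YaLM pipeline values have not been published:

```python
# Hedged sketch of length and character-entropy filtering, two of the
# heuristics listed above. All thresholds are assumed values for illustration.
import math
from collections import Counter

def char_entropy(text):
    """Shannon entropy (bits) of the character distribution of the text."""
    counts = Counter(text)
    total = len(text)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def keep_text(text,
              min_chars=200,        # assumed lower length bound
              max_chars=100_000,    # assumed upper length bound
              min_entropy=3.0,      # assumed entropy bounds
              max_entropy=6.0):
    """Apply length and entropy filters; returns True if the text is kept."""
    if not (min_chars <= len(text) <= max_chars):
        return False
    return min_entropy <= char_entropy(text) <= max_entropy

docs = ["аааааааааа" * 50, "Обычный связный текст достаточной длины. " * 10]
kept = [d for d in docs if keep_text(d)]  # only the second text survives
```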
Everything is open-source.