bigscience-workshop / data_tooling

Tools for managing datasets for governance and training.
Apache License 2.0

Create license-compliant version of the Pile: Stack Exchange #376

Closed by albertvillanova 2 years ago

albertvillanova commented 2 years ago

DONE: https://huggingface.co/datasets/bigscience-catalogue-lm-data/lm_en_the_pile_stack_exchange

Sample:


{'text': "Q:\n\nWhat is the h-index exactly and how does it work?\n\nWhat is the h-index, and how does it work ?\n\nA:\n\nThe h-index is a measure of the impact of someone's publication list. An h-index of 10 for example means that the person has published 10 papers with at least 10 citations. The total number of papers published may be higher, but only 10 will have 10 or more citations.\nCritics argue that this measure disadvantages young researchers who did not have time to publish a lot and whose work has not been published for long and thus may not have attracted many citations. Other criticisms include that it makes a researcher focus on how to increase the citation count for a paper that may be not that good but would increase the h-index.\nFor more explanation, see for example the Wikipedia article.",
 'meta': "{'file': 'academia.stackexchange_0000000005.txt'}"}
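
Note that in the sample the `meta` field is a string holding a Python-style dict, not a dict itself, and `text` puts the question and its answer in one string separated by `Q:` and `A:` markers. A minimal sketch of how such a record could be unpacked follows. The record is abbreviated from the sample, the helper names `parse_meta` and `split_question_answers` are hypothetical, and it assumes every record follows the layout of this one sample.

```python
import ast

# Record abbreviated from the sample above (answer text shortened for illustration).
record = {
    "text": (
        "Q:\n\nWhat is the h-index exactly and how does it work?\n\n"
        "What is the h-index, and how does it work ?\n\n"
        "A:\n\nThe h-index is a measure of the impact of someone's publication list."
    ),
    "meta": "{'file': 'academia.stackexchange_0000000005.txt'}",
}


def parse_meta(meta):
    """Hypothetical helper: turn the stringified dict in 'meta' into a real dict."""
    # literal_eval only evaluates Python literals, so it is safer than eval().
    return ast.literal_eval(meta)


def split_question_answers(text):
    """Hypothetical helper: split a record into its question and list of answers.

    Assumes the 'Q:\\n\\n...\\n\\nA:\\n\\n...' layout seen in the sample.
    """
    question, *answers = text.split("\n\nA:\n\n")
    if question.startswith("Q:\n\n"):
        question = question[len("Q:\n\n"):]
    return question, answers


meta = parse_meta(record["meta"])
# Assumption: the file name is '<site>_<number>.txt', as in the sample.
site = meta["file"].rsplit("_", 1)[0]
question, answers = split_question_answers(record["text"])

print(site)
print(question)
print(len(answers))
```

Splitting on the literal `A:` marker is fragile if an answer body happens to contain the same marker on its own line, so treat this as a starting point for inspection rather than a robust parser.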