GAIR-NLP / MathPile

[NeurIPS D&B 2024] Generative AI for Math: MathPile
https://gair-nlp.github.io/MathPile/
Apache License 2.0
corpus language-model large-language-models math pre-training

Generative AI for Math: MathPile

This is the official repository for Generative AI for Math: Part I - MathPile: A Billion-Token-Scale Pretraining Corpus for Math


Please note that the corpus may be updated (we will announce each new release). We recommend using the latest version.


## 🚀Introduction

High-quality, large-scale corpora are the cornerstone of building powerful foundation models. In this work, we introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens. Our work differs significantly from previous work in several respects.

We hope our MathPile can help to enhance the mathematical reasoning abilities of language models. See our paper for more technical details.

## 😋Limitations

- The decisions made during the data collection and processing phases might not always be optimal.
- Some documents in MathPile may not always be of the highest quality. We are committed to continually refining and optimizing this corpus.

## 👊Statements & License

- These invaluable corpora are the culmination of human intellect and should be utilized for the betterment of humanity, aiding in the improvement of human life. **We strongly urge all users to refrain from using our corpus for any activities that may harm national or social security or violate the law.**
- We have done our utmost to ensure the high quality and lawful use of the data. However, unforeseen issues may still arise, including but not limited to data security concerns and any risks or problems stemming from misuse. We shall not be held responsible for any such issues.

If the source data of MathPile is governed by a license more restrictive than [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en), MathPile adheres to that stricter licensing. In all other cases, it operates under the [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en) license. We also plan to release a commercially usable version of the dataset soon.

## 🌟Projects Using MathPile

Below are some projects that use MathPile, covering scenarios including but not limited to pre-training, data synthesis, and benchmarking:

- [Quality or Quantity? Comparing Domain-Adaptive Pre-training Approaches for Language Models with Mathematical Understanding](https://web.stanford.edu/class/cs224n/final-reports/256838758.pdf) [Stanford CS224N Custom Project]
- [JiuZhang3.0: Efficiently Improving Mathematical Reasoning by Training Small Data Synthesis Models](https://arxiv.org/abs/2405.14365)
- [Task Oriented In-Domain Data Augmentation](https://arxiv.org/abs/2406.16694)
- [Great Memory, Shallow Reasoning: Limits of $k$NN-LMs](https://arxiv.org/abs/2408.11815)
- [BAM! Just Like That: Simple and Efficient Parameter Upcycling for Mixture of Experts](https://arxiv.org/abs/2408.08274)
- ...

## 🥳Citation

If you find our work useful or use MathPile, please cite our paper:

```
@article{wang2023mathpile,
  title={Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math},
  author={Wang, Zengzhi and Xia, Rui and Liu, Pengfei},
  journal={arXiv preprint arXiv:2312.17120},
  year={2023}
}
```