This repo contains the code, data, and models for the NeurIPS 2024 paper "MAmmoTH2: Scaling Instructions from the Web". The paper proposes a new paradigm for scaling up high-quality instruction data from the web.
We propose discovering instruction data directly from the web. We argue that vast amounts of high-quality instruction data already exist in the web corpus, spanning domains such as math and science. Our three-step pipeline recalls relevant documents from Common Crawl, extracts question-answer pairs from them, and refines the pairs for quality. This approach yields 10 million instruction-response pairs, offering a scalable alternative to existing instruction datasets. We name the curated dataset WebInstruct.
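At a high level, the pipeline can be pictured as the sketch below. The function names and their placeholder bodies are purely illustrative assumptions for exposition, not the actual implementation used to build WebInstruct.

```python
# Illustrative sketch of the three-step WebInstruct pipeline.
# All helpers below are hypothetical placeholders, not the real code.

def recall_documents(shards):
    # Step 1: recall candidate documents (e.g., math/science pages) from Common Crawl shards.
    return [doc for shard in shards for doc in shard]

def extract_qa_pairs(document):
    # Step 2: extract naturally occurring question-answer pairs from the document text.
    # Placeholder: returns nothing.
    return []

def refine_pair(question, answer):
    # Step 3: refine the pair for quality (formatting, completeness, noise filtering).
    return {"instruction": question, "response": answer}

def build_webinstruct(shards):
    pairs = []
    for doc in recall_documents(shards):
        for q, a in extract_qa_pairs(doc):
            pairs.append(refine_pair(q, a))
    return pairs
```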
Part of our WebInstruct dataset has been released at 🤗 TIGER-Lab/WebInstructSub and 🤗 TIGER-Lab/WebInstructFull.
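For example, the released subset can be loaded with the Hugging Face `datasets` library. The `train` split name here is an assumption; please check the dataset card for the exact configuration.

```python
from datasets import load_dataset

# Load the released WebInstruct subset from the Hugging Face Hub.
# The "train" split name is an assumption; see the dataset card for details.
subset = load_dataset("TIGER-Lab/WebInstructSub", split="train")
print(subset[0])
```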
Please refer to https://tiger-ai-lab.github.io/MAmmoTH2/ for more details.
For evaluation, please refer to https://github.com/TIGER-AI-Lab/MAmmoTH2/tree/main/math_eval.
Please cite our paper if you use our data, models, or code. Please also kindly cite the original dataset papers.
@article{yue2024mammoth2,
  title={MAmmoTH2: Scaling Instructions from the Web},
  author={Yue, Xiang and Zheng, Tuney and Zhang, Ge and Chen, Wenhu},
  journal={Advances in Neural Information Processing Systems},
  year={2024}
}