TIGER-AI-Lab / MAmmoTH2

Official code for "MAmmoTH2: Scaling Instructions from the Web" [NeurIPS 2024]
https://tiger-ai-lab.github.io/MAmmoTH2/
MIT License

MAmmoTH2

This repo contains the code, data, and models for the NeurIPS 2024 paper "MAmmoTH2: Scaling Instructions from the Web". The paper proposes a new paradigm for scaling up high-quality instruction data from the web.

🔥 🔥 🔥 Check out our [Project Page](https://tiger-ai-lab.github.io/MAmmoTH2/) for more results and analysis! Also, our Demo is online!

WebInstruct

We propose discovering instruction data from the web. We argue that vast amounts of high-quality instruction data exist in the web corpus, spanning domains such as math and science. Our three-step pipeline recalls relevant documents from Common Crawl, extracts Q-A pairs from them, and refines the pairs for quality. This approach yields 10 million instruction-response pairs, offering a scalable alternative to existing datasets. We name the curated dataset WebInstruct.
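To make the shape of the three steps concrete, here is a toy sketch of a recall, extract, refine pipeline. It is not the implementation used for WebInstruct: the keyword filter, the `Q:`/`A:` regular expression, and the whitespace cleanup are simple stand-ins chosen for illustration, and every name in it (`QAPair`, `RECALL_KEYWORDS`, `run_pipeline`, and so on) is hypothetical.

```python
import re
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str


# Hypothetical keyword list standing in for a real document recall model.
RECALL_KEYWORDS = ("solve", "prove", "calculate", "equation")

# Hypothetical pattern: a "Q:" span followed by an "A:" span.
QA_PATTERN = re.compile(r"Q:\s*(.+?)\s*A:\s*(.+?)(?=\s*Q:|\Z)", re.DOTALL)


def recall(documents):
    """Step 1 (toy): keep documents that look instruction-like."""
    return [d for d in documents if any(k in d.lower() for k in RECALL_KEYWORDS)]


def extract(document):
    """Step 2 (toy): pull Q-A pairs out of one document."""
    return [QAPair(q, a) for q, a in QA_PATTERN.findall(document)]


def refine(pair):
    """Step 3 (toy): normalize whitespace and drop pairs with an empty side."""
    question = " ".join(pair.question.split())
    answer = " ".join(pair.answer.split())
    return QAPair(question, answer) if question and answer else None


def run_pipeline(documents):
    """Chain the three steps and return the surviving Q-A pairs."""
    pairs = []
    for doc in recall(documents):
        for raw in extract(doc):
            cleaned = refine(raw)
            if cleaned is not None:
                pairs.append(cleaned)
    return pairs
```

Calling `run_pipeline` on a list of raw page texts returns a list of `QAPair` objects. In a realistic setting each toy step would be replaced by a much stronger component, but the data flow between the stages stays the same.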

Part of our WebInstruct dataset has been released at 🤗 TIGER-Lab/WebInstructSub and 🤗 TIGER-Lab/WebInstructFull.
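One way to inspect the released data is through the Hugging Face `datasets` library. The sketch below streams a few records instead of downloading everything. It assumes `datasets` is installed and that the repositories expose a `train` split. The helper name `peek_webinstruct` and the `WEBINSTRUCT_REPOS` mapping are made up for this example, and the record fields are printed rather than assumed.

```python
WEBINSTRUCT_REPOS = {
    "sub": "TIGER-Lab/WebInstructSub",
    "full": "TIGER-Lab/WebInstructFull",
}


def peek_webinstruct(variant="sub", n=3, split="train"):
    """Stream the first n records without downloading the whole dataset."""
    # Imported lazily so this file can be imported without `datasets` installed.
    from datasets import load_dataset

    # Assumption: the repository exposes the requested split (default "train").
    stream = load_dataset(WEBINSTRUCT_REPOS[variant], split=split, streaming=True)
    return list(stream.take(n))


if __name__ == "__main__":
    for record in peek_webinstruct():
        # Print the field names instead of assuming a particular schema.
        print(sorted(record.keys()))
```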

Model Downloads

| **Model** | **Dataset** | **Init Model** | **Download** |
| :------------: | :------------: | :------------: | :------------: |
| MAmmoTH2-8x7B | WebInstruct | Mixtral-8x7B | [🤗 HuggingFace](https://huggingface.co/TIGER-Lab/MAmmoTH2-8x7B) |
| MAmmoTH2-7B | WebInstruct | Mistral-7B-v0.2 | [🤗 HuggingFace](https://huggingface.co/TIGER-Lab/MAmmoTH2-7B) |
| MAmmoTH2-8B | WebInstruct | Llama-3-base | [🤗 HuggingFace](https://huggingface.co/TIGER-Lab/MAmmoTH2-8B) |
| MAmmoTH2-8x7B-Plus | WebInstruct + OpenHermes2.5 + CodeFeedback + Math-Plus | MAmmoTH2-8x7B | [🤗 HuggingFace](https://huggingface.co/TIGER-Lab/MAmmoTH2-8x7B-Plus) |
| MAmmoTH2-7B-Plus | WebInstruct + OpenHermes2.5 + CodeFeedback + Math-Plus | MAmmoTH2-7B | [🤗 HuggingFace](https://huggingface.co/TIGER-Lab/MAmmoTH2-7B-Plus) |
| MAmmoTH2-8B-Plus | WebInstruct + OpenHermes2.5 + CodeFeedback + Math-Plus | MAmmoTH2-8B | [🤗 HuggingFace](https://huggingface.co/TIGER-Lab/MAmmoTH2-8-Plus) |
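The checkpoints can likely be loaded with the standard `transformers` auto classes. The following is a minimal sketch, not an official usage recipe. It assumes `transformers` and PyTorch are installed and that enough memory is available, and it feeds the prompt as plain text. If the model card recommends a specific prompt or chat template, that should take precedence. The `generate` helper and its defaults are illustrative names only.

```python
MODEL_ID = "TIGER-Lab/MAmmoTH2-7B-Plus"  # any repo id from the table above


def generate(prompt, model_id=MODEL_ID, max_new_tokens=256):
    """Load a checkpoint and return the model's continuation of the prompt."""
    # Imported lazily: these are heavy optional dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

    # Assumption: a plain-text prompt with no chat template applied.
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Drop the prompt tokens so only the newly generated text is returned.
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)


if __name__ == "__main__":
    print(generate("What is the derivative of x^2 + 3x?"))
```

The model is reloaded on every call to keep the sketch short. For repeated use, load the tokenizer and model once and reuse them.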

Evaluation Results

Please refer to https://tiger-ai-lab.github.io/MAmmoTH2/ for more details.

Evaluation Command

Please refer to https://github.com/TIGER-AI-Lab/MAmmoTH2/tree/main/math_eval.

Cite our paper

Please cite our paper if you use our data, model or code. Please also kindly cite the original dataset papers.

@article{yue2024mammoth2,
  title={MAmmoTH2: Scaling Instructions from the Web},
  author={Yue, Xiang and Zheng, Tuney and Zhang, Ge and Chen, Wenhu},
  journal={Advances in Neural Information Processing Systems},
  year={2024}
}