deepseek-ai / DeepSeek-Math

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
MIT License
783 stars, 46 forks

About raw common crawl data #12

Open jordane95 opened 6 months ago

jordane95 commented 6 months ago

Hi,

I'm trying to reproduce your paper. However, I find that many popular text extraction pipelines filter out a lot of math-related content. Which version of the Common Crawl data did you use to mine high-quality math content? Did you use a custom pipeline for web data processing, or something more specific? I cannot find any details about this in your paper.