About raw Common Crawl data

Hi,

I'm trying to reproduce your paper. However, I've found that many popular text extraction pipelines filter out a lot of math-related content. Which version of the Common Crawl data did you use to mine high-quality math content? Did you use a custom pipeline for web data processing, or something more specific? I couldn't find any details about this in your paper.

deepseek-ai / DeepSeek-Math
