Hi,
I'm trying to reproduce your paper. However, I find that many popular text extraction pipelines filter out much of the math-related content. Which version of the Common Crawl data did you use to mine high-quality math content? Did you use a custom pipeline for web data processing, or something more specific? I cannot find any details about this in your paper.