deepseek-ai / DeepSeek-Math

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
MIT License
821 stars 51 forks

Question about the way to extract text from CC HTML #18

Open voladorlu opened 6 months ago

voladorlu commented 6 months ago

Hi @DeepSeekPH, thanks so much for sharing such excellent work. I noticed that OpenWebMath uses a specialized pipeline to extract content from HTML instead of directly using the WET files from Common Crawl. How did you handle this problem? Did you also follow OpenWebMath and process the HTML with a private pipeline? I look forward to your feedback.
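To make the question concrete, here is a minimal sketch of the kind of HTML-level extraction I have in mind. It is not the OpenWebMath pipeline and not a claim about what DeepSeekMath does. It only assumes that some pages store TeX in MathJax-style `<script type="math/tex">` tags, which a plain-text dump typically loses. The names `MathAwareTextExtractor` and `extract_text` are hypothetical, and only the Python standard library is used.

```python
from html.parser import HTMLParser


class MathAwareTextExtractor(HTMLParser):
    """Collect visible text, keeping TeX found in <script type="math/tex"> tags.

    Illustrative only: real pages encode math in many other ways as well.
    """

    SKIP_TAGS = {"style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0
        # None when outside <script>; otherwise "inline", "display" or "other".
        self._script_kind = None

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            type_ = (dict(attrs).get("type") or "").lower()
            if type_.startswith("math/tex"):
                is_display = "mode=display" in type_
                self._script_kind = "display" if is_display else "inline"
            else:
                self._script_kind = "other"  # ordinary JavaScript: drop it
        elif tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "script":
            self._script_kind = None
        elif tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth or self._script_kind == "other":
            return
        text = data.strip()
        if not text:
            return
        if self._script_kind == "inline":
            text = f"${text}$"
        elif self._script_kind == "display":
            text = f"$${text}$$"
        self.parts.append(text)


def extract_text(html: str) -> str:
    """Return the page text with TeX from math/tex scripts wrapped in $ delimiters."""
    parser = MathAwareTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


if __name__ == "__main__":
    sample = (
        "<html><head><style>p{color:red}</style>"
        "<script>var a = 1;</script></head><body>"
        '<p>The identity <script type="math/tex">a^2+b^2=c^2</script> holds.</p>'
        '<script type="math/tex; mode=display">E=mc^2</script>'
        "</body></html>"
    )
    print(extract_text(sample))
```

A generic text extractor would usually discard every `<script>` element, so the formulas in the sample page would disappear. That loss is the reason I am asking whether you start from the raw HTML rather than from the WET text.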