ax7e closed this issue 10 months ago
Thank you for your appreciation of our work.
SkyPile combines data from multiple sources. We have run both inter-source and intra-source deduplication multiple times. Because data from different sources have different deduplication ratios (for example, almost no duplication among books but very high duplication among webpages), it is hard to estimate an "overall" deduplication ratio for the entire pipeline.
We have removed more than 90% of the webpage data due to low quality or duplication.
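As an illustration only, the following sketch shows why a single ratio is hard to state when sources differ. It is not the SkyPile pipeline: the function names, the toy data, and the choice of exact-match hashing on normalized text are all assumptions made for this example (real pipelines typically also use near-duplicate detection and quality filtering).

```python
import hashlib


def doc_key(text):
    """Hash of lowercased, whitespace-normalized text (exact-match dedup only)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_sources(sources):
    """Deduplicate within each source and against earlier sources.

    sources: dict mapping a (hypothetical) source name to a list of documents.
    Returns (kept, stats), where stats holds per-source duplicate counts.
    """
    seen = set()  # keys from all previously processed sources
    kept, stats = {}, {}
    for name, docs in sources.items():
        local = set()  # keys seen so far within this source
        intra = inter = 0
        kept_docs = []
        for doc in docs:
            key = doc_key(doc)
            if key in local:
                intra += 1  # duplicate inside the same source
            elif key in seen:
                inter += 1  # duplicate of a document from an earlier source
            else:
                kept_docs.append(doc)
            local.add(key)
        seen |= local
        kept[name] = kept_docs
        stats[name] = {
            "total": len(docs),
            "intra": intra,
            "inter": inter,
            "removed_ratio": (intra + inter) / len(docs) if docs else 0.0,
        }
    return kept, stats


# Toy data, invented for illustration.
sources = {
    "books": ["Book A text", "Book B text", "Book C text", "Book D text"],
    "web": [
        "Hello world", "hello   world", "Hello world", "Book A text",
        "News item", "News item", "Unique page", "hello world",
    ],
}

kept, stats = dedup_sources(sources)
total = sum(s["total"] for s in stats.values())
removed = sum(s["intra"] + s["inter"] for s in stats.values())
overall_ratio = removed / total

for name, s in stats.items():
    print(name, s)
print("overall removed ratio:", round(overall_ratio, 3))
```

In this toy run the books source loses nothing while the web source loses 5 of 8 documents, so the "overall" figure (5 of 12) mostly reflects how much web data was in the mix. It also depends on the order in which sources are processed, since an inter-source duplicate is charged to whichever source comes later.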
Thank you for your dedication to transparency and integrity in your work. I am particularly interested in the deduplication rate observed across your corpus-processing pipeline.