DKPro C4CorpusTools is a collection of tools for processing the Common Crawl corpus, including Creative Commons license detection, boilerplate removal, language detection, and near-duplicate removal.
As announced on the Common Crawl (CC) mailing list, Common Crawl is moving within AWS:
For users of the data, this means that the path to access any data in the corpus, from https or S3, is modified because the data has been moved to a new bucket (location) on AWS S3. Going forward, all Common Crawl data is accessible below https://commoncrawl.s3.amazonaws.com/ or s3://commoncrawl/.
For the next few weeks, the entire corpus will be available at both the old and new locations. During this time, all links on the Common Crawl website that point to datasets in the corpus will be updated to point to the new location.
This group will receive a reminder of this change and notification when the paths to the previous location are no longer active.
The first new dataset shared at the new location is the April crawl (s3://commoncrawl/crawl-data/CC-MAIN-2016-18/). Detail on the crawl archive of April 2016 is posted here on the Common Crawl blog. (Please note that the April crawl is not available at the old location.)
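The two access paths named in the announcement differ only in their prefix, so an S3 URI under the new bucket can be turned into its HTTPS equivalent by swapping one prefix for the other. The following is a minimal sketch in Python, not part of C4CorpusTools; the function names `s3_to_https` and `crawl_prefix` are hypothetical helpers introduced only for illustration, and the sketch assumes object keys are appended unchanged to either prefix.

```python
# Prefixes taken from the announcement above.
S3_PREFIX = "s3://commoncrawl/"
HTTPS_PREFIX = "https://commoncrawl.s3.amazonaws.com/"


def s3_to_https(s3_uri: str) -> str:
    """Map an s3://commoncrawl/... URI to the matching HTTPS URL.

    Hypothetical helper: assumes the object key is identical under
    both access paths.
    """
    if not s3_uri.startswith(S3_PREFIX):
        raise ValueError(f"not a URI in the new bucket: {s3_uri}")
    return HTTPS_PREFIX + s3_uri[len(S3_PREFIX):]


def crawl_prefix(crawl_id: str) -> str:
    """Build the S3 prefix of one crawl, e.g. for 'CC-MAIN-2016-18'."""
    return f"{S3_PREFIX}crawl-data/{crawl_id}/"


if __name__ == "__main__":
    april = crawl_prefix("CC-MAIN-2016-18")
    print(april)
    print(s3_to_https(april))
```

Rejecting URIs outside the new bucket keeps the helper from silently rewriting paths that still point at the previous location, which the announcement says will stop working.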