imfht closed this issue 6 years ago
Yes and no. The data can be fetched over HTTPS: for example, s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-30/subset=warc/part-00279-c947edd5-0324-4dae-9b8a-fb841dbf6a1a.c000.gz.parquet becomes https://commoncrawl.s3.amazonaws.com/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-30/subset=warc/part-00279-c947edd5-0324-4dae-9b8a-fb841dbf6a1a.c000.gz.parquet when the s3://commoncrawl/ prefix is replaced with https://commoncrawl.s3.amazonaws.com/. However, the file names are not easy to predict, so you need the AWS Command Line Interface to get the listings, cf. commoncrawl/news-crawl#17:
aws --no-sign-request s3 ls --recursive s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-30/
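The URL rewrite described above can be sketched in a few lines of Python. This is only an illustration, not part of any Common Crawl tooling: the helper names `s3_to_https` and `listing_line_to_https` are hypothetical, and the second helper assumes the usual `aws s3 ls --recursive` output layout of date, time, size, and key separated by whitespace.

```python
from urllib.parse import urlsplit


def s3_to_https(s3_url: str) -> str:
    """Rewrite s3://<bucket>/<key> to https://<bucket>.s3.amazonaws.com/<key>."""
    parts = urlsplit(s3_url)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    # parts.netloc is the bucket name, parts.path is "/" plus the object key.
    return f"https://{parts.netloc}.s3.amazonaws.com{parts.path}"


def listing_line_to_https(line: str, bucket: str = "commoncrawl") -> str:
    """Turn one line of `aws s3 ls --recursive` output into an HTTPS URL.

    Assumes the line has the form "<date> <time> <size> <key>".
    """
    fields = line.strip().split(None, 3)
    if len(fields) != 4:
        raise ValueError(f"unexpected listing line: {line!r}")
    key = fields[3]
    return s3_to_https(f"s3://{bucket}/{key}")
```

One possible use is to save the output of the `aws` command above to a file and pass each line through `listing_line_to_https` to get a list of URLs that can be downloaded with any HTTP client.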
But thanks for the hint, we'll provide listings for the Parquet index as well in future crawls.
Please consider asking further questions in the Common Crawl group. Thanks, @imfht!
Thanks so much~ @sebastian-nagel