commoncrawl / cc-index-table

Index Common Crawl archives in tabular format
Apache License 2.0

Can I get the index table data from https:// rather than s3://? #1

Closed imfht closed 6 years ago

sebastian-nagel commented 6 years ago

Yes and no. The data can be fetched via https://. For example, s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-30/subset=warc/part-00279-c947edd5-0324-4dae-9b8a-fb841dbf6a1a.c000.gz.parquet becomes https://commoncrawl.s3.amazonaws.com/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-30/subset=warc/part-00279-c947edd5-0324-4dae-9b8a-fb841dbf6a1a.c000.gz.parquet once the s3://commoncrawl/ prefix is replaced with https://commoncrawl.s3.amazonaws.com/. However, the file names are not easy to predict, so you need the AWS Command Line Interface to get the listings (cf. commoncrawl/news-crawl#17):

aws --no-sign-request s3 ls --recursive s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2018-30/

But thanks for the hint, we'll provide listings for the Parquet index as well in future crawls.
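
For readers who want to script the URL rewrite described above, here is a minimal Python sketch. The function names `s3_to_https` and `keys_from_listing` are hypothetical helpers, not part of any Common Crawl tooling. The host suffix is taken from the example URL in this comment, and the listing parser assumes the usual `aws s3 ls --recursive` line layout of date, time, size, and key.

```python
from urllib.parse import urlsplit


def s3_to_https(s3_url, host_suffix="s3.amazonaws.com"):
    """Rewrite s3://<bucket>/<key> as https://<bucket>.<host_suffix>/<key>.

    The default host suffix follows the example URL in this thread;
    it is a parameter because the endpoint is an assumption.
    """
    parts = urlsplit(s3_url)
    if parts.scheme != "s3" or not parts.netloc:
        raise ValueError(f"not an s3:// URL: {s3_url!r}")
    bucket = parts.netloc
    key = parts.path.lstrip("/")
    return f"https://{bucket}.{host_suffix}/{key}"


def keys_from_listing(listing_text):
    """Extract object keys from `aws s3 ls --recursive` output.

    Assumes each object line has four whitespace-separated fields:
    date, time, size in bytes, and the key. Other lines are skipped.
    """
    keys = []
    for line in listing_text.splitlines():
        fields = line.split(None, 3)
        if len(fields) == 4:
            keys.append(fields[3])
    return keys
```

Saving the output of the `aws` command above to a file and passing each key through `s3_to_https("s3://commoncrawl/" + key)` should yield one https:// URL per Parquet file, under the assumptions stated.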

Please consider asking further questions in the Common Crawl group. Thanks, @imfht!

imfht commented 6 years ago

Thanks so much~ @sebastian-nagel