commoncrawl / cc-index-table

Index Common Crawl archives in tabular format
Apache License 2.0

How to use AWS Athena to query CC-NEWS data? #24

Open vansenic opened 1 year ago

vansenic commented 1 year ago

Overview:

I want to query the CC-NEWS dataset, but according to this post: https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/, all of the indexed data is in s3://commoncrawl/cc-index/table/cc-main/warc/.

My Question:

How can I use AWS Athena to query CC-NEWS data?

Or, failing that, how can I distinguish news pages within s3://commoncrawl/cc-index/table/cc-main/warc/?

sebastian-nagel commented 1 year ago

Unfortunately, there is no index for the news dataset yet.