I want to query the CC-NEWS dataset, but according to this post: https://commoncrawl.org/2018/03/index-to-warc-files-and-urls-in-columnar-format/, all of the indexed data is in s3://commoncrawl/cc-index/table/cc-main/warc/.
My question:
How can I use AWS Athena to query CC-NEWS data?
Or, how can I tell news records apart from the rest of the data in s3://commoncrawl/cc-index/table/cc-main/warc/?