The aim is to create a filtered version of the CCIndex:
1) To minimize processing time by not having to query the full index every time the application is run
2) To speed up the process through which all teams arrive at their first MVP (by querying only ONE crawl)
3) To gauge the feasibility of achieving the stretch goal of working with all crawls from 2018 to the present by the project due date
CSVs written here: s3://maria-1086/Devin-Testing/outputs/test-write/
The code took 1 hour and 1 minute to run on EMR, which is a good indication that our stretch goal will be feasible.
The idea is to filter by these criteria:
This filtering is applied to the index before any individual WARC records are processed, since processing WARC records is significantly more expensive.
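A minimal sketch of the filter-the-index-first idea, in plain Python rather than the EMR job itself. It assumes each index row is a dict with a `url` field. The function name `filter_index_rows`, the sample rows, and the placeholder keywords are hypothetical and only for illustration; they are not the project's actual criteria.

```python
from typing import Iterable, Iterator

# Hypothetical placeholder keywords; substitute the project's real list.
EXAMPLE_KEYWORDS = ("keyword-a", "keyword-b")


def filter_index_rows(
    rows: Iterable[dict],
    keywords: Iterable[str],
    url_field: str = "url",
) -> Iterator[dict]:
    """Yield only the index rows whose URL contains at least one keyword.

    Matching is a case-insensitive substring test on the URL, so it is
    cheap and never touches the WARC files themselves.
    """
    lowered = [k.lower() for k in keywords]
    for row in rows:
        url = str(row.get(url_field, "")).lower()
        if any(k in url for k in lowered):
            yield row


# Illustrative rows only (assumed shape, not real index data).
sample_rows = [
    {"url": "https://example.com/jobs/Keyword-A-role", "warc_filename": "a.warc.gz"},
    {"url": "https://example.com/about", "warc_filename": "b.warc.gz"},
    {"url": "https://example.org/keyword-b/listing", "warc_filename": "c.warc.gz"},
]

kept = list(filter_index_rows(sample_rows, EXAMPLE_KEYWORDS))
```

Only the rows in `kept` would then need their WARC records fetched, which is where the time saving is expected to come from.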
Good tech keywords to filter for include: