Create first filtered index for use in MVP

vinceecws commented 3 years ago

The aim is to create a filtered version of the CCIndex in order to:

1) Minimize processing time by not having to query the index every time the application is run
2) Speed up the process through which all teams arrive at their first MVP (query ONE crawl)
3) Gauge the feasibility of achieving the stretch goal of working with all crawls from 2018 to the present by the project due date

The idea is to filter by these criteria:

crawl = CC-MAIN-2021-10
subset = warc
fetch_status = 200
url_path = (career[s], job[s]) && (tech keywords...)
top_level_domain = com
content_languages = eng
content_mime_type = text/html

These filters are applied before processing individual WARC records, which is significantly more expensive.

Good tech keywords to filter for include:

devops
database
program
network
sql
informatics
ios
oracle
unix
java
python
aspnet
cobol
developer
programmer
web-development
.
.
.
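
As a rough illustration of how the criteria and keywords above combine, here is a minimal sketch of a row-level predicate in plain Python. It assumes each index row is available as a dict keyed by the column names used in this issue; the actual column names in the index may differ, so check them against the schema. The function name `matches_filter` and the regex constants are hypothetical, and only the keywords listed so far are included.

```python
import re

# Keywords copied from the list above; the list in the issue is open-ended.
TECH_KEYWORDS = [
    "devops", "database", "program", "network", "sql", "informatics",
    "ios", "oracle", "unix", "java", "python", "aspnet", "cobol",
    "developer", "programmer", "web-development",
]

# career[s] / job[s], matched as case-insensitive substrings of the URL path.
JOB_PATTERN = re.compile(r"careers?|jobs?", re.IGNORECASE)
# Any one of the tech keywords, also matched as a substring.
TECH_PATTERN = re.compile(
    "|".join(re.escape(keyword) for keyword in TECH_KEYWORDS), re.IGNORECASE
)


def matches_filter(row):
    """Return True if an index row (a dict) passes every criterion."""
    path = row.get("url_path") or ""
    return (
        row.get("crawl") == "CC-MAIN-2021-10"
        and row.get("subset") == "warc"
        and row.get("fetch_status") == 200
        and row.get("top_level_domain") == "com"
        and row.get("content_languages") == "eng"  # assumes a single-language value
        and row.get("content_mime_type") == "text/html"
        and JOB_PATTERN.search(path) is not None
        and TECH_PATTERN.search(path) is not None
    )


example_row = {
    "crawl": "CC-MAIN-2021-10",
    "subset": "warc",
    "fetch_status": 200,
    "url_path": "/careers/senior-python-developer",
    "top_level_domain": "com",
    "content_languages": "eng",
    "content_mime_type": "text/html",
}
print(matches_filter(example_row))
```

Because the keywords are matched as substrings, short ones such as `ios` will also hit unrelated paths (for example `scenarios`), so the keyword pattern may need word boundaries or path-segment matching if false positives become a problem.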

vinceecws commented 3 years ago

First attempt by #67

Ahimsaka commented 3 years ago

CSVs written here: s3://maria-1086/Devin-Testing/outputs/test-write/ The code took 1 hour and 1 minute to run on EMR, which is a good indication that our stretch goal will be feasible.
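
For a back-of-the-envelope check of the stretch goal, the single-crawl runtime can be scaled linearly. This is only a sketch: it assumes every crawl takes about as long as this one and that crawls are processed one after another on the same cluster. The function name and the crawl count in the usage line are hypothetical placeholders, not the real number of crawls since 2018.

```python
from datetime import timedelta

# Measured runtime for one crawl (CC-MAIN-2021-10) on EMR, from the comment above.
ONE_CRAWL_RUNTIME = timedelta(hours=1, minutes=1)


def estimated_total_runtime(n_crawls, per_crawl=ONE_CRAWL_RUNTIME):
    """Linear estimate: total time = number of crawls x time per crawl."""
    return per_crawl * n_crawls


# Hypothetical crawl count, used only to show the calculation.
print(estimated_total_runtime(10))
```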

1086-Maria-Big-Data / JobAdAnalytics

Create first filtered index for use in MVP #69