1086-Maria-Big-Data / JobAdAnalytics


Job Ad Analytics Project

General Info

This project uses the free Common Crawl datasets hosted on Amazon Web Services to answer client questions about job market advertisements.

Description

Job Ad Analytics uses Common Crawl data (https://commoncrawl.org/the-data/get-started/) as a data source for analyzing job ads across the internet. Common Crawl is a universally accessible and analyzable repository of web crawl data that includes raw web page data, extracted metadata, and text extractions. To handle the vast amount of data Common Crawl provides, the project runs its Spark application on AWS EMR. Job Ad Analytics queries the Common Crawl index for URLs containing "job(s)" or "career(s)" in order to filter out pages that are not job advertisements within each crawl. The program then goes through each segment within a crawl to access the WARC files, which contain the WARC objects holding the data and metadata for every web page identified as an advertisement from the index. From there, each web page pertaining to job advertisements is analyzed and queried to fulfill the client's requirements.
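The URL filtering step described above can be illustrated with a minimal sketch. This is not the project's actual code: the function name `looks_like_job_url`, the pattern `JOB_URL_PATTERN`, and the choice to match "job(s)" or "career(s)" only as whole tokens in the host and path are all assumptions made for illustration. The project may well use a plain substring match inside a Spark query instead.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern (an assumption, not taken from the project):
# match "job", "jobs", "career", or "careers" only when not surrounded by other
# letters, so that words such as "jobless" are not counted as job ads.
JOB_URL_PATTERN = re.compile(r"(?<![a-z])(jobs?|careers?)(?![a-z])", re.IGNORECASE)


def looks_like_job_url(url: str) -> bool:
    """Return True if the host or path of `url` contains a job/career token."""
    parsed = urlparse(url)
    # Only the host and path are checked; query strings and fragments are ignored.
    return bool(JOB_URL_PATTERN.search(parsed.netloc + parsed.path))


if __name__ == "__main__":
    candidates = [
        "https://example.com/careers/data-engineer",
        "https://jobs.example.com/listing/42",
        "https://example.com/blog/jobless-claims",
        "https://example.com/about",
    ]
    for url in candidates:
        print(url, looks_like_job_url(url))
```

In a Spark job, a predicate like this could be applied to the URL column of the index before any WARC files are read, so that only candidate job pages are fetched.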

More In-Depth Descriptions

To run on EMR:

File hierarchy within each Common Crawl crawl

File types within each Common Crawl crawl