
Glassdoor Jobs Data-Analysis

I came up with this personal project to test my skills to the fullest and learn new things. In this project I scraped job postings for the position of 'Data Scientist' from glassdoor.com, analyzed the gathered data, and framed a machine learning problem out of it. In the write-up below I'll go over the details of what I learned. I selected the states of California, Washington, and New York as the major areas in which to look for roles.

Please read the readme.md file to understand what I found.

The project consists of three main Jupyter notebooks:

  1. The first notebook, scrape_data.ipynb, contains the Python code for scraping job postings from glassdoor.com.
  2. The second notebook, glassdoor_eda.ipynb, contains the exploratory data analysis and the insights I was able to glean from the data.
  3. The third notebook, modelling.ipynb, contains a basic machine learning model to solve the framed problem and, most importantly, includes sections on 'Machine Learning Explainability'.

About Glassdoor


Glassdoor is a website where current and former employees anonymously review companies. Glassdoor also allows users to anonymously submit and view salaries as well as search and apply for jobs on its platform. Glassdoor launched its site in 2008 as a site that "collects company reviews and real salaries from employees of large companies and displays them anonymously for all members to see," according to TechCrunch. The company then averaged the reported salaries, posting these averages alongside the reviews employees wrote about the management and culture of the companies they worked for, including some of the larger tech companies like Google and Yahoo. The site also allows the posting of office photographs and other company-relevant media.

Stage I : Data Scraping

In this part of the project I developed a web scraper that pulls data from glassdoor.com. Here's how I went about creating it.

  1. The most important part of web scraping is understanding the website you are scraping, and by understanding I mean looking at the source code of the website in your browser. (I spent about 2 days understanding the structure of the website and locating the elements I needed to find.)
  2. The elements I was going to scrape were Company name, Job Title, Salary, Ratings, Job Description, etc. Everything except the job description was easy to scrape, because to see the whole job description one has to click on the company tab that contains it.

I dealt with it in two steps:

  1. First, I extracted all the job links from every results page on glassdoor.com and figured out how many jobs were listed on each page, which turned out to be 30.

  2. Then I went to every extracted link using Selenium, got the page source, parsed it with a BeautifulSoup object, and extracted the required elements.

    • The interesting thing was that all the details like the job title, company name, salary, etc. were present in a JSON blob inside a tag, so I extracted the content of the JSON and got most of my info.
    • I also found a class_ id for the tag that contained the job description, so I used it to extract the job description. After extracting the information I stored it in separate .csv files according to the search query. A rough sketch of this extraction flow is shown below.
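To make the flow above concrete, here is a minimal sketch of the per-job extraction step, assuming Selenium and BeautifulSoup. The selector names and JSON keys are placeholders rather than the exact ones used in scrape_data.ipynb, since Glassdoor's markup changes over time.

```python
import json

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()  # any Selenium-supported browser driver works


def scrape_job_page(url):
    """Pull the main fields for a single job posting page."""
    driver.get(url)
    soup = BeautifulSoup(driver.page_source, "html.parser")
    record = {}

    # Most fields sit in a JSON blob embedded in a tag in the page source.
    script = soup.find("script", {"type": "application/ld+json"})  # placeholder selector
    if script is not None and script.string:
        data = json.loads(script.string)
        record["Job_title"] = data.get("title")
        record["Company"] = data.get("hiringOrganization", {}).get("name")

    # The job description lives in a separate element, located via its class id.
    desc = soup.find(attrs={"class": "jobDescriptionContent"})  # placeholder class id
    if desc is not None:
        record["Job_Desc"] = desc.get_text(" ", strip=True)

    return record
```

In the actual pipeline, the records returned for each search query would be collected into a list and written out to the per-query .csv files mentioned above.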

Stage II : Exploratory Data Analysis a.k.a EDA

First of all, I used interactive plots. I created them with Plotly but did not use the offline version, so apologies if the plots do not render for you. If you want to see the notebook with the plots, here is the link: eda notebook
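For anyone who wants the figures to render in a saved copy of the notebook, below is a minimal sketch; it assumes Plotly is installed, and the data in the figure is just a placeholder.

```python
import plotly.express as px
import plotly.io as pio

# Embed the figure HTML directly in the .ipynb output so it survives saving/sharing.
pio.renderers.default = "notebook"

fig = px.bar(x=["CA", "WA", "NY"], y=[10, 7, 5],
             labels={"x": "State", "y": "Job postings"})
fig.show()
```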

I ran all of my code on Deepnote; you should check it out.

1) When I read all the data from the CSV files, I found that it contained duplicated rows, so my first task was to delete them. After that the data was pretty much clean, thanks to the careful scraping done earlier.
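A small sketch of that clean-up step, assuming pandas and the per-query CSVs produced in the scraping stage (the directory name is a placeholder):

```python
import glob

import pandas as pd

# Combine the per-query CSVs into one DataFrame.
frames = [pd.read_csv(path) for path in glob.glob("data/*.csv")]
jobs = pd.concat(frames, ignore_index=True)

print(jobs.shape)  # rows before de-duplication
jobs = jobs.drop_duplicates().reset_index(drop=True)
print(jobs.shape)  # rows after de-duplication
```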

2) Beginning of EDA

There are 12 columns in the data; they are as follows:

  1. Job_title : The title of the job you are applying to.
  2. Company : Company name.
  3. State/City : State/City in which the company's job posting is listed.
  4. Min_Salary : Minimum yearly salary in USD.
  5. Max_Salary : Maximum yearly salary in USD.
  6. Job_Desc : The job description, which includes skills, requirements, etc.
  7. Industry : The industry in which the company works.
  8. Date_posted : The date on which the job was posted on Glassdoor.
  9. Valid_until : The last date for applying to the job.
  10. Job_Type : Type of job, e.g. full-time, part-time, etc.
  11. Rating : Rating of the company.
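A quick way to sanity-check this schema, continuing from the clean-up sketch above (nothing here is specific to the actual notebook):

```python
# Confirm the columns and spot-check the salary fields.
print(jobs.columns.tolist())          # should match the column list above
print(jobs.isna().sum())              # missing values per column
print(jobs[["Min_Salary", "Max_Salary"]].describe())
```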

a. State Column

b. Industry Column

c. Exploring the Company Columns

d. Job Titles

e. Job Description

(The plots for sections a–e are interactive and can be viewed in the EDA notebook linked above.)

Stage III : Modelling

I have an exhaustive explanation in the modelling notebook; you can look at it there. There are explanations for every plot and method I used.
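As a purely illustrative pointer to the kind of 'Machine Learning Explainability' covered there, below is a minimal permutation-importance sketch; the model, features, and target are hypothetical stand-ins, not the problem actually framed in modelling.ipynb.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Placeholder data: 4 synthetic features and a target driven by the first two.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Permutation importance: how much does shuffling each feature hurt the score?
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature_{i}: {result.importances_mean[i]:.3f}")
```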

If you liked reading this, please show your appreciation by giving my work some stars.

Requirements