PSavvateev / JobScrapingApp_Indeed.nl

Web scraper to get information about posted jobs in the Netherlands from Indeed.nl

Not showing all results #2

Closed mrprez083 closed 2 years ago

mrprez083 commented 2 years ago

Everything is fine, but for some reason there is a discrepancy between the number of jobs a company has listed and the number returned in the data-dump CSV. For example, Apple has 9K open jobs, yet my data dump consistently yields only 1000.

PSavvateev commented 2 years ago

Can you please share your search parameters? I will try to replicate it.

mrprez083 commented 2 years ago

Well, I tweaked the code to work on the United States version of Indeed, so I'm not sure how much that will help. Has this error occurred with the NL version for anyone?

PSavvateev commented 2 years ago

The US site has a slightly different HTML layout, but the logic is the same. Which keyword did you use for the search?

mrprez083 commented 2 years ago

It's a really large list but I will post it below. There should be upwards of 90k+ jobs but the final scrape only yields around 15k. Here's the list: "Apple", "Intel", "NVIDIA", "Arrow-Electronics", "SiTime", "HP", "BestBuy", "Avnet", "Axcelis-Technologies", "Cadence-Design-Systems", "Celestia", "Plexus", "Jabil", "Flex", "CDW", "TD-Synnex", "SQI-Diagnostics", "Energous", "Himax", "Knowles", "Impinj", "Cirrus-Logic", "Sequans-Communications", "MaxLinear", "QUALCOMM", "American-Superconductor", "Synaptics", "Ambarella", "Advanced-MicroDevices", "Lattice-SemiconductorCorp", "O2Micro", "Analog-Devices", "Texas-Instruments", "Allegro-MicroSystems", "MACOMTechnologySolutions", "Silicon-Laboratories", "Semtech", "Microchip-Technology", "Monolithic-PowerSystem", "Power-Integrations", "Broadcom", "Skyworks", "Qorvo", "Western-Digital", "Micron", "Infineon", "STMicroelectronics", "Littelfuse", "ON-Semiconductor", "Vishay-Intertechnology", "NXP", "Diodes", "Alpha-Omega-Semiconductor", "TE-Connectivity", "Amphenol", "Sensata", "Entegris", "NOVA", "Ichor", "VAT-Group", "Lam-Research", "Advanced-Energy", "Applied-Materials", "KLA", "Ultra-Clean", "ASML", "MKS-Instruments", "National-Instruments", "Keysight-Technologies", "Teradyne", "FormFactor", "Tower-Semiconductor", "ASE-Technology", "TaiwanSemiconductorManufacturing", "United-Microelectronics", "Amkor", "SkyWater", "Thunderstruck-Resources", "Arteris-IP", "Credo-Semiconductor", "Synopsys", "PDF-Solutions", "ANSYS", "Universal-Display", "Ceva", "Sanmina", "IQE", "Acuity-Brands", "Aixtron-SE", "Emerson-Electric", "Deere-Company", "Air-Products-Honeywell", "General-Electric", "Dover", "Aehr", "Veeco-Instruments", "DuPont", "Wolfspeed", "FN", "Viavi", "II-VI", "Ciena", "Infinera", "Lumentum", "IPG-Photonics", "Cognex-Corporation", "Amkor", "Nlight", "Hewlett-Packard-Enterprise", "IBM", "Dell", "Super-MicroComputer", "Nutanix", "CommVault-Systems", "NetApp", "Pure-Storage", "Digi", "HubGroup", "Globalstar", "JBHunt", "Viasat", 
"CalAmp", "Iridium-Communications", "Seagate-Technology", "T-Mobile", "Verizon", "AT%26T", "Corning", "Garmin", "GoPro", "Logitech", "Zebra-Technologies", "Peloton", "BlackBerry", "Nokia", "F5", "Cisco", "Juniper-Networks", "Arista-Networks", "Xilinx","Onto-Innovation", "Solaris-Infrastructure"

PSavvateev commented 2 years ago

Ok, I've just had a look at the original US site. I tried only 'NVIDIA' as a search parameter. On the first page it says '1,321 jobs | page 1 of 132', so there should be 132 pages. However, if you click through page by page, you end up at page 34 with no further options. So if you scrape for 'NVIDIA', you get 34 pages * 15 jobs per page = 510 found jobs only. I think Indeed.com applies algorithms to hide potential duplicates, but it is quite confusing. The same holds for any search: the number of published jobs is much lower than the estimate shown on the first page.
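To make that gap measurable rather than eyeballed, here is a small sketch: parse the advertised count from the results header and compare it with pages-reachable times jobs-per-page. The counter markup (`jobsearch-JobCountAndSortPane-jobCount`) is an assumption from inspecting the US site and may not match the current page.

```python
from bs4 import BeautifulSoup
import re

# Fabricated results-page header for illustration; the class name
# 'jobsearch-JobCountAndSortPane-jobCount' is an assumption about the
# US site's markup and may differ on the live page.
sample_html = '<div class="jobsearch-JobCountAndSortPane-jobCount">1,321 jobs</div>'

def advertised_job_count(html):
    """Parse the job count shown at the top of the first results page."""
    soup = BeautifulSoup(html, 'html.parser')
    counter = soup.find('div', {'class': 'jobsearch-JobCountAndSortPane-jobCount'})
    return int(re.sub(r'[^\d]', '', counter.text))  # strip commas and 'jobs'

advertised = advertised_job_count(sample_html)  # count shown on page 1
reachable = 34 * 15                             # reachable pages * jobs per page
print(advertised, reachable)                    # 1321 510
```

Running this for each keyword would let you log "advertised vs. actually scrapable" per company instead of discovering the shortfall only in the final CSV.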

PSavvateev commented 2 years ago

Also, here is my version of the code to scrape from indeed.com (USA):

import requests
import time
from datetime import datetime, timedelta
from bs4 import BeautifulSoup
import re

import pandas as pd

source = "indeed.com"
cookies = {'aep_usuc_f': 'region=US&site=glo&b_locale=en_US&c_tp=USD'}

def get_url(position):
    """
    Generate the search URL for a given position keyword
    """
    url = f"https://www.indeed.com/jobs?q={position}"

    return url
def get_job_date(card):
    """
    Extract the posting date from a job card
    """
    post_str = card.find('span', {'class': 'date'}).text  # footer text, e.g. 'Posted 3 days ago'
    post_days = re.findall(r'\d+', post_str)  # extract the number of days from post_str

    if post_days:
        # calculated date of job posting if days are mentioned
        job_date = (datetime.now() - timedelta(days=int(post_days[0]))).strftime("%d/%m/%Y")
    else:
        job_date = datetime.now().strftime("%d/%m/%Y")  # if days are not mentioned - using today

    return job_date
def get_job_salaries(card):
    """
    Extract salary figures from a job card, if present
    """

    try:
        salary_str = card.find('div', 'metadata salary-snippet-container').text
        salaries = re.findall(r"\b(\w+[.]\w+)", salary_str)

    except AttributeError:
        salaries = []

    return salaries
def get_record(card):
    """
    Extract job data from a single record
    """
    span_tag = card.h2.a.span
    a_tag = card.h2.a

    job_id = a_tag.get("data-jk")  # unique job id
    job_title = span_tag.get("title")  # job title
    job_url = 'https://www.indeed.com' + a_tag.get('href')  # job url
    company_name = card.find('span', {'class': 'companyName'}).text  # company name
    job_loc = card.find('div', {'class': 'companyLocation'}).text  # job location
    job_summary = card.find('div', {'class': 'job-snippet'}).text.strip()  # job description
    job_date = get_job_date(card)  # job posting date
    job_salary = get_job_salaries(card)  # job salaries if any

    record = (job_id, job_title, job_date, job_loc, job_summary, job_salary, job_url, company_name)

    return record
def get_jobs(position):
    """
    Scrape all result pages for a position and return the records as a DataFrame
    """

    url = get_url(position)
    records = []

    # extract the job data

    while True:

        # request the page, retrying on connection failures;
        # note: requests raises requests.exceptions.ConnectionError,
        # which the builtin ConnectionError does not catch
        response = None
        while response is None:
            try:
                response = requests.get(url=url, cookies=cookies)
            except requests.exceptions.ConnectionError:
                print("Connection refused by the server..")
                print("Let me sleep for 5 seconds")
                print("ZZzzzz...")
                time.sleep(5)
                print("Was a nice sleep, now let me continue...")

        soup = BeautifulSoup(response.text, 'html.parser')

        cards = soup.find_all('div', 'job_seen_beacon')

        for card in cards:
            record = get_record(card)
            records.append(record)

        time.sleep(3)  # making a pause before moving to the next page

        # moving to the next page - > assigning a new url
        try:
            url = 'https://www.indeed.com' + soup.find('a', {'aria-label': 'Next'}).get('href')

        except AttributeError:
            break

    # save the data as a DataFrame
    columns = ['job_id',
               'job_title',
               'job_date',
               'job_loc',
               'job_summary',
               'job_salary',
               'job_url',
               'company_name']
    df = pd.DataFrame(data=records, columns=columns)

    # adding to DF columns with search parameters
    search_time = datetime.now().strftime("%d/%m/%Y, %H:%M:%S")

    df.insert(loc=6, column="job_education", value="All")
    df.insert(loc=9, column="company_type", value="All")
    df.insert(loc=10, column="search_time", value=search_time)
    df.insert(loc=11, column="search_position", value=position)
    df.insert(loc=12, column="source", value=source)

    return df