Closed mrprez083 closed 2 years ago
Can you please share your search parameters? I will try to replicate.
Well, I tweaked the code to work on the United States version of Indeed, so I'm not sure how much it will help. Has this error occurred with the NL version for anyone?
The US site has a slightly different HTML design, but the logic is the same. Which keyword did you use for the search?
It's a really long list, but I'll post it below. There should be upwards of 90k jobs, but the final scrape only yields around 15k. Here's the list: "Apple", "Intel", "NVIDIA", "Arrow-Electronics", "SiTime", "HP", "BestBuy", "Avnet", "Axcelis-Technologies", "Cadence-Design-Systems", "Celestia", "Plexus", "Jabil", "Flex", "CDW", "TD-Synnex", "SQI-Diagnostics", "Energous", "Himax", "Knowles", "Impinj", "Cirrus-Logic", "Sequans-Communications", "MaxLinear", "QUALCOMM", "American-Superconductor", "Synaptics", "Ambarella", "Advanced-MicroDevices", "Lattice-SemiconductorCorp", "O2Micro", "Analog-Devices", "Texas-Instruments", "Allegro-MicroSystems", "MACOMTechnologySolutions", "Silicon-Laboratories", "Semtech", "Microchip-Technology", "Monolithic-PowerSystem", "Power-Integrations", "Broadcom", "Skyworks", "Qorvo", "Western-Digital", "Micron", "Infineon", "STMicroelectronics", "Littelfuse", "ON-Semiconductor", "Vishay-Intertechnology", "NXP", "Diodes", "Alpha-Omega-Semiconductor", "TE-Connectivity", "Amphenol", "Sensata", "Entegris", "NOVA", "Ichor", "VAT-Group", "Lam-Research", "Advanced-Energy", "Applied-Materials", "KLA", "Ultra-Clean", "ASML", "MKS-Instruments", "National-Instruments", "Keysight-Technologies", "Teradyne", "FormFactor", "Tower-Semiconductor", "ASE-Technology", "TaiwanSemiconductorManufacturing", "United-Microelectronics", "Amkor", "SkyWater", "Thunderstruck-Resources", "Arteris-IP", "Credo-Semiconductor", "Synopsys", "PDF-Solutions", "ANSYS", "Universal-Display", "Ceva", "Sanmina", "IQE", "Acuity-Brands", "Aixtron-SE", "Emerson-Electric", "Deere-Company", "Air-Products-Honeywell", "General-Electric", "Dover", "Aehr", "Veeco-Instruments", "DuPont", "Wolfspeed", "FN", "Viavi", "II-VI", "Ciena", "Infinera", "Lumentum", "IPG-Photonics", "Cognex-Corporation", "Amkor", "Nlight", "Hewlett-Packard-Enterprise", "IBM", "Dell", "Super-MicroComputer", "Nutanix", "CommVault-Systems", "NetApp", "Pure-Storage", "Digi", "HubGroup", "Globalstar", "JBHunt", "Viasat", 
"CalAmp", "Iridium-Communications", "Seagate-Technology", "T-Mobile", "Verizon", "AT%26T", "Corning", "Garmin", "GoPro", "Logitech", "Zebra-Technologies", "Peloton", "BlackBerry", "Nokia", "F5", "Cisco", "Juniper-Networks", "Arista-Networks", "Xilinx","Onto-Innovation", "Solaris-Infrastructure"
OK, I've just had a look at the original US site. I tried only 'NVIDIA' as a search parameter. The first page says '1,321 jobs | page 1 of 132', so there are supposed to be 132 pages. However, if you click through page by page, you end up at page 34 with no further options. So if you scrape for 'NVIDIA', you get 34 pages * 15 jobs per page = 510 jobs only. I think Indeed.com applies algorithms to hide potential duplicates, but it is quite confusing. And the same holds for any search: the number of published jobs is much lower than the estimate shown on the first page.
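The gap between the advertised total and the reachable pages is easy to quantify. A minimal sketch, using the counter string and per-page count observed above (`parse_job_count` is a hypothetical helper, not part of Indeed's markup or any API):

```python
import re

def parse_job_count(counter_text):
    """Pull the advertised total out of Indeed's 'N jobs | page 1 of M' counter."""
    match = re.search(r'([\d,]+)\s+jobs', counter_text)
    return int(match.group(1).replace(',', '')) if match else 0

advertised = parse_job_count('1,321 jobs | page 1 of 132')
jobs_per_page = 15
reachable_pages = 34  # observed: pagination stops here despite 'page 1 of 132'

print(advertised)                       # 1321 jobs advertised
print(reachable_pages * jobs_per_page)  # 510 jobs actually reachable
```

Comparing those two numbers per search term would show how much of the shortfall is pagination being cut off rather than the scraper missing cards.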
Also, here is my version of the code to scrape from indeed.com (USA):
```python
import requests
import time
from datetime import datetime, timedelta
from bs4 import BeautifulSoup
import re
import pandas as pd

source = "indeed.com"
cookies = {'aep_usuc_f': 'region=US&site=glo&b_locale=en_US&c_tp=USD'}


def get_url(position):
    """
    Generate the search URL from the position and company type: recruiter or
    direct employer; education level.
    """
    url = f"https://indeed.com/jobs?q={position}"
    return url


def get_job_date(card):
    """
    Extracts the posting date from a job card.
    """
    post_str = card.find('span', {'class': 'date'}).text  # footer text: "N days ago"
    post_days = re.findall(r'\d+', post_str)  # extract the number of days from post_str
    if post_days:
        # calculate the posting date if a number of days is mentioned
        job_date = (datetime.now() - timedelta(days=int(post_days[0]))).strftime("%d/%m/%Y")
    else:
        job_date = datetime.now().strftime("%d/%m/%Y")  # no days mentioned - use today
    return job_date


def get_job_salaries(card):
    """
    Extracts salaries, returning an empty list when no salary is shown.
    """
    try:
        salary_str = card.find('div', 'metadata salary-snippet-container').text
        salaries = re.findall(r"\b(\w+[.]\w+)", salary_str)
    except AttributeError:
        salaries = []
    return salaries


def get_record(card):
    """
    Extract the job data from a single card.
    """
    span_tag = card.h2.a.span
    a_tag = card.h2.a
    job_id = a_tag.get("data-jk")  # unique job id
    job_title = span_tag.get("title")
    job_url = 'https://www.indeed.com' + a_tag.get('href')
    company_name = card.find('span', {'class': 'companyName'}).text
    job_loc = card.find('div', {'class': 'companyLocation'}).text
    job_summary = card.find('div', {'class': 'job-snippet'}).text.strip()
    job_date = get_job_date(card)  # job posting date
    job_salary = get_job_salaries(card)  # job salaries, if any
    record = (job_id, job_title, job_date, job_loc, job_summary, job_salary, job_url, company_name)
    return record


def get_jobs(position):
    """
    Creates a DataFrame with all scraped jobs, following pagination to the last page.
    """
    url = get_url(position)
    records = []
    # extract the job data
    while True:
        response = ""
        while response == "":
            try:
                response = requests.get(url=url, cookies=cookies)
                break
            except requests.exceptions.ConnectionError:
                print("Connection refused by the server..")
                print("Let me sleep for 5 seconds")
                print("ZZzzzz...")
                time.sleep(5)
                print("Was a nice sleep, now let me continue...")
                continue
        soup = BeautifulSoup(response.text, 'html.parser')
        cards = soup.find_all('div', 'job_seen_beacon')
        for card in cards:
            record = get_record(card)
            records.append(record)
        time.sleep(3)  # pause before moving to the next page
        # move to the next page -> assign a new url
        try:
            url = 'https://indeed.com' + soup.find('a', {'aria-label': 'Next'}).get('href')
        except AttributeError:
            break
    # save the data as a DataFrame
    columns = ['job_id', 'job_title', 'job_date', 'job_loc', 'job_summary',
               'job_salary', 'job_url', 'company_name']
    df = pd.DataFrame(data=records, columns=columns)
    # add columns with the search parameters
    search_time = datetime.now().strftime("%d/%m/%Y, %H:%M:%S")
    df.insert(loc=6, column="job_education", value="All")
    df.insert(loc=9, column="company_type", value="All")
    df.insert(loc=10, column="search_time", value=search_time)
    df.insert(loc=11, column="search_position", value=position)
    df.insert(loc=12, column="source", value=source)
    return df
```
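The relative-date logic in `get_job_date` is the one part that can be checked without live HTML. A minimal standalone sketch of that same logic, with a fixed reference date for reproducibility (`parse_posted_date` is my own name for it, not a function from the code above):

```python
import re
from datetime import datetime, timedelta

def parse_posted_date(post_str, today=None):
    """Convert an Indeed footer like '5 days ago' to a dd/mm/YYYY date string."""
    today = today or datetime.now()
    days = re.findall(r'\d+', post_str)
    if days:
        # a number of days is mentioned: count back from today
        return (today - timedelta(days=int(days[0]))).strftime("%d/%m/%Y")
    # no digits (e.g. 'Just posted' / 'Today'): use today's date
    return today.strftime("%d/%m/%Y")

ref = datetime(2022, 3, 10)
print(parse_posted_date('Posted 5 days ago', today=ref))  # 05/03/2022
print(parse_posted_date('Just posted', today=ref))        # 10/03/2022
```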
Everything is fine, but for some reason there is a discrepancy between the number of jobs a company has listed and the number returned in the data dump CSV. For example, Apple has 9K open jobs, yet my data dump consistently yields only around 1000.
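One thing worth checking before blaming the site: whether the dump contains repeated `job_id`s, since the pagination issue described above suggests Indeed re-serves overlapping listings across pages. A quick sketch with made-up records showing how to count unique jobs in the resulting DataFrame:

```python
import pandas as pd

# made-up records: the same job_id appearing on several result pages
df = pd.DataFrame({
    'job_id':    ['a1', 'b2', 'a1', 'c3', 'b2'],
    'job_title': ['SWE', 'EE', 'SWE', 'PM', 'EE'],
})

# drop rows that repeat a job_id, keeping the first occurrence
unique_jobs = df.drop_duplicates(subset='job_id')
print(len(df), len(unique_jobs))  # 5 rows scraped, 3 unique jobs
```

If `len(unique_jobs)` is far below `len(df)` on a real dump, the shortfall is at least partly duplicate listings rather than missed pages.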