SoloHombs / My-Portifolio

Excel, SQL, Power BI, Python, Tableau, Docker, SQL Server, MySQL

Linkedin Job Scraping #6

Open SoloHombs opened 2 months ago

SoloHombs commented 2 months ago

Introduction In this project, we'll build a web scraper to extract job listings from a popular job search platform. We'll extract job titles, companies, locations, job descriptions, and other relevant information.

Here are the main steps we'll follow in this project:

1. Set up our development environment
2. Understand the basics of web scraping
3. Analyze the website structure of our job search platform
4. Write the Python code to extract job data from the platform
5. Save the data to a CSV file
6. Test our web scraper and refine the code as needed

Prerequisites: Before starting this project, you should have some basic knowledge of Python programming and HTML structure. In addition, you may want to use the following packages in your Python environment:

- pandas
- Selenium
- csv
- datetime

These packages should already be installed in your Jupyter Notebook environment. However, if you need packages that are not included, or you are working off platform, you can install them with `!pip install packagename` in a notebook cell, for example `!pip install pandas` or `!pip install selenium`.
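For example, a quick notebook cell along these lines (a minimal sketch; the version printout is just a sanity check) confirms the packages are available before you start:

```python
# Uncomment to install if anything is missing
# !pip install pandas selenium

# Sanity check: import the packages and print their versions
import pandas as pd
import selenium

print('pandas', pd.__version__)
print('selenium', selenium.__version__)
```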

SoloHombs commented 2 months ago

```python
# Import libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support.select import Select
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from selenium.common.exceptions import StaleElementReferenceException
import pandas as pd
import time
```

Create the ChromeDriver service and launch Chrome

```python
service = Service(executable_path='chromedriver.exe')
driver = webdriver.Chrome(service=service)
driver.maximize_window()
```

Go to Google

```python
url = 'https://www.google.com/'
driver.get(url)
```

Enter the search term into Google's search box

```python
time.sleep(5)

WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.CLASS_NAME, 'gLFyf')))
input_element = driver.find_element(By.CLASS_NAME, 'gLFyf')
input_element.clear()  # in case there was already text in the input box
input_element.send_keys('linkedin.com' + Keys.ENTER)
```

Open the LinkedIn result from the Google search

```python
WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.PARTIAL_LINK_TEXT, "LinkedIn: Log In or Sign Up")))
link = driver.find_element(By.PARTIAL_LINK_TEXT, "LinkedIn: Log In or Sign Up")
link.click()
time.sleep(5)
```
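As an aside, the detour through Google is optional; the same page can be reached directly (this shortcut is not part of the original notebook):

```python
# Alternative: skip the Google search and open LinkedIn directly
driver.get('https://www.linkedin.com/')
```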

Log in to LinkedIn

```python
WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.ID, 'session_key')))
username = driver.find_element(By.ID, 'session_key')
username.clear()  # in case there was already text in the input box
username.send_keys('XXXXXXXXXXXXX.com' + Keys.ENTER)  # put your own username here

WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.ID, 'session_password')))
password = driver.find_element(By.ID, 'session_password')
password.clear()
password.send_keys('XXXXXXXXXXXXXXXXXXXXXXXXXXXX' + Keys.ENTER)  # put your own password here
time.sleep(10)
```
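To avoid hard-coding real credentials, one option (not from the original code; the variable names are hypothetical) is to read them from environment variables and pass those to `send_keys()` instead of the placeholder strings:

```python
import os

# Hypothetical variable names; set them before launching the notebook, e.g.
#   export LINKEDIN_USER="you@example.com"
#   export LINKEDIN_PASS="your-password"
linkedin_user = os.environ.get('LINKEDIN_USER')
linkedin_pass = os.environ.get('LINKEDIN_PASS')
```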

Use the Jobs tab and the search box to look for jobs

```python
WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, "(//li//a)[3]")))
job_link = driver.find_element(By.XPATH, "(//li//a)[3]")
job_link.click()
time.sleep(10)

# Now let's search the job by title
WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, "//input[@aria-label='Search by title, skill, or company'][1]")))
search = driver.find_element(By.XPATH, "//input[@aria-label='Search by title, skill, or company'][1]")
search.clear()
search.send_keys('Data Analyst jobs' + Keys.ENTER)
time.sleep(10)

# Scroll the whole page
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
time.sleep(10)
```
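A single scroll may not load every job card. One generic refinement, sketched below under the assumption that more results load as the page scrolls (LinkedIn sometimes puts the job list in its own scrollable pane, in which case that element should be scrolled instead), is to keep scrolling until the page height stops changing:

```python
# Scroll repeatedly until the page height stops growing
last_height = driver.execute_script('return document.body.scrollHeight')
while True:
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(3)
    new_height = driver.execute_script('return document.body.scrollHeight')
    if new_height == last_height:
        break
    last_height = new_height
```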

Expand the results with the "Show all" button

```python
try:
    WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//a[@aria-label='Show all Remote opportunities']")))
    seemore = driver.find_element(By.XPATH, "//a[@aria-label='Show all Remote opportunities']")
    seemore.click()
except TimeoutException:
    print("Timed out waiting for the 'Show all' button; it was nowhere to be found")
```

Now let's scrape the data

```python
for page_number in range(1, 10):
    try:
        # Scrape the job titles
        WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//a[@class='disabled ember-view job-card-container__link job-card-list__title job-card-list__title--link']")))
        jobs = driver.find_elements(By.XPATH, "//a[@class='disabled ember-view job-card-container__link job-card-list__title job-card-list__title--link']")
        job_names = []
        for name in jobs:
            job_names.append(name.text)

        # Scrape the organisation names
        WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.XPATH, "//span[@class='job-card-container__primary-description ']")))
        org = driver.find_elements(By.XPATH, "//span[@class='job-card-container__primary-description ']")
        org_name = []
        for nam in org:
            org_name.append(nam.text)

        # Scrape the job type/location
        WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.XPATH, "//li[@class='job-card-container__metadata-item ']")))
        location = driver.find_elements(By.XPATH, "//li[@class='job-card-container__metadata-item ']")
        job_loc = []
        try:
            for loc in location:
                job_loc.append(loc.text)
        except Exception:  # e.g. a stale element; fall back to a placeholder
            job_loc.append('job location not found')

        # Scrape the posting period (stored in `posted_times` so the time module is not shadowed)
        WebDriverWait(driver, 5).until(EC.presence_of_all_elements_located((By.TAG_NAME, "time")))
        posted_times = driver.find_elements(By.TAG_NAME, "time")
        period = []
        for tm in posted_times:
            period.append(tm.text)

        # Combine the columns and save to CSV (the file is overwritten on every iteration)
        data = zip(job_names, org_name, job_loc, period)
        Linkedin = pd.DataFrame(data, columns=['Job name', 'Organisation name', 'Job Location', 'Period job Posted'])
        Linkedin.to_csv(r'C:\Users\DELL\Desktop\Webdriver\Linkedin Jobs.csv', index=False)

    except TimeoutException:
        print(f"Timed out waiting for page {page_number} element")

    try:
        # Find the pagination element containing all page numbers
        # (the class names below may need updating to match the live page)
        pagination = WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//ul[@class='C']"))
        )

        # Find the page number element for the current page
        # (as written this targets the currently selected page; to advance,
        #  the XPath likely needs to point at the next page number instead)
        page_to_click = pagination.find_element(By.XPATH, "//li[@class='artdeco-pagination__indicator artdeco-pagination__indicator--number active selected ember-view']")
        page_to_click.click()

        # Add your scraping logic here to scrape data from the current page

    except Exception as e:
        print(f"Error while navigating to page {page_number}: {str(e)}")
```

Close the WebDriver session after scraping all pages

```python
driver.quit()
```
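One thing to be aware of: `Linkedin.to_csv(...)` runs inside the page loop, so each iteration overwrites the file with only that page's rows. A minimal sketch of an alternative (not part of the original code) is to collect each page's DataFrame in a list and save a single combined CSV after the loop:

```python
all_pages = []  # inside the loop, append each page: all_pages.append(Linkedin)

# After the loop, combine everything and save once
if all_pages:
    combined = pd.concat(all_pages, ignore_index=True)
    combined = combined.drop_duplicates()  # cards repeated across pages collapse to one row
    combined.to_csv(r'C:\Users\DELL\Desktop\Webdriver\Linkedin Jobs.csv', index=False)
```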

SoloHombs commented 2 months ago

```python
Linkedin.head()
Linkedin.tail()
len(job_loc)
Linkedin.loc[[2, 10]]
Linkedin.describe()
```
