austinoboyle / scrape-linkedin-selenium

`scrape_linkedin` is a python package that allows you to scrape personal LinkedIn profiles & company pages - turning the data into structured json.
MIT License

scrape_linkedin

Introduction

scrape_linkedin is a Python package that scrapes public LinkedIn user profiles and company pages and turns the data into structured JSON.

Warning: LinkedIn has strong anti-scraping policies and may blacklist IPs that make unauthenticated or unusual requests.

Installation

Install with pip

Run pip install git+git://github.com/austinoboyle/scrape-linkedin-selenium.git

Install from source

git clone https://github.com/austinoboyle/scrape-linkedin-selenium.git

Run python setup.py install

Tests

Tests are (so far) only run on static HTML files: one is a LinkedIn profile, and the other is used only to test some utility functions.

Getting & Setting LI_AT

Because of LinkedIn's anti-scraping measures, your Selenium browser must look like an actual user. To do this, add the li_at cookie to the Selenium session.
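
For illustration, here is a minimal sketch of what attaching the cookie to a Selenium session could look like. The helper name add_li_at_cookie and the cookie domain are assumptions made for this example and are not part of scrape_linkedin; only driver.get and driver.add_cookie are standard Selenium WebDriver methods.

```python
def add_li_at_cookie(driver, li_at_value):
    """Attach the li_at cookie to a Selenium WebDriver (hypothetical helper)."""
    # Selenium only accepts cookies for the domain that is currently loaded,
    # so open LinkedIn before adding the cookie.
    driver.get("https://www.linkedin.com")
    # The domain below is an assumption; adjust it if your browser shows another.
    cookie = {"name": "li_at", "value": li_at_value, "domain": ".linkedin.com"}
    driver.add_cookie(cookie)
    return cookie
```

With a real browser this would be called as add_li_at_cookie(webdriver.Chrome(), 'YOUR_LI_AT_VALUE'). The scrapers in this package handle the cookie for you, so the sketch only shows the idea.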

Getting LI_AT

  1. Navigate to www.linkedin.com and log in
  2. Open browser developer tools (Ctrl-Shift-I or right click -> inspect element)
  3. Select the appropriate tab for your browser (Application on Chrome, Storage on Firefox)
  4. Click the Cookies dropdown on the left-hand menu, and select the www.linkedin.com option
  5. Find and copy the li_at value

Setting LI_AT

There are two ways to set your li_at cookie:

  1. Set the LI_AT environment variable
    • $ export LI_AT=YOUR_LI_AT_VALUE
    • On Windows: C:/foo/bar> set LI_AT=YOUR_LI_AT_VALUE
  2. Pass the cookie as a parameter to the Scraper object.

    >>> with ProfileScraper(cookie='YOUR_LI_AT_VALUE') as scraper:

A cookie value passed directly to the Scraper will override your environment variable if both are set.
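
The precedence rule above can be sketched as follows. This is an illustration under stated assumptions, not the package's actual code: resolve_li_at is a hypothetical helper, and raising ValueError when no cookie is found is a choice made for this example.

```python
import os


def resolve_li_at(cookie=None, environ=os.environ):
    """Pick the li_at value: explicit argument first, then the LI_AT variable."""
    if cookie:
        # A value passed directly wins over the environment.
        return cookie
    value = environ.get("LI_AT")
    if not value:
        # Hypothetical error handling for the case where neither is set.
        raise ValueError("No li_at cookie: pass cookie=... or set LI_AT")
    return value
```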

Examples

See /examples

Usage

Command Line

scrape_linkedin comes with a command-line tool, scrapeli, built with click.

Note: the CLI currently works only with personal profiles.

Options:

Examples:

Python Package

Profiles

Use the ProfileScraper component to scrape profiles.

from scrape_linkedin import ProfileScraper

with ProfileScraper() as scraper:
    profile = scraper.scrape(user='austinoboyle')
print(profile.to_dict())

Profile is the class whose properties give access to all information pulled from a profile. It also has a to_dict() method that returns all of the data as a dict.

# Assumes Profile is exported at the package top level, like ProfileScraper
from scrape_linkedin import Profile

# Build a Profile from a saved HTML page instead of a live scrape
with open('profile.html', 'r') as profile_file:
    profile = Profile(profile_file.read())

print(profile.skills)
# [{...} ,{...}, ...]
print(profile.experiences)
# {jobs: [...], volunteering: [...],...}
print(profile.to_dict())
# {personal_info: {...}, experiences: {...}, ...}
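
Since to_dict() returns a plain dict, the result can be written out as JSON with the standard library. A minimal sketch, assuming the dict contains only JSON-serializable values; save_as_json is a hypothetical helper, not part of the package.

```python
import json


def save_as_json(data, path):
    """Write a dict (for example the result of to_dict()) to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII names readable in the output file.
        json.dump(data, f, indent=2, ensure_ascii=False)
```

For example, save_as_json(profile.to_dict(), 'profile.json').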

Structure of the fields scraped

Companies

Use the CompanyScraper component to scrape companies.

from scrape_linkedin import CompanyScraper

with CompanyScraper() as scraper:
    company = scraper.scrape(company='facebook')
print(company.to_dict())

Company - the class that has properties to access all information pulled from a company profile. There will be three properties: overview, jobs, and life. Overview is the only one currently implemented.

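# Assumes Company is exported at the package top level, like CompanyScraper
from scrape_linkedin import Company

# Backslashes let the with statement span several lines
with open('overview.html', 'r') as overview, \
        open('jobs.html', 'r') as jobs, \
        open('life.html', 'r') as life:
    company = Company(overview, jobs, life)

print(company.overview)
# {...}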

Structure of the fields scraped

config

Pass these keyword arguments into the constructor of your Scraper to override default values. You may (for example) want to decrease/increase the timeout if your internet is very fast/slow.

Scraping in Parallel

New in version 0.2: built-in parallel scraping. The up-front cost of starting a browser session is high, so this only pays off when you are scraping many (> 15) profiles.

Example

from scrape_linkedin import scrape_in_parallel, CompanyScraper

companies = ['facebook', 'google', 'amazon', 'microsoft', ...]

# Scrape all companies, write the output to 'companies.json', use 4 browser instances
scrape_in_parallel(
    scraper_type=CompanyScraper,
    items=companies,
    output_file="companies.json",
    num_instances=4
)
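
To give a feel for how work might be divided across browser instances, here is a small sketch that splits a list into near-equal chunks. It is an illustration only: split_evenly is a hypothetical helper, and the package may distribute items differently.

```python
def split_evenly(items, num_instances):
    """Split items into num_instances contiguous chunks of near-equal size."""
    size, extra = divmod(len(items), num_instances)
    chunks, start = [], 0
    for i in range(num_instances):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks
```

With 4 instances and only a handful of items, each chunk holds one or two entries, which is why the browser start-up cost dominates for small jobs.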

Configuration

Parameters:

Issues

Report bugs and feature requests here.