scrape_linkedin is a Python package that scrapes all details from public LinkedIn profiles and turns the data into structured JSON. You can scrape both company pages and user profiles with this package.
Warning: LinkedIn has strong anti-scraping policies; it may blacklist IPs that make unauthenticated or unusual requests.
Install via pip:

```
pip install git+https://github.com/austinoboyle/scrape-linkedin-selenium.git
```

Or install from source:

```
git clone https://github.com/austinoboyle/scrape-linkedin-selenium.git
cd scrape-linkedin-selenium
python setup.py install
```
Tests are (so far) only run on static HTML files: one is a LinkedIn profile, the other is used to test some utility functions.
Because of LinkedIn's anti-scraping measures, you must make your Selenium browser look like an actual user. To do this, you need to add the li_at cookie to the Selenium session. You can find its value by logging in to www.linkedin.com and inspecting your browser's cookies for that site.
There are two ways to set your li_at cookie:

1. Set the LI_AT environment variable:

```
$ export LI_AT=YOUR_LI_AT_VALUE        # Linux/macOS
C:/foo/bar> set LI_AT=YOUR_LI_AT_VALUE # Windows
```

2. Pass the cookie value directly to the Scraper:

```python
>>> with ProfileScraper(cookie='YOUR_LI_AT_VALUE') as scraper:
```
A cookie value passed directly to the Scraper will override your environment variable if both are set.
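The override rule can be sketched in plain Python. The `resolve_cookie` helper below is purely illustrative (it is not part of the scrape_linkedin API); it just shows the precedence: an explicitly passed cookie wins over the LI_AT environment variable.

```python
import os

def resolve_cookie(explicit_cookie=None):
    # Illustrative helper (not part of scrape_linkedin): a cookie passed
    # directly takes precedence; otherwise fall back to the LI_AT env var.
    if explicit_cookie is not None:
        return explicit_cookie
    return os.environ.get('LI_AT')

os.environ['LI_AT'] = 'env_cookie'
print(resolve_cookie())                 # env_cookie
print(resolve_cookie('direct_cookie'))  # direct_cookie
```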
See /examples for more usage examples.
scrape_linkedin comes with a command line interface, scrapeli, built using click.
Note: the CLI currently works only with personal profiles.
Options: run scrapeli --help for the full list of options.
Examples:

```
$ scrapeli --user=austinoboyle
$ scrapeli --user=austinoboyle -a skills
$ scrapeli -i /path/file.html -o output.json
```
Use the ProfileScraper component to scrape profiles.
```python
from scrape_linkedin import ProfileScraper

with ProfileScraper() as scraper:
    profile = scraper.scrape(user='austinoboyle')
print(profile.to_dict())
```
Profile: the class whose properties expose all information pulled from a profile. It also has a to_dict() method that returns all of the data as a dict.
```python
from scrape_linkedin import Profile

with open('profile.html', 'r') as profile_file:
    profile = Profile(profile_file.read())

print(profile.skills)
# [{...}, {...}, ...]
print(profile.experiences)
# {'jobs': [...], 'volunteering': [...], ...}
print(profile.to_dict())
# {'personal_info': {...}, 'experiences': {...}, ...}
```
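Because to_dict() returns plain Python dicts and lists, the scraped data can be written straight to JSON with the standard library. The `profile_data` value below is a hypothetical stand-in for what a real scrape would return, shaped like the structure shown above.

```python
import json

# Hypothetical data in the shape Profile.to_dict() returns (illustrative only)
profile_data = {
    'personal_info': {'name': 'Jane Doe'},
    'experiences': {'jobs': [], 'volunteering': []},
    'skills': [{'name': 'Python'}],
}

# Serialize the scraped structure to a JSON file
with open('profile.json', 'w') as f:
    json.dump(profile_data, f, indent=2)
```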
Structure of the fields scraped
Use the CompanyScraper component to scrape companies.
```python
from scrape_linkedin import CompanyScraper

with CompanyScraper() as scraper:
    company = scraper.scrape(company='facebook')
print(company.to_dict())
```
Company: the class whose properties expose all information pulled from a company page. It has three properties: overview, jobs, and life; overview is the only one currently implemented.
```python
from scrape_linkedin import Company

with open('overview.html', 'r') as overview, \
     open('jobs.html', 'r') as jobs, \
     open('life.html', 'r') as life:
    company = Company(overview.read(), jobs.read(), life.read())

print(company.overview)
# {...}
```
Structure of the fields scraped
Pass these keyword arguments to your Scraper's constructor to override the default values. For example, you may want to increase the timeout if your internet connection is slow, or decrease it if it is fast.
- cookie {str}: li_at cookie value (overrides the environment variable). Default: None
- driver {selenium.webdriver}: driver type to use. Default: selenium.webdriver.Chrome
- driver_options {dict}: keyword arguments to pass to the driver constructor. Default: {}
- scroll_pause {float}: time in seconds to pause between scroll increments. Default: 0.1
- scroll_increment {int}: number of pixels to scroll down each time. Default: 300
- timeout {float}: default time to wait for async content to load. Default: 10
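To see how the scroll-pause and scroll-increment settings interact, here is a rough sketch of incremental scrolling against a fake page. It is illustrative only (not the library's actual implementation), and the function and parameter names are hypothetical.

```python
import time

def scroll_page(page_height, scroll_by, scroll_increment=300, scroll_pause=0.1):
    # Illustrative sketch (not scrape_linkedin's code): scroll down by
    # scroll_increment pixels per step, pausing scroll_pause seconds each
    # time so lazily-loaded content has a chance to appear.
    position = 0
    steps = 0
    while position < page_height:
        scroll_by(scroll_increment)
        position += scroll_increment
        steps += 1
        time.sleep(scroll_pause)
    return steps

# A 1000px page scrolled 300px at a time takes 4 steps
print(scroll_page(1000, lambda px: None, scroll_pause=0))  # 4
```

Larger scroll_increment values finish faster but give async content less time to load; a longer scroll_pause is safer on slow connections.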
New in version 0.2: built-in parallel scraping functionality. Note that the up-front cost of starting a browser session is high, so this is only beneficial when scraping many (> 15) profiles.
```python
from scrape_linkedin import scrape_in_parallel, CompanyScraper

companies = ['facebook', 'google', 'amazon', 'microsoft', ...]

# Scrape all companies, output to 'companies.json', use 4 browser instances
scrape_in_parallel(
    scraper_type=CompanyScraper,
    items=companies,
    output_file="companies.json",
    num_instances=4
)
```
Parameters:

- scraper_type {scrape_linkedin.Scraper}: scraper class to use
- items {list}: list of items to be scraped
- output_file {str}: path to the output file
- num_instances {int}: number of parallel Selenium instances to run
- temp_dir {str}: name of the temporary directory used to store data from intermediate steps
- driver_options {dict}: dict of keyword arguments to pass to the driver function
- {any}: extra keyword arguments to pass to the scraper_type constructor for each job

Report bugs and feature requests here.
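Conceptually, scrape_in_parallel fans the item list out across worker instances and merges the results. A minimal thread-based sketch of that fan-out/merge pattern is below; it is illustrative only (the library itself launches real browser sessions, which is why the per-instance start-up cost is high), and `parallel_scrape_sketch` is a hypothetical helper, not part of the API.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_scrape_sketch(items, scrape_one, num_instances=4):
    # Illustrative sketch: run scrape_one on each item across a pool of
    # num_instances workers, then merge the results keyed by item.
    with ThreadPoolExecutor(max_workers=num_instances) as pool:
        results = pool.map(scrape_one, items)
    return dict(zip(items, results))

companies = ['facebook', 'google']
data = parallel_scrape_sketch(companies, lambda name: {'company': name})
print(data['facebook'])  # {'company': 'facebook'}
```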