ericfourrier / scrape-linkedin

Scrape a public LinkedIn profile.
MIT License

pylinkedin

Introduction

pylinkedin is a Python package that scrapes the details of public LinkedIn profiles. It can also be used as a parser to turn the HTML of a LinkedIn profile into structured JSON.

Some precautions to take if you want to scrape LinkedIn with Python:

Installation

Install with pip

Run pip install git+https://github.com/ericfourrier/scrape-linkedin.git

Install from source

git clone https://github.com/ericfourrier/scrape-linkedin.git

Run python setup.py install

Tests

The tests run against a saved HTML file of a LinkedIn profile rather than a live page, mainly because Travis uses AWS machines and their IPs are banned by LinkedIn.

As a result, a passing test suite is not a good indicator that the package will work for you: your IP can be banned, or LinkedIn's HTML source may have changed.

You can still run the test suite from the root of the package with pytest: py.test test.py.

Using this package

Command line

pylinkedin comes with a simple command-line tool, also named pylinkedin.

Options:

Examples:

Python Package

It relies on two classes:

CustomRequest, which is a way to customise your HTTP requests by specifying a list of user-agents or proxies.

from pylinkedin.utils import CustomRequest
c = CustomRequest() # default with rotating proxies
c = CustomRequest(rotate_ua=False) # without rotating user-agent
c = CustomRequest(list_proxies=[{'https':'http://186.233.94.106:8080',
                                 'http':'http://186.233.94.106:8080'}]) # with a custom list of proxies
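
To illustrate what rotating user-agents and proxies means in general, here is a minimal sketch using only the standard library. It is not the package's implementation: the function pick_request_settings and the sample user-agent strings are hypothetical, and the sketch only shows the idea of drawing a different header and proxy for each request.

```python
import random

# Hypothetical sample values, for illustration only.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15) ExampleBrowser/2.0",
]
PROXIES = [
    {"http": "http://186.233.94.106:8080", "https": "http://186.233.94.106:8080"},
]


def pick_request_settings(user_agents, proxies=None, rng=random):
    """Return per-request settings: a random User-Agent and, if given, a random proxy."""
    settings = {"headers": {"User-Agent": rng.choice(user_agents)}}
    if proxies:
        # Each entry maps a URL scheme to the proxy that should handle it.
        settings["proxies"] = rng.choice(proxies)
    return settings


# Call this once per request so that consecutive requests can differ.
settings = pick_request_settings(USER_AGENTS, PROXIES)
```

The resulting dictionary has the shape of the headers and proxies keyword arguments accepted by requests.get, so it could be unpacked into such a call.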

LinkedinItem is the main class. You can instantiate it with the URL of a public profile using the url parameter, or with the HTML contents of the profile page using html_string. See test.py for an example of using a saved HTML file as input for the scraper.

from pylinkedin.scraper import LinkedinItem
l = LinkedinItem(url='https://www.linkedin.com/in/kennethreitz') # from the URL of a public profile
l = LinkedinItem(html_string=profile_string) # or from a string holding the HTML of the profile page

You can customize the requests made by LinkedinItem by passing it a CustomRequest instance:

c = CustomRequest(rotate_ua=True)
url_to_scrape = "https://www.linkedin.com/in/jeffweiner08"
l = LinkedinItem(url=url_to_scrape, crequest=c) # passing requests with rotating user-agent

To use html_string, make sure you browse to the public version of the profile page before saving it, as the private version will not work. The private version is the one that shows edit controls next to each section.
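
As a minimal sketch of the html_string workflow, the helpers below read a saved public-profile page from disk and pass its contents to LinkedinItem. The function names load_profile_html and build_item_from_file, and the assumption that the file was saved as UTF-8, are hypothetical and not part of the package.

```python
from pathlib import Path


def load_profile_html(path, encoding="utf-8"):
    """Read a saved public-profile page and return its HTML as a string."""
    return Path(path).read_text(encoding=encoding)


def build_item_from_file(path):
    """Build a LinkedinItem from a saved HTML file (requires pylinkedin)."""
    # Imported here so that load_profile_html stays usable without the package.
    from pylinkedin.scraper import LinkedinItem

    return LinkedinItem(html_string=load_profile_html(path))
```

With the package installed, build_item_from_file("profile.html") would then give an object exposing the fields described below, where profile.html is a placeholder for your own saved file.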

LinkedinItem gives access to the scraped information with the following syntax:

l.name # to get the name
l.skills # to get the skills
l.publications  # to get the publications
...
# the most important
l.to_dict() # to get all the info

Exhaustive list of the fields scraped

[volunteerings, last_name, number_recommendations, number_connections, current_location, honors, first_name, current_title, test_scores, current_industry, languages, similar_profiles, interests, profile_img_url, current_education, educations, experiences, groups, organizations, certifications, name, skills, websites, summary, project, courses, publications, recommendations]
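
Since the package is described as a way to get structured JSON, here is a small sketch of exporting a subset of these fields. It assumes that to_dict() returns a plain dictionary keyed by the field names above with JSON-serializable values; the helpers select_fields and profile_to_json and the sample data are hypothetical.

```python
import json


def select_fields(profile, fields):
    """Keep only the requested keys that are present in the profile dictionary."""
    return {key: profile[key] for key in fields if key in profile}


def profile_to_json(profile, fields=None):
    """Serialize a profile dictionary to JSON, optionally restricted to some fields."""
    data = select_fields(profile, fields) if fields is not None else profile
    return json.dumps(data, indent=2, sort_keys=True, ensure_ascii=False)


# Hypothetical stand-in for the result of l.to_dict(); the values are made up.
example_profile = {"name": "Jane Doe", "skills": ["Python", "SQL"], "summary": None}
print(profile_to_json(example_profile, fields=["name", "skills"]))
```

With a real profile, example_profile would be replaced by l.to_dict().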

Issues

Package is not actively maintained.

You can post bugs and issues here.