ericfourrier / scrape-linkedin

Scrape a public LinkedIn profile.
MIT License
153 stars 51 forks

Always blocked #7

Open arkkanoid opened 7 years ago

arkkanoid commented 7 years ago

Hi, first of all, thanks for this amazing code. I'm trying to use it with a proxy, but 100% of the time it says that the IP is blacklisted. Does this happen to you too?

My code:

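import requests
from pylinkedin.scraper import LinkedinItem
from pylinkedin.utils import CustomRequest

# The proxy lookup endpoint and its query parameters never change, so define them once.
PROXY_API = "http://gimmeproxy.com/api/getProxy"
PROXY_PARAMS = {"get": "true", "supportsHttps": "true", "anonymityLevel": "1", "protocol": "http"}

while True:
    # Ask gimmeproxy for a fresh HTTPS-capable proxy on every attempt.
    r = requests.get(PROXY_API, params=PROXY_PARAMS)
    proxy = r.json()['curl']
    # NOTE: `c` is never handed to LinkedinItem below, so it is unclear whether the scraper uses this proxy.
    c = CustomRequest(list_proxies=[{'https': proxy}])
    try:
        LinkedinItem(url='https://www.linkedin.com/in/kennethreitz')
    except Exception as exc:
        # Report the failure instead of silently swallowing it.
        print(exc)
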
ericfourrier commented 7 years ago

Thanks for your comments

Indeed, LinkedIn has a really aggressive anti-scraping policy. I think LinkedIn is blacklisting the IP blocks of cloud providers (AWS, DigitalOcean ...). Even with residential IPs, it seems that your IP gets banned after 2 or 3 requests.

Just try a simple requests.get('https://www.linkedin.com/in/kennethreitz') through the proxy and look at the response and status code. LinkedIn returns a custom 999 status code when it detects scraping.
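The check suggested above could look roughly like the sketch below. The helper names (`describe_status`, `probe`) and the `get` parameter are made up for illustration and are not part of pylinkedin. The only thing taken from this thread is that a 999 status means the request was blocked; a 2xx status only means it was not rejected by status code, and the body may still be a login wall.

```python
from typing import Callable, Optional, Tuple

# Non-standard status code that, per this thread, LinkedIn returns to scrapers.
LINKEDIN_BLOCK_STATUS = 999


def describe_status(status_code: int) -> str:
    """Map an HTTP status code to a coarse label (hypothetical helper)."""
    if status_code == LINKEDIN_BLOCK_STATUS:
        return "blocked"
    if 200 <= status_code < 300:
        # Not rejected by status code; the page may still be an authwall.
        return "not rejected"
    return "other"


def probe(url: str, proxy: Optional[str] = None,
          get: Optional[Callable] = None) -> Tuple[int, str]:
    """Fetch `url`, optionally through `proxy`, and label the status code.

    `get` defaults to requests.get; it is injectable so the helper can be
    exercised without network access.
    """
    if get is None:
        import requests  # imported lazily, only needed for real requests
        get = requests.get
    # Route both schemes through the same proxy when one is given.
    proxies = {"http": proxy, "https": proxy} if proxy else None
    response = get(url, proxies=proxies, timeout=10)
    return response.status_code, describe_status(response.status_code)
```

Calling `probe('https://www.linkedin.com/in/kennethreitz', proxy=proxy)` with the proxy string from the loop above would show whether that particular proxy is already blocked before involving the scraper at all.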

arkkanoid commented 6 years ago

It's curious: with different proxies I always get a 999 error. But after getting a cookie from a web browser and adding it to your code, it works with the same proxies as before. So LinkedIn probably detects that the request doesn't come from a web browser...
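A minimal sketch of that workaround, assuming the cookie is copied from the browser as one raw `Cookie` header string. The function name `browser_like_kwargs` and the default User-Agent are placeholders, not something pylinkedin provides, and how pylinkedin itself accepts extra headers is not shown in this thread.

```python
from typing import Dict, Optional

# Placeholder User-Agent in the usual Firefox format; substitute your browser's own.
DEFAULT_USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"
)


def browser_like_kwargs(raw_cookie: str,
                        user_agent: str = DEFAULT_USER_AGENT,
                        proxy: Optional[str] = None) -> Dict[str, object]:
    """Build keyword arguments for requests.get that mimic a browser request.

    `raw_cookie` is the value of the Cookie header as copied from the
    browser's developer tools, e.g. "name1=value1; name2=value2".
    """
    cookie = raw_cookie.strip()
    # Refuse line breaks so a pasted value cannot smuggle in extra headers.
    if "\n" in cookie or "\r" in cookie:
        raise ValueError("cookie string must be a single line")
    kwargs: Dict[str, object] = {
        "headers": {"Cookie": cookie, "User-Agent": user_agent},
        "timeout": 10,
    }
    if proxy:
        # Route both schemes through the same proxy.
        kwargs["proxies"] = {"http": proxy, "https": proxy}
    return kwargs
```

With requests installed, this would be used as `requests.get(url, **browser_like_kwargs(cookie, proxy=proxy))`.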

ericfourrier commented 6 years ago

Interesting! Did you pass your own session cookie (the one linked to your LinkedIn profile)? I know LinkedIn can deactivate your account if you scrape while logged in (using selenium, phantomjs ...). I also didn't test the code with proxies, but the issue doesn't seem to come from the code; it likely comes from LinkedIn itself.

arkkanoid commented 6 years ago

I used an anonymous session cookie and could scrape ~50 profiles with that session. Maybe they use JavaScript to detect whether the request comes from a browser.

DarkShineLights commented 6 years ago

I am having the same problem, but I worry that it is on LinkedIn's side.

If I just open Firefox (no LinkedIn cookies) and go to the test profile (https://www.linkedin.com/in/kennethreitz), it asks for my credentials. I do not pass any login cookies with the request (maybe it would work if I did?), but I think there is a limit to that as well. As soon as LinkedIn detects a user viewing even 50 profiles/hour, they may blacklist you as a user.
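If a per-hour limit like that exists, one way to stay under it is to space requests out. The sketch below is only an illustration: the `Throttle` class is hypothetical, and the actual threshold is unknown, so any budget passed to it is a guess rather than a documented LinkedIn limit.

```python
import time
from typing import Callable, Optional


def min_interval_seconds(max_per_hour: int) -> float:
    """Smallest gap between requests that keeps the rate at or below the budget."""
    if max_per_hour <= 0:
        raise ValueError("max_per_hour must be positive")
    return 3600.0 / max_per_hour


class Throttle:
    """Spaces calls evenly so that at most `max_per_hour` happen per hour."""

    def __init__(self, max_per_hour: int,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.interval = min_interval_seconds(max_per_hour)
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time the previous call was allowed

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.interval - now)
            if delay > 0:
                self._sleep(delay)
        self._last = now + delay
        return delay
```

Calling `throttle.wait()` before each profile request, with for example `Throttle(40)`, would leave 90 seconds between requests.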

Any thoughts on this?

rochenka commented 6 years ago

@ericfourrier I've had similar problems, and the LinkedIn authwalls were persistent. I find your work very helpful. I currently use it through the proxycrawl API to avoid blocks: instead of making a GET request to LinkedIn directly, I make a request to that API, which returns the data. That is the only way I have been able to do it so far. Thanks for the tool.