Open arkkanoid opened 7 years ago
Thanks for your comments
Indeed, LinkedIn has a really aggressive anti-scraping policy. I think LinkedIn is blacklisting the IP blocks of cloud providers (AWS, DigitalOcean ...). Even with residential IPs, it seems that your IP gets banned after 2 or 3 requests.
Just try a simple requests.get('https://www.linkedin.com/in/kennethreitz')
and look at the response and status code with the proxy. LinkedIn has a custom 999 status code for scraping.
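The check described above can be sketched roughly as follows. This is a minimal sketch, not the project's own code: the helper names (`build_proxies`, `describe_status`, `check_profile`) and the proxy URL in the usage note are made up for illustration, and it assumes the `requests` library is installed.

```python
import requests

# Non-standard status code that LinkedIn returns for requests it treats as scraping.
LINKEDIN_BLOCKED_STATUS = 999


def build_proxies(proxy_url):
    """Return the proxies mapping that requests expects, or None for a direct connection."""
    if not proxy_url:
        return None
    return {"http": proxy_url, "https": proxy_url}


def describe_status(status_code):
    """Turn a status code into a short human-readable diagnosis."""
    if status_code == LINKEDIN_BLOCKED_STATUS:
        return "blocked (999): LinkedIn rejected the request as scraping"
    if status_code == 200:
        return "ok (200)"
    return f"other ({status_code})"


def check_profile(url, proxy_url=None, timeout=10):
    """Fetch the URL once, optionally through a proxy, and report what came back."""
    response = requests.get(url, proxies=build_proxies(proxy_url), timeout=timeout)
    return describe_status(response.status_code)
```

For example, `print(check_profile('https://www.linkedin.com/in/kennethreitz', proxy_url='http://proxy.example.com:8080'))` (a placeholder proxy address) shows whether that proxy gets the 999 response. A 200 only means the request was not rejected outright; the body may still be a login page rather than the profile.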
It's curious: with different proxies I always get a 999 error. After getting a cookie from a web browser and adding it to your code, it works with the same proxies as before. So LinkedIn probably detects that the request doesn't come from a web browser...
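Roughly, reusing a browser cookie with `requests` could look like the sketch below. It is only an illustration under assumptions: the helper names are hypothetical, the cookie string is a placeholder for whatever you copy from your browser's developer tools, and sending a browser-like User-Agent is an extra guess that was not tested in this thread.

```python
import requests


def parse_cookie_header(cookie_header):
    """Split a 'name=value; name2=value2' string copied from the browser into a dict."""
    cookies = {}
    for part in cookie_header.split(";"):
        name, sep, value = part.strip().partition("=")
        if sep and name:  # skip empty or malformed fragments
            cookies[name] = value
    return cookies


def make_session(cookie_header, user_agent=None):
    """Build a requests.Session that sends the given cookies on every request."""
    session = requests.Session()
    if user_agent:
        # Assumption: a browser-like User-Agent may also matter; this is unverified.
        session.headers["User-Agent"] = user_agent
    session.cookies.update(parse_cookie_header(cookie_header))
    return session
```

A call such as `make_session('name=value; other=value2').get(url, proxies=proxies)` then sends those cookies through the same proxies as before, so the only thing that changes between the failing and the working request is the cookie.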
Interesting! Did you pass your own session cookie (the one linked to your LinkedIn profile)? I know LinkedIn can deactivate your account if you scrape while logged in (using selenium, phantomjs ...). I also didn't test the code with proxies, but the issue doesn't seem to come from the code; it likely comes from LinkedIn itself.
I used an anonymous session cookie and could scrape ~50 profiles with that session. Maybe they use JavaScript to detect whether the request comes from a browser.
I am having the same problem, but I worry that it is on LinkedIn's side.
If I just open Firefox (no LinkedIn cookies) and go to the test profile (https://www.linkedin.com/in/kennethreitz), it asks for my credentials. I do not pass any login cookies with the request (maybe it would work if I did?), but I think there is a limit to that. As soon as LinkedIn detects a user viewing even 50 profiles/hour, they may blacklist that user.
Any thoughts on this?
@ericfourrier I've had similar problems, and the LinkedIn authwalls were persistent. I find your work very helpful. I currently use it over the proxycrawl API to avoid blocks: instead of making a GET request to LinkedIn directly, I make a request to that API, which gives me the data. That is the only way I could get it to work right now. Thanks for the tool.
Hi, first of all, thanks for this amazing code. I'm trying to use it with a proxy, but 100% of the time it says that the IP is blacklisted. Does this happen to you too?
My code: