austinoboyle / scrape-linkedin-selenium

`scrape_linkedin` is a Python package that allows you to scrape personal LinkedIn profiles and company pages, turning the data into structured JSON.
MIT License
454 stars, 162 forks

BP for Running on Remote Server and not getting caught by LinkedIn #22

Closed dsc03 closed 5 years ago

dsc03 commented 5 years ago

Hey Austin,

Thanks for responding to my previous comment. I was able to get the scraper to work on a remote server. However, once I started running it remotely, LinkedIn caught on and started blocking me.

I was wondering if you had any best practices for either bypassing this or preventing it?

Currently, I'm running this with headless Chrome and the following options:

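        from selenium.webdriver.chrome.options import Options
        from scrape_linkedin import CompanyScraper

        # Flags for running Chrome without a display on a remote Linux server
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')  # disable Chrome's sandbox
        chrome_options.add_argument('--disable-dev-shm-usage')  # don't rely on a small /dev/shm

        with CompanyScraper(driver_options={'chrome_options': chrome_options}) as scraper:
            ...  # scraping calls omitted in the original post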

Let me know if you have tips or suggestions. Anything would be appreciated.

-Daniel

austinoboyle commented 5 years ago

Hmmmm, can you show me an example of an error where you were blocked? If they were sending their 999 response code, my guess would be that they block IP address ranges from cloud computing providers like AWS. How aggressive were the scrapes that you were running?
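
To make the 999 case easy to tell apart from other failures, a status code can be classified explicitly. This is only a minimal sketch: the helper name `classify_status` is hypothetical and not part of `scrape_linkedin`, and since Selenium does not expose response status codes directly, the code would have to come from a separate HTTP client or a proxy log.

```python
LINKEDIN_BLOCK_STATUS = 999  # non-standard code mentioned above


def classify_status(status_code: int) -> str:
    """Roughly classify an HTTP status code seen on a LinkedIn request."""
    if status_code == LINKEDIN_BLOCK_STATUS:
        return "blocked"  # LinkedIn refused the request outright
    if 200 <= status_code < 300:
        return "ok"       # page was served normally
    return "other"        # redirects, auth failures, server errors, etc.
```

A "blocked" result would point toward an IP-level block rather than a cookie or timeout problem.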

dsc03 commented 5 years ago

So the error is this:

ValueError: Took too long to load company.  Common problems/solutions:
                1. Invalid LI_AT value: ensure that yours is correct (they
                   update frequently)
                2. Slow Internet: increase the timeout parameter in the Scraper constructor
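
The second suggestion in this message (a larger timeout) can be tried automatically. Below is a rough sketch, not something the package provides: `scrape_with_longer_timeouts` and `scrape_fn` are hypothetical names, and `scrape_fn` is assumed to be any callable that takes a timeout in seconds and raises `ValueError` when the page takes too long, as in the traceback above.

```python
def scrape_with_longer_timeouts(scrape_fn, timeouts=(10, 30, 60)):
    """Call scrape_fn(timeout) with each timeout in turn until one succeeds."""
    if not timeouts:
        raise ValueError("timeouts must contain at least one value")
    last_error = None
    for timeout in timeouts:
        try:
            return scrape_fn(timeout)
        except ValueError as err:  # "Took too long to load ..." style failures
            last_error = err
    # Every attempt failed: surface the most recent error to the caller.
    raise last_error
```

With `scrape_linkedin`, `scrape_fn` would presumably wrap a `CompanyScraper(timeout=timeout)` block. If every timeout fails, a stale LI_AT cookie or a block becomes the more likely explanation.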

I was doing very low volume, no more than 50 companies a day.
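
For reference, a volume like this can be spread across the day instead of being sent in a burst. The sketch below is purely illustrative: `jittered_delays` is a hypothetical helper, the one-day window and the 50% jitter are assumptions, and nothing in this thread establishes that such pacing affects LinkedIn's blocking.

```python
import random


def jittered_delays(n_requests, window_seconds=24 * 60 * 60, jitter=0.5, rng=None):
    """Return n_requests wait times (seconds), each within +/- jitter of even spacing."""
    if n_requests <= 0:
        raise ValueError("n_requests must be positive")
    rng = rng or random.Random()
    base = window_seconds / n_requests  # even spacing across the window
    return [base * rng.uniform(1 - jitter, 1 + jitter) for _ in range(n_requests)]
```

For 50 companies over one day the even spacing is 1728 seconds, so each wait would fall between 864 and 2592 seconds; sleeping for each value with `time.sleep` between scrapes would apply it.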

austinoboyle commented 5 years ago

Are you sure that the LI_AT value didn't just expire? It has happened to me from time to time.

dsc03 commented 5 years ago

So I just refreshed the LI_AT cookie and tried running it, but I'm still getting blocked.

dsc03 commented 5 years ago

What's strange is that it's not actually refreshing my cookie on LinkedIn.

dsc03 commented 5 years ago

In fact, I realized it's not an issue with the cookie, because I stopped using a headless driver. I opened the Chrome console while Selenium was running and saw the following errors when it got to the LinkedIn company page.

[Screenshot: Chrome console errors, 2018-09-19 1:53:15 PM]

I searched the second error on StackOverflow, and it seems to be a CORS issue.

austinoboyle commented 5 years ago

Can you show me the code that produces this error? A CORS error would indicate to me that this is a LinkedIn problem, but it's also possible that the error in the console is not related to the error on the page.

dsc03 commented 5 years ago

I was just replying to you haha. So I'm not sure what changed, but I'm not experiencing the same issues anymore (even using the same cookie). I plan to do some more digging to figure out what was going on, but looking back, I don't think those errors were actually related. I'll message you once I figure it out if I think it'll help others in the future!

Thanks for all your help. Really appreciate it.

austinoboyle commented 5 years ago

No problem! If you like my package and find it useful, please give it a star. I'm going to close this issue for now; feel free to re-open it if the problem comes back.