Bunsly / JobSpy

Jobs scraper library for LinkedIn, Indeed, Glassdoor & ZipRecruiter
https://usejobspy.com
MIT License
550 stars 108 forks

Enhancement - Scrape data when status code 429, instead of exiting scraping - LinkedIn #129

Open muzaT opened 3 months ago

muzaT commented 3 months ago

Hi @cullenwatson !

I was wondering whether we could add a simple check for status code 429. When the response is 429, the scraper should keep retrying the request until it receives status code 200, because when a 429 appears in the browser we can simply refresh and resume. This would help most with proxies that rotate automatically. This suggestion is specifically for LinkedIn.

Something like this, for use without auto-rotating proxies:

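import time
import requests

# url and headers are assumed to be defined by the surrounding scraper code
page = requests.get(url, headers=headers)
if page.status_code == 429:
    print(f"Error fetching page, Error: {page.status_code}")
    while True:
        time.sleep(5)  # assumed pause so retries do not hammer the server
        page = requests.get(url, headers=headers)
        if page.status_code == 200:
            break
        print("Retrying website!")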

If the proxy provider rotates automatically, we can add a new parameter such as "auto-rotating_proxy = True" and run something like this:

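import time
import requests

# url and headers are assumed to be defined by the surrounding scraper code
page = requests.get(url, headers=headers)
if page.status_code != 200:
    print(f"Error fetching page, Error: {page.status_code}")
    while True:
        time.sleep(5)  # assumed pause so retries do not hammer the server
        page = requests.get(url, headers=headers)
        if page.status_code == 200:
            break
        print("Retrying website!")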

This will change the proxy automatically on each re-attempt, whenever there is an error. I hope this helps the community and users.
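
A bounded variant of the retry idea above might look like the sketch below. It is only an illustration, not JobSpy code: the helper name `get_with_retry`, its parameters, and the default delays are all assumptions. It stops after a fixed number of attempts, doubles the wait each time, and uses the server's `Retry-After` header when that header is a plain number of seconds.

```python
import time
from typing import Callable


def get_with_retry(
    fetch: Callable[[], object],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    retry_statuses: frozenset = frozenset({429}),
    sleep: Callable[[float], None] = time.sleep,
):
    """Call fetch() until the status is not retryable or attempts run out.

    fetch must return an object with .status_code and .headers,
    such as a requests.Response. All names here are hypothetical.
    """
    response = fetch()
    for attempt in range(1, max_attempts):
        if response.status_code not in retry_statuses:
            break
        # Prefer the server's Retry-After hint when it is a number of seconds.
        hint = response.headers.get("Retry-After", "")
        delay = float(hint) if hint.isdigit() else base_delay * 2 ** (attempt - 1)
        sleep(delay)
        response = fetch()
    return response


# Possible usage with requests (url and headers defined elsewhere):
#   page = get_with_retry(lambda: requests.get(url, headers=headers))
```

The caller still has to check the final status code, because the helper returns the last response even when every attempt was rate limited. Passing `retry_statuses=frozenset({429, 500, 502, 503})` would approximate the "retry on any error" behaviour suggested for rotating proxies.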

ZacharyHampton commented 3 months ago

I think the right way to handle this is to hand the user the session, so they can handle the responses however they want using requests hooks. Agree?
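
For illustration, a response hook that retries on 429 could be sketched as follows. How JobSpy would expose its session is not specified in this thread, so the session in the usage comment is a plain `requests.Session`. The factory name `make_retry_hook` and its parameters are hypothetical. The hook resends through `response.connection` (the adapter that produced the response), the same approach requests' own digest-auth handler uses, so the hook is not triggered recursively.

```python
import time


def make_retry_hook(max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Build a requests response hook that resends a request after HTTP 429.

    Hypothetical helper: the name and defaults are assumptions, not JobSpy API.
    """

    def hook(response, *args, **kwargs):
        attempts = 0
        while response.status_code == 429 and attempts < max_retries:
            response.close()  # release the connection before resending
            sleep(base_delay * 2 ** attempts)  # 1s, 2s, 4s, ... by default
            # kwargs carries the send options requests passes to hooks
            # (stream, timeout, verify, cert, proxies).
            response = response.connection.send(response.request, **kwargs)
            attempts += 1
        return response  # a returned response replaces the original one

    return hook


# Possible usage on a session handed to the user:
#   session = requests.Session()
#   session.hooks["response"].append(make_retry_hook())
#   page = session.get(url, headers=headers)
```

With this design the library only needs to accept or return a session; the retry policy stays entirely in user code.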

muzaT commented 3 months ago

@ZacharyHampton Yes, that would work well, but a newbie might find it a bit difficult to handle. What we can do is provide both behaviours. We pass a parameter, something like (session_response): if it is true, the library handles retries automatically, similar to the solution I proposed, and if it is false, it lets the user handle the session. How does that sound?