PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License
1.81k stars, 212 forks

Glassdoor issues #73

Closed thebigG closed 4 years ago

thebigG commented 4 years ago

Glassdoor issues

Hello everyone, hope you're all doing well!

Description

As you guys are probably aware, the Glassdoor scraper stopped working; this was filed as issue #72. Thanks to @studentbrad, who came to the rescue, we have been able to come up with a solution. While working on this, I discovered that Glassdoor presents a CAPTCHA that the user has to complete in the browser window that the Selenium web driver opens. I decided to tackle this by pausing the program with an input() call and giving the user explicit instructions for filling out the CAPTCHA. While this does solve the problem, it makes scraping Glassdoor more cumbersome for users than the non-Selenium way. I tried my best to streamline the process, but please let me know if it could be better!

Another issue with this approach is testing, specifically testing on TravisCI. To the best of my knowledge, not every Selenium driver can run "headless" (no GUI); I believe Chrome has a headless mode, but I'm not sure about the rest of the browsers. However, even with a headless web driver, we would still need a person to complete the CAPTCHA. This is the very reason why I have disabled Glassdoor in the providers section of the JobFunnel/demo/settings.yaml and /JobFunnel/jobfunnel/config/settings.yaml files: it prevents TravisCI from hanging forever on the input() call when testing Glassdoor.
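For reviewers who want to picture the pause-for-CAPTCHA step, here is a minimal sketch of the idea rather than the code in this PR. The names `CAPTCHA_PROMPT`, `is_interactive`, and `wait_for_captcha` are hypothetical. The guard assumes the CI service exports `CI=true` (Travis CI does), so a test run could fail fast instead of blocking on input(); the PR itself avoids the hang by disabling the Glassdoor provider in the settings files.

```python
import os
import sys

# Hypothetical prompt text; the wording used in the PR may differ.
CAPTCHA_PROMPT = (
    "Glassdoor is asking for a CAPTCHA. Complete it in the browser window "
    "that just opened, then press Enter here to continue scraping: "
)


def is_interactive(env=None, stdin=None):
    """Return False under CI or when no terminal is attached to stdin."""
    env = os.environ if env is None else env
    stdin = sys.stdin if stdin is None else stdin
    # Assumption: the CI service sets CI=true in the environment.
    if env.get("CI", "").lower() == "true":
        return False
    return bool(stdin is not None and stdin.isatty())


def wait_for_captcha(prompt=input, interactive=None):
    """Block until the user confirms that the CAPTCHA has been solved."""
    if interactive is None:
        interactive = is_interactive()
    if not interactive:
        # Fail fast instead of hanging forever on input().
        raise RuntimeError("CAPTCHA needs a human; no interactive terminal.")
    prompt(CAPTCHA_PROMPT)
```

A scraper would call `wait_for_captcha()` right after the driver loads the Glassdoor page and before it reads the page source.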

Talking more technically, I took the liberty of adding webdriver_manager to the project so that we don't have to worry about managing binary/executable webdriver files for every browser and platform. I also moved all of the webdriver_manager and Selenium functionality to tools.py. If you guys are OK with this change, I will add tests covering this new functionality in tools.py in the future. I have also added a note to the README that lets users know about this functionality regarding Glassdoor.
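As a rough illustration of what webdriver_manager buys us, here is a sketch of trying browsers in turn and letting the manager download the matching driver binary. It is not the exact code in tools.py: `make_chrome`, `make_firefox`, and `first_available_driver` are hypothetical names, and the `service=Service(...)` form assumes Selenium 4 (older Selenium versions took the driver path directly).

```python
def make_chrome():
    """Start Chrome with a driver binary fetched by webdriver_manager."""
    # Imports are local so the module loads even if Selenium is missing.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    return webdriver.Chrome(service=Service(ChromeDriverManager().install()))


def make_firefox():
    """Start Firefox with a geckodriver fetched by webdriver_manager."""
    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service
    from webdriver_manager.firefox import GeckoDriverManager

    return webdriver.Firefox(service=Service(GeckoDriverManager().install()))


def first_available_driver(factories=(make_chrome, make_firefox)):
    """Return the first driver that starts; collect errors from the rest."""
    errors = []
    for factory in factories:
        try:
            return factory()
        except Exception as exc:  # missing browser, missing package, etc.
            errors.append(f"{factory.__name__}: {exc}")
    raise RuntimeError("No usable webdriver found:\n" + "\n".join(errors))
```

Keeping each browser behind its own small factory function should also make this easy to cover with tests, since fake factories can stand in for real browsers.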

Hope this is clear enough.

List any developers that will be affected or those who you had merge conflicts with.

Context of change

Please add options that are relevant and mark any boxes that apply.

Type of change

Please mark any boxes that apply.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. NOTE: Make sure you uncomment GlassDoor in your settings.yaml file to test out the Glassdoor scraper. Please also list any relevant details for your test configuration.

Checklist:

Please mark any boxes that have been completed.

studentbrad commented 4 years ago

You're the man Lorenzo :sunglasses: nice work!

thebigG commented 4 years ago

Would like to say thank you again to @studentbrad! Couldn't have fixed this without him!