As you guys are probably aware, the Glassdoor scraper stopped working; this is tracked in issue #72. Thanks to @studentbrad, who came to the rescue, we now have a solution. While working on it, I discovered that Glassdoor presents a CAPTCHA that the user has to complete in the browser window that the Selenium web driver opens. I decided to handle this by pausing the program with an `input()` call and giving the user explicit instructions for completing the CAPTCHA. This solves the problem, but it makes scraping Glassdoor a bit more cumbersome than the previous non-Selenium approach. I tried my best to streamline the process for users, but please let me know if it could be better!

Another issue with this approach is testing, specifically on TravisCI. To the best of my knowledge, not every Selenium driver can run "headless" (no GUI): Chrome and Firefox have a headless mode, but not all browsers do. Even with a headless web driver, though, we would still need a person to complete the CAPTCHA. That is why I have disabled Glassdoor in the providers section of `JobFunnel/demo/settings.yaml` and `/JobFunnel/jobfunnel/config/settings.yaml`: it prevents TravisCI from hanging forever on the `input()` call when testing Glassdoor.
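For reviewers who want a picture of the `input()` pause described above, here is a minimal sketch. It is not the code in this PR: the function name `load_page_with_captcha`, the `prompt` parameter, and the instruction text are all made up for illustration. It only assumes a driver object that exposes `get()` and `page_source`, as Selenium web drivers do.

```python
from typing import Callable

# Hypothetical instruction text; the wording in the PR may differ.
CAPTCHA_INSTRUCTIONS = (
    "A browser window has been opened.\n"
    "If Glassdoor shows a CAPTCHA there, please complete it,\n"
    "then come back to this terminal."
)


def load_page_with_captcha(driver, url: str,
                           prompt: Callable[[str], str] = input) -> str:
    """Open `url`, block until the user confirms, then return the page HTML.

    `prompt` defaults to the built-in input(), which blocks until the user
    presses Enter. Passing another callable avoids blocking in tests.
    """
    driver.get(url)              # opens the page in the driver's browser
    print(CAPTCHA_INSTRUCTIONS)  # tell the user what to do
    prompt("Press Enter once the CAPTCHA is solved: ")  # hold the program
    return driver.page_source    # HTML after the CAPTCHA has been cleared
```

Making the prompt injectable is one possible way to keep a CI run from hanging on the `input()` call, although a real Glassdoor request would still need a person to solve the CAPTCHA.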
On the more technical side, I took the liberty of adding `webdriver_manager` to the project so that we don't have to manage binary/executable webdriver files for every browser and platform. I also moved all of the `webdriver_manager` and Selenium functionality into `tools.py`. If you guys are OK with this change, I will add tests covering the new functionality in `tools.py` in the future. I have also added a note to the README that tells users about this Glassdoor behavior.
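As a rough illustration of what `webdriver_manager` buys us, the sketch below tries a couple of browsers in turn and returns the first driver that starts. It is an assumption-laden example, not the contents of `tools.py`: the function names are hypothetical, and it uses the Selenium 4 `Service` style, which may differ from the Selenium version this project pins.

```python
from typing import Any, Callable, Iterable


def chrome_factory() -> Any:
    # Imports are local so this module still loads without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager
    # install() downloads (or reuses a cached) chromedriver and returns its path.
    return webdriver.Chrome(service=Service(ChromeDriverManager().install()))


def firefox_factory() -> Any:
    from selenium import webdriver
    from selenium.webdriver.firefox.service import Service
    from webdriver_manager.firefox import GeckoDriverManager
    return webdriver.Firefox(service=Service(GeckoDriverManager().install()))


def first_available_driver(factories: Iterable[Callable[[], Any]]) -> Any:
    """Return the first driver whose factory succeeds (hypothetical helper)."""
    errors = []
    for factory in factories:
        try:
            return factory()
        except Exception as err:  # browser missing, download failed, etc.
            errors.append(f"{factory.__name__}: {err}")
    raise RuntimeError("No web driver could be started:\n" + "\n".join(errors))
```

A caller would use something like `driver = first_available_driver([chrome_factory, firefox_factory])` and call `driver.quit()` when finished.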
Hope this is clear enough.
List any developers that will be affected or those who you had merge conflicts with.
Context of change
Please add options that are relevant and mark any boxes that apply.
[x] Software (software that runs on the PC)
[ ] Library (library that runs on the PC)
[ ] Tool (tool that assists coding development)
[ ] Other
Type of change
Please mark any boxes that apply.
[x] Bug fix (non-breaking change which fixes an issue)
[ ] New feature (non-breaking change which adds functionality)
[ ] Breaking change (fix or feature that would cause existing functionality to not work as expected)
[x] This change requires a documentation update
How Has This Been Tested?
Please describe the tests that you ran to verify your changes.
NOTE: Make sure you uncomment GlassDoor in your settings.yaml file to test out the Glassdoor scraper.
Please also list any relevant details for your test configuration.
[x] Tested all scrapers on the Canada (.ca) domain
[x] Tested all scrapers on the United States (.com) domain
[x] Tested the entire pytest suite as a whole
Checklist:
Please mark any boxes that have been completed.
[x] I have performed a self-review of my own code.
[x] I have commented my code, particularly in hard-to-understand areas.
[x] I have made corresponding changes to the documentation.
[ ] My changes generate no new warnings.
[ ] I have added tests that prove my fix is effective or that my feature works.
[x] New and existing unit tests pass locally with my changes.
[x] Any dependent changes have been merged and published in downstream modules.
Glassdoor issues
Hello everyone, hope you're all doing well!