Closed thebigG closed 4 years ago
I can confirm that Selenium is a solution to this problem. Glassdoor seems to run JavaScript before bringing up the page, which is why a plain request does not get any HTML data. Using a webdriver you can bring up the page with minimal effort, but it slows down the scraping process.
    from selenium import webdriver

    # initialize the webdriver
    try:
        self.driver = webdriver.Chrome()
    except FileNotFoundError:
        try:
            self.driver = webdriver.Firefox()
        except FileNotFoundError:
            raise FileNotFoundError('Sorry, chromedriver or geckodriver must be installed to scrape')
    # get the search url
    search = self.get_search_url()
    # get the html data, initialize bs4 with lxml
    self.driver.get(search)
    # create the soup base
    soup_base = BeautifulSoup(self.driver.page_source, self.bs4_parser)
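The Chrome-then-Firefox fallback in this snippet could also be written as a small helper. This is only a rough sketch, not the code on the branch: `first_available_driver` and `default_factories` are hypothetical names, and the helper catches `Exception` broadly on the assumption that the error raised for a missing driver binary may differ between Selenium versions.

```python
from typing import Any, Callable, Iterable, Tuple


def first_available_driver(
        factories: Iterable[Tuple[str, Callable[[], Any]]]) -> Any:
    """Return the first driver that starts; raise if none can be started."""
    errors = []
    for name, factory in factories:
        try:
            return factory()
        except Exception as err:  # broad on purpose (assumption, see above)
            errors.append(f'{name}: {err}')
    raise RuntimeError(
        'chromedriver or geckodriver must be installed to scrape ('
        + '; '.join(errors) + ')')


def default_factories():
    """Chrome first, then Firefox. Selenium is imported lazily here."""
    from selenium import webdriver
    return [('chrome', webdriver.Chrome), ('firefox', webdriver.Firefox)]
```

It would be called as `driver = first_available_driver(default_factories())`. Passing the factories in keeps the fallback logic testable without a browser installed.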
You first must implement the get method for glassdoor as I have done.
    if method == 'get':
        # form job search url
        search = ('https://www.glassdoor.{0}/Job/jobs.htm?'
                  'clickSource=searchBtn&sc.keyword={1}&locT=C&locId={2}&jobType=&radius={3}'.format(
                      self.search_terms['region']['domain'],
                      self.query,
                      location_response[0]['locationId'],
                      self.convert_radius(
                          self.search_terms['region']['radius'])))
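As a side note, the same GET URL could be assembled with `urllib.parse.urlencode`, which also escapes spaces and special characters in the query. A minimal sketch, assuming the parameter names from the snippet above; `build_glassdoor_search_url` is a hypothetical helper and the location id `12345` is a made-up placeholder, not a real Glassdoor id. Whether Glassdoor treats `+`-encoded keywords the same way is an assumption I have not checked.

```python
from urllib.parse import urlencode


def build_glassdoor_search_url(domain: str, keyword: str,
                               location_id: int, radius: int) -> str:
    """Build the job search URL with URL-encoded query parameters."""
    params = {
        'clickSource': 'searchBtn',
        'sc.keyword': keyword,
        'locT': 'C',
        'locId': location_id,
        'jobType': '',
        'radius': radius,
    }
    return f'https://www.glassdoor.{domain}/Job/jobs.htm?{urlencode(params)}'


# Placeholder values, for illustration only.
print(build_glassdoor_search_url('com', 'python developer', 12345, 25))
```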
We can keep other methods of scraping the same while changing glassdoor.
If the user enables scraping of glassdoor in the yaml, we will have to warn them beforehand that chromedriver or geckodriver is needed.
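One way such a warning could work is to look for the driver binaries on `PATH` before scraping starts. A rough sketch using only the standard library; `webdriver_warning` is a hypothetical helper, and it assumes the enabled providers from the yaml are available as a list of names.

```python
import shutil
from typing import Callable, Iterable, Optional

WEBDRIVER_BINARIES = ('chromedriver', 'geckodriver')


def webdriver_warning(
        providers: Iterable[str],
        which: Callable[[str], Optional[str]] = shutil.which) -> Optional[str]:
    """Return a warning if glassdoor is enabled but no driver is on PATH."""
    if 'glassdoor' not in {p.lower() for p in providers}:
        return None  # glassdoor not enabled, nothing to warn about
    if any(which(binary) for binary in WEBDRIVER_BINARIES):
        return None  # at least one driver binary was found
    return ('Glassdoor scraping needs chromedriver or geckodriver on PATH; '
            'neither was found.')
```

The caller would log the returned message, if any, before launching the scrape.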
Check out my branch to see the changes: https://github.com/PaulMcInnis/JobFunnel/tree/studentbrad/glassdoor
Thanks so much for this fix @studentbrad! I tried it and got a captcha. I was able to "bypass" it by pausing the program with an input() call and solving the captcha by hand in the actual graphical browser/driver.
    def scrape(self):
        """function that scrapes job posting from glassdoor and pickles it"""
        log_info(f'jobfunnel glassdoor to pickle running @ {self.date_string}')
        # get the search url
        search = self.get_search_url()
        # get the html data, initialize bs4 with lxml
        self.driver.get(search)
        # pause here so the captcha can be solved by hand in the browser
        input()
Obviously this will make the process a little more cumbersome for users, but I'm not sure if there is a better solution for this. If we were to merge this into the master branch, I don't think we'll be able to test glassdoor on TravisCI. What are your thoughts on this?
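To make the pause less of a problem on CI, it could be skipped when there is no terminal attached. A minimal sketch of that idea; `pause_for_captcha` is a hypothetical helper and is not part of the branch.

```python
import sys
from typing import Callable, Optional


def pause_for_captcha(prompt: Callable[[str], str] = input,
                      interactive: Optional[bool] = None) -> bool:
    """Block until the user confirms the captcha is solved.

    Returns False without blocking when the session is not interactive.
    """
    if interactive is None:
        # Assume a terminal on stdin means a human is available.
        interactive = sys.stdin is not None and sys.stdin.isatty()
    if not interactive:
        return False
    prompt('Solve the captcha in the browser window, then press Enter... ')
    return True
```

In `scrape()` the bare `input()` would become `pause_for_captcha()`. On a CI runner the call returns immediately, so the test would still have to cope with an unsolved captcha.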
Nice work on the input() call, I didn't know about that. We will have to update the tests. We should be able to test on TravisCI using a headless browser. It is definitely more cumbersome, but unfortunately I do not believe there is a better way of doing this.
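For reference, a headless Chrome driver for CI might be set up roughly like this. It is a sketch under assumptions: the flag list is a common choice for CI containers rather than something taken from the branch, `make_headless_chrome` is a hypothetical name, and I have not verified whether Glassdoor still serves a captcha to a headless browser.

```python
# Chrome switches often used on CI runners (assumed, not from the branch).
HEADLESS_ARGS = ('--headless', '--no-sandbox', '--disable-gpu')


def make_headless_chrome():
    """Start Chrome without a visible window. Requires chromedriver."""
    # Imported lazily so this module loads even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    for arg in HEADLESS_ARGS:
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```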
Will be merging your updated branch into https://github.com/thebigG/JobFunnel/tree/testing to start testing glassdoor.py.
Will let you know if anything changes.
Thanks so much for the prompt responses!
No problem! I’m really thankful that you’re willing to help out on this 😄
This issue has been fixed and merged into master in commit 37040f2732792d7dca9f7a3be8aaf6b878310fce.
Issue Template
Description
Just today I discovered that when scraping Glassdoor.com, JobFunnel fails.
Steps to Reproduce
settings.yaml
as such: funnel -s settings.yaml
Expected behavior
Scrape Glassdoor.com and store jobs in master_list.csv
Actual behavior
JobFunnel output:
Environment
I discovered this while inspecting glassdoor.py for testing. I will try my best to tackle this issue in the upcoming days. Hopefully we'll fix it soon!
Cheers!