PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License

Glassdoor.com is not working #72

Closed thebigG closed 4 years ago

thebigG commented 4 years ago

Issue Template

Description

Just today I discovered that JobFunnel fails when scraping Glassdoor.com.

Steps to Reproduce

  1. Comment out Indeed and Monster from the providers option in settings.yaml, like so:
        # - 'Indeed'
        # - 'Monster'
        - 'GlassDoor'
  2. Run JobFunnel: funnel -s settings.yaml

Expected behavior

Scrape Glassdoor.com and store jobs in master_list.csv

Actual behavior

JobFunnel output:

jobfunnel initialized at 2020-05-05
no master-list, filter-list was not updated
jobfunnel glassdoor to pickle running @ 2020-05-05
failed to scrape GlassDoor: 'NoneType' object has no attribute 'text'
Traceback (most recent call last):
  File "/usr/local/bin/funnel", line 11, in <module>
    load_entry_point('JobFunnel==2.1.6', 'console_scripts', 'funnel')()
  File "/usr/local/lib/python3.6/dist-packages/jobfunnel/__main__.py", line 55, in main
    jf.update_masterlist()
  File "/usr/local/lib/python3.6/dist-packages/jobfunnel/jobfunnel.py", line 291, in update_masterlist
    raise ValueError('No scraped jobs, cannot update masterlist')
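
The `'NoneType' object has no attribute 'text'` message is what Python raises when `.text` is read from a lookup that matched nothing. Below is a minimal sketch of that failure mode and a defensive guard. It uses the standard library's `xml.etree.ElementTree` as a stand-in for BeautifulSoup (whose `find` also returns `None` when nothing matches); the markup and the `text_or_none` helper are hypothetical, not JobFunnel code.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for a page that lacks the expected element.
page = ET.fromstring("<html><body><p>Please enable JavaScript</p></body></html>")


def text_or_none(root, path):
    """Return the text of the first element matching path, or None if absent."""
    node = root.find(path)  # find() returns None when nothing matches
    return node.text if node is not None else None


# Unguarded access fails the same way as the scraper log above.
try:
    page.find("body/h1").text
except AttributeError as err:
    message = str(err)

missing = text_or_none(page, "body/h1")  # element is absent
present = text_or_none(page, "body/p")   # element exists
```

A guard like this turns a missing element into a value the caller can check, which makes it easier to tell "the page layout changed" apart from "the page never rendered".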

Environment

I discovered this while inspecting glassdoor.py for testing. I will try my best to tackle this issue in the upcoming days. Hopefully we'll fix it soon!

Cheers!

studentbrad commented 4 years ago

I can confirm that Selenium solves this problem. Glassdoor seems to run JavaScript before rendering the page, which is why we can't get any HTML data. With a webdriver you can bring up the page with minimal effort, although it slows down scraping.

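from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import WebDriverException

# initialize the webdriver: try Chrome first, then fall back to Firefox
# (Selenium raises WebDriverException when the driver executable is missing)
try:
    self.driver = webdriver.Chrome()
except WebDriverException:
    try:
        self.driver = webdriver.Firefox()
    except WebDriverException:
        raise FileNotFoundError('Sorry, chromedriver or geckodriver must be installed to scrape')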
# get the search url
search = self.get_search_url()

# get the html data, initialize bs4 with lxml
self.driver.get(search)

# create the soup base
soup_base = BeautifulSoup(self.driver.page_source, self.bs4_parser)
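
The nested try/except above can also be written as a loop over candidate browsers. This is a sketch rather than what the branch does: `first_available_driver` and its parameters are hypothetical names, and the driver constructors are passed in as callables so the helper itself does not depend on Selenium.

```python
def first_available_driver(factories, expected_errors=(Exception,)):
    """Return (name, driver) from the first factory that starts successfully.

    factories: iterable of (name, zero-argument callable) pairs, tried in order.
    expected_errors: exception types that mean "this browser is unavailable".
    """
    failures = []
    for name, factory in factories:
        try:
            return name, factory()
        except expected_errors as err:
            # remember why this candidate failed, then try the next one
            failures.append(f"{name}: {err}")
    raise RuntimeError("no webdriver could be started (" + "; ".join(failures) + ")")
```

With Selenium installed, it could be called as `first_available_driver([("chrome", webdriver.Chrome), ("firefox", webdriver.Firefox)], expected_errors=(WebDriverException,))`. Adding a third browser then means adding one more pair.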

You must first implement the get method for Glassdoor, as I have done:

if method == 'get':
    # form job search url
    search = ('https://www.glassdoor.{0}/Job/jobs.htm?'
              'clickSource=searchBtn&sc.keyword={1}&locT=C&locId={2}&jobType=&radius={3}'.format(
        self.search_terms['region']['domain'],
        self.query,
        location_response[0]['locationId'],
        self.convert_radius(
            self.search_terms['region']['radius'])))
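
As an alternative to hand-formatting the query string, the same URL can be assembled with `urllib.parse.urlencode`. This is a sketch only: `build_search_url` is a hypothetical helper, the location ID `12345` is a made-up value, and it assumes the keyword is passed in raw (not already URL-encoded).

```python
from urllib.parse import urlencode


def build_search_url(domain, keyword, location_id, radius):
    """Assemble the job search URL with the same query parameters as above."""
    params = [
        ("clickSource", "searchBtn"),
        ("sc.keyword", keyword),
        ("locT", "C"),
        ("locId", location_id),
        ("jobType", ""),
        ("radius", radius),
    ]
    # urlencode percent-escapes values, so spaces and '&' in the keyword are safe
    return f"https://www.glassdoor.{domain}/Job/jobs.htm?{urlencode(params)}"


url = build_search_url("ca", "python developer", 12345, 25)
```

The parameter order matches the format string above, so the result should be equivalent for keywords without special characters.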

We can keep the scraping methods for the other providers unchanged and only change Glassdoor. If the user enables Glassdoor scraping in the YAML, we will have to warn them beforehand that chromedriver or geckodriver is required.
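
One way such a warning could be produced before any scraping starts is to look for the driver executables on `PATH` with `shutil.which`. This is a sketch under assumptions: `missing_driver_warning` is a hypothetical function, and `providers` is taken to be the list of provider names read from settings.yaml.

```python
import shutil


def missing_driver_warning(providers, which=shutil.which):
    """Return a warning string if GlassDoor is enabled but no driver is on PATH."""
    if "GlassDoor" not in providers:
        return None  # no browser needed for the other providers
    if any(which(exe) for exe in ("chromedriver", "geckodriver")):
        return None  # at least one driver executable was found
    return ("GlassDoor scraping needs chromedriver or geckodriver on PATH; "
            "install one or remove 'GlassDoor' from providers")
```

The `which` parameter exists only so the check can be exercised in tests without touching the real `PATH`.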

Check out my branch to see the changes: https://github.com/PaulMcInnis/JobFunnel/tree/studentbrad/glassdoor

thebigG commented 4 years ago

Thanks so much for this fix, @studentbrad! I tried it and got a captcha. I was able to "bypass" it by pausing the program with an input() call and solving the captcha by hand in the actual graphical browser window.

    def scrape(self):
        """function that scrapes job posting from glassdoor and pickles it"""
        log_info(f'jobfunnel glassdoor to pickle running @ {self.date_string}')

        # get the search url
        search = self.get_search_url()

        # get the html data, initialize bs4 with lxml
        self.driver.get(search)
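        # pause here so the captcha can be solved by hand in the browser window;
        # press Enter in the terminal to continue
        input()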

Obviously this will make the process a little more cumbersome for users, but I'm not sure if there is a better solution for this. If we were to merge this into the master branch, I don't think we'll be able to test glassdoor on TravisCI. What are your thoughts on this?
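
For what it's worth, the pause could be wrapped so that an unattended run (such as CI) does not hang on `input()`. This is a sketch, not something in either branch: `wait_for_captcha` and its parameters are hypothetical, and skipping the prompt does not solve the captcha, so the scrape would presumably still fail in that case.

```python
import sys


def wait_for_captcha(prompt=input, interactive=None):
    """Block until the user confirms the captcha is solved; skip when unattended."""
    if interactive is None:
        # assume a human is present only when stdin is a terminal
        interactive = sys.stdin.isatty()
    if not interactive:
        return False  # e.g. CI: nobody is there to press Enter
    prompt("Solve the captcha in the browser window, then press Enter... ")
    return True
```

The return value lets the caller decide whether to continue scraping or to skip Glassdoor with a logged message.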

studentbrad commented 4 years ago

Nice work on the input(), I didn't know about that. We will have to update the testing. We should be able to test in TravisCI using headless browser testing. It is definitely more cumbersome. Unfortunately I do not believe that there is a better method of doing this.
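
A headless Chrome session for CI could be set up roughly as follows. This is a sketch that assumes Selenium and chromedriver are installed on the CI machine; `make_headless_chrome` and `HEADLESS_CHROME_ARGS` are hypothetical names, and a headless browser gives no window in which to solve a captcha by hand.

```python
# Chrome command-line switches; --no-sandbox is often needed inside CI containers.
HEADLESS_CHROME_ARGS = ("--headless", "--no-sandbox")


def make_headless_chrome(extra_args=()):
    """Start Chrome without a visible window; requires selenium and chromedriver."""
    # imported lazily so this module can be loaded without selenium installed
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    for arg in (*HEADLESS_CHROME_ARGS, *extra_args):
        options.add_argument(arg)
    return webdriver.Chrome(options=options)
```

The returned driver is used like any other: call `driver.get(url)`, read `driver.page_source`, and call `driver.quit()` when done.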

thebigG commented 4 years ago

Will be merging your updated branch into https://github.com/thebigG/JobFunnel/tree/testing to start testing glassdoor.py. Will let you know if anything changes. Thanks so much for the prompt responses!

studentbrad commented 4 years ago

No problem! I’m really thankful that you’re willing to help out on this 😄

thebigG commented 4 years ago

This issue has been fixed and merged into master in commit 37040f2732792d7dca9f7a3be8aaf6b878310fce.