PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License

ValueError: empty vocabulary #79

Closed phcreery closed 3 years ago

phcreery commented 4 years ago

Issue Template


Standard search produces web scrape error

Steps to Reproduce

Standard search with

Expected behavior

Results of query

Actual behavior

No loglevel

Traceback (most recent call last):
  File "C:\Users\phcre\AppData\Local\Programs\Python\Python38\Scripts\", line 11, in <module>
    load_entry_point('JobFunnel', 'console_scripts', 'funnel')()
  File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\", line 55, in main
  File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\", line 330, in update_masterlist
    tfidf_filter(self.scrape_data, masterlist)
  File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\tools\", line 118, in tfidf_filter
    duplicate_ids = tfidf_filter(cur_dict)
  File "c:\users\phcre\documents\jobs\jobfunnel\jobfunnel\tools\", line 90, in tfidf_filter
    similarities = cosine_similarity(vectorizer.fit_transform(query_words))
  File "c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\", line 1840, in fit_transform
    X = super().fit_transform(raw_documents)
  File "c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\", line 1198, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents,
  File "c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\", line 1129, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

query_words is empty, so vectorizer.fit_transform cannot build a vocabulary from it.
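A minimal sketch of a guard that could run before fit_transform, using only the standard library. It approximates scikit-learn's default token pattern (words of two or more characters) and ignores any stop-word filtering the vectorizer may be configured with. `has_vocabulary` is a hypothetical helper name, not something that exists in JobFunnel.

```python
import re

# Approximates scikit-learn's default token_pattern: words of 2+ characters.
TOKEN = re.compile(r"(?u)\b\w\w+\b")


def has_vocabulary(documents):
    """Return True if at least one document contains a usable token."""
    return any(TOKEN.search(doc) for doc in documents)


# Eight empty strings, matching the empty blurbs seen in this report.
query_words = ['', '', '', '', '', '', '', '']

if has_vocabulary(query_words):
    print("safe to call vectorizer.fit_transform(query_words)")
else:
    print("skipping TF-IDF filter: no job descriptions were scraped")
```

A check like this would turn the crash into a readable message, though it would not explain why the descriptions came back empty.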

Debug Loglevel

GET {} "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/source HTTP/1.1" 200 381722
Finished Request
Found 8 glassdoor results for query=Advertising-Marketing-Coordinator-Account-Agency
GET {} "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 200 144
Finished Request
getting glassdoor page 1 :,5_IC1139946_KE6,54.htm?radius=25
POST {"url": ",5_IC1139946_KE6,54.htm?radius=25"} "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 200 14
Finished Request
GET {} "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/source HTTP/1.1" 200 381666
Finished Request
DELETE {} "DELETE /session/02b7e485dd5ae5ae4fb5c16bf406267a/window HTTP/1.1" 200 14
Finished Request
found 8 unique job ids and 0 duplicates from glassdoor
removed 0 jobs present in filter-list
removed 0 jobs in blacklist from master-list
Calculating delay...
Done! Starting scrape!
delay of 0.00s, getting glassdoor search:
POST {"url": ""} "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 770
Finished Request
delay of 22.19s, getting glassdoor search:
POST {"url": ""} "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 22.34s, getting glassdoor search:
POST {"url": ""} "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 24.76s, getting glassdoor search:
POST {"url": ""} "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 27.24s, getting glassdoor search:
POST {"url": ""} "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 29.04s, getting glassdoor search:
POST {"url": ""} "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 29.64s, getting glassdoor search:
POST {"url": ""} "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
delay of 18.15s, getting glassdoor search:
POST {"url": ""} "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899
Finished Request
glassdoor scrape job took 173.619s
removed 0 jobs present in filter-list
removed 0 jobs in blacklist from master-list
removed 0 jobs present in filter-list
removed 0 jobs in blacklist from master-list
Traceback (most recent call last):
  File "C:\Users\phcre\AppData\Local\Programs\Python\Python38\Scripts\", line 11, in <module>
    load_entry_point('JobFunnel', 'console_scripts', 'funnel')()
  File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\", line 55, in main
  File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\", line 330, in update_masterlist
    tfidf_filter(self.scrape_data, masterlist)
  File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\tools\", line 118, in tfidf_filter
    duplicate_ids = tfidf_filter(cur_dict)
  File "c:\users\asdf\documents\jobs\jobfunnel\jobfunnel\tools\", line 90, in tfidf_filter
    similarities = cosine_similarity(vectorizer.fit_transform(query_words))
  File "c:\users\asdf\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\", line 1840, in fit_transform
    X = super().fit_transform(raw_documents)
  File "c:\users\asdf\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\", line 1198, in fit_transform
    vocabulary, X = self._count_vocab(raw_documents,
  File "c:\users\asdf\appdata\local\programs\python\python38\lib\site-packages\sklearn\feature_extraction\", line 1129, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words

webdriver manager returning 404 errors?
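One way to check that suspicion is to tally the status codes in the debug output. The sketch below assumes the log lines look like the ones pasted above (a quoted request followed by a status code and a size); `count_statuses` is a hypothetical helper written for this illustration.

```python
import re
from collections import Counter

# Matches the quoted request and the status code that follows it.
REQUEST = re.compile(r'"(GET|POST|DELETE) \S+ HTTP/1\.1" (\d{3}) \d+')


def count_statuses(log_lines):
    """Count (method, status) pairs found in WebDriver debug log lines."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST.search(line)
        if match:
            counts[(match.group(1), int(match.group(2)))] += 1
    return counts


sample = [
    'GET {} "GET /session/02b7e485dd5ae5ae4fb5c16bf406267a/source HTTP/1.1" 200 381666',
    'POST {"url": ""} "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 770',
    'POST {"url": ""} "POST /session/02b7e485dd5ae5ae4fb5c16bf406267a/url HTTP/1.1" 404 899',
    'Finished Request',
]
print(count_statuses(sample))
```

In the log above, every 404 follows a POST whose url field is an empty string, which suggests the scraper asked the browser to navigate to an empty link.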

Variable Contents




odict_values([{'status': 'new', 'title': 'Account Manager Digital Marketing - Professional Services - Entertainment and Media Industry Opportunity', 'company': 'Gannett', 'location': 'Plano, TX', 'date': '', 'blurb': '', 'tags': '', 'link': '', 'id': '3596513699', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Account Coordinator - Marketing', 'company': 'The Point Group', 'location': 'Dallas, TX', 'date': '', 'blurb': '', 'tags': '', 'link': '', 'id': '3593859227', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Marketing Coordinator', 'company': 'Gourmet Marketing LLC', 'location': 'Plano, TX', 'date': '', 'blurb': '', 'tags': '', 'link': '', 'id': '3319079566', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Account Coordinator - Client Service', 'company': 'RKD Group, Inc.', 'location': 'Richardson, TX', 'date': '', 'blurb': '', 'tags': '', 'link': '', 'id': '3582441465', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'COLLEGE GRADS & INTERNS - Entry Level Marketing & Advertising', 'company': 'Millennium Events Management', 'location': 'Dallas, TX', 'date': '', 'blurb': '', 'tags': '', 'link': '', 'id': '3584976096', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Senior Account Executive (Marketing/Advertising)', 'company': 'The Point Group', 'location': 'Dallas, TX', 'date': '', 'blurb': '', 'tags': '', 'link': '', 'id': '3579768726', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Account Coordinator - Client Service', 'company': 'RKD Group', 'location': 'Richardson, TX', 'date': '', 'blurb': '', 'tags': '', 'link': '', 'id': '3504589748', 'provider': 
'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}, {'status': 'new', 'title': 'Digital Account Coordinator', 'company': 'RKD Group', 'location': 'Richardson, TX', 'date': '', 'blurb': '', 'tags': '', 'link': '', 'id': '3543437733', 'provider': 'glassdoor', 'query': 'Advertising-Marketing-Coordinator-Account-Agency'}]) 


['3596513699', '3593859227', '3319079566', '3582441465', '3584976096', '3579768726', '3504589748', '3543437733']


['', '', '', '', '', '', '', '']


beautifulsoup4>=4.6.3 (4.9.1) lxml>=4.2.4 (4.5.1) requests>=2.19.1 (2.23.0) python-dateutil>=2.8.0 (2.8.1) PyYAML>=5.1 (5.3.1) scikit-learn>=0.21.2 (0.23.1) nltk>=3.4.1 (3.5) scipy>=1.4.1 (1.4.1) selenium>=3.141.0 (3.141.0) webdriver-manager>=2.4.0 (3.1.0) soupsieve>1.2 (2.0.1) certifi>=2017.4.17 (2020.4.5.2) urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (1.25.9) chardet<4,>=3.0.2 (3.0.4) idna<3,>=2.5 (2.9) six>=1.5 (1.15.0) threadpoolctl>=2.0.0 (2.1.0) joblib>=0.11 (0.15.1) numpy>=1.13.3 (1.18.5) click (7.1.2) tqdm(4.46.1) atomicwrites>=1.0; (1.4.0) packaging (20.4) pluggy<1.0,>=0.12 (0.13.1)

PaulMcInnis commented 4 years ago

Thanks for this great bug report!

Looks like we've broken something. Will advise you to test again once #77 and #75 go in.

thebigG commented 4 years ago

I'm trying to reproduce this issue on Windows 10, but haven't been able to. JobFunnel seems to work fine on my Windows 10 installation. Could you give us more details on how you installed Python and pip? If you don't mind, could you also share any command-line arguments you passed when running JobFunnel?

phcreery commented 4 years ago

Yea, no problem.

> funnel -s ..\mel\settings.yaml --log_level debug

> python3 --version
Python 3.8.3

> pip3 --version
pip 19.2.3 from c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\pip (python 3.8)


# All paths are relative to this file.

# Paths.
# place the search right next to this file
output_path: './'

# Providers from which to search (case insensitive)
  - 'Indeed'
  - 'Monster'
  - 'GlassDoor' # This used to take ~10x longer to run than the other providers

# Filters.
    province: 'TX'
    city:     'Allen'
    domain:   'com'
    radius:   30

    - 'Advertising'
    - 'Marketing'
    - 'Coordinator'
    - 'Account'
    - 'Agency'

  - 'Sales'
  - 'Media'
  - 'Digital'
  - 'Social'

# Logging level options are: critical, error, warning, info, debug, notset
log_level: 'info'

# Saves duplicates removed by tfidf filter to duplicate_list.csv
save_duplicates: False

# Turn on or off delaying
# set_delay: True 

# Delaying algorithm configuration
    # Functions used for delaying algorithm, options are: constant, linear, sigmoid
    function: 'linear'
    # Maximum delay/upper bound for converging random delay
    delay: 30
    # Minimum delay/lower bound for random delay  
    min_delay: 15 
    # Random delay
    random: True 
    # Converging random delay, only used if 'random' is set to True
    converge: True

thebigG commented 4 years ago

Awesome! Thanks so much for all these details! Will investigate and get back to you as soon as I can (likely towards the evening).

thebigG commented 4 years ago

As a quick and dirty fix, could you comment out the GlassDoor scraper in the settings file? You can do that by adding a # in front of the GlassDoor entry. So basically make this line

- 'GlassDoor' # This used to take ~10x longer to run than the other providers

in the settings file look like this:

#- 'GlassDoor' # This used to take ~10x longer to run than the other providers

Does that fix it?
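If you would rather script that edit than do it by hand, here is a rough sketch using only the standard library. It treats the settings file as plain text and prefixes the matching list entry with a #; `comment_out_provider` is a hypothetical helper, not part of JobFunnel, and it assumes each provider sits on its own "- 'Name'" line as in the excerpt above.

```python
import re


def comment_out_provider(settings_text, provider):
    """Prefix the list entry for `provider` with '#', leaving other lines as-is."""
    pattern = re.compile(
        r"^([ \t]*)(-[ \t]*['\"]?" + re.escape(provider) + r"['\"]?(?:[ \t].*)?)$",
        re.IGNORECASE | re.MULTILINE,
    )
    return pattern.sub(r"\1#\2", settings_text)


settings = """  - 'Indeed'
  - 'Monster'
  - 'GlassDoor' # This used to take ~10x longer to run than the other providers
"""
print(comment_out_provider(settings, "GlassDoor"))
```

The match is case-insensitive because provider names are, and it requires the name to end at the closing quote so that a longer name starting with the same letters is left alone.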

thebigG commented 4 years ago

I was able to reproduce it! I can confirm this bug only affects the GlassDoorDynamic scraper, not GlassDoorStatic. @phcreery Follow these steps to fix your problem for good:

  1. pip3 install --upgrade JobFunnel
  2. Make sure that you have JobFunnel 2.1.8 by typing pip3 show JobFunnel into the command line/PowerShell.
  3. For this version we changed things a bit for the GlassDoor scraper. From 2.1.8 onward, there are two versions of the GlassDoor scraper: GlassDoorDynamic and GlassDoorStatic. They scrape GlassDoor in different ways, but do the same thing. Because of this, the way you specify GlassDoor in settings.yaml changes slightly. Instead of having:
  - 'Indeed'
  - 'Monster'
  - 'GlassDoor' # This used to take ~10x longer to run than the other providers

You will change the GlassDoor setting, and the providers part of your settings.yaml should look like this now:

  - 'Indeed'
  - 'Monster'
  - 'GlassDoorStatic'
  # - 'GlassDoorDynamic'

Now your entire settings.yaml file should look like this:

# All paths are relative to this file.

# Paths.
# place the search right next to this file
output_path: './'

# Providers from which to search (case insensitive)
  - 'Indeed'
  - 'Monster'
  - 'GlassDoorStatic'
  # - 'GlassDoorDynamic'

# Filters.
    province: 'TX'
    city:     'Allen'
    domain:   'com'
    radius:   30

    - 'Advertising'
    - 'Marketing'
    - 'Coordinator'
    - 'Account'
    - 'Agency'

  - 'Sales'
  - 'Media'
  - 'Digital'
  - 'Social'

# Logging level options are: critical, error, warning, info, debug, notset
log_level: 'info'

# Saves duplicates removed by tfidf filter to duplicate_list.csv
save_duplicates: False

# Turn on or off delaying
# set_delay: True 

# Delaying algorithm configuration
    # Functions used for delaying algorithm, options are: constant, linear, sigmoid
    function: 'linear'
    # Maximum delay/upper bound for converging random delay
    delay: 30
    # Minimum delay/lower bound for random delay  
    min_delay: 15 
    # Random delay
    random: True 
    # Converging random delay, only used if 'random' is set to True
    converge: True 

Following these steps should fix your problem.
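As an alternative to reading the pip3 show output in step 2, the installed version can be checked from Python. This is only a sketch: it needs Python 3.8 or newer for importlib.metadata, assumes the release string is purely numeric (such as 2.1.8), and the helper names are made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_version(text):
    """Turn a dotted release string such as '2.1.8' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


def supports_split_glassdoor(installed):
    """True if `installed` is 2.1.8 or newer."""
    return parse_version(installed) >= (2, 1, 8)


try:
    installed = version("JobFunnel")
except PackageNotFoundError:
    print("JobFunnel is not installed in this environment")
else:
    print(installed, supports_split_glassdoor(installed))
```

Comparing tuples of integers avoids the string-comparison trap where '2.1.10' would sort before '2.1.8'.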

It is unlikely that the GlassDoorStatic scraper will fail, but if it does, you can always just comment it out of the settings.yaml like this: # - 'GlassDoorStatic'

Sorry we don't have clear documentation on these changes. Will make sure to update the readme on the next PR to make this clear to users.

Hope this works! Cheers!

phcreery commented 4 years ago

Awesome! I will try this as soon as possible.