Closed phcreery closed 3 years ago
Thanks for this great bug report!
Looks like we've broken something. Will advise you to test again once #77 and #75 go in.
I'm trying to reproduce this issue on Windows 10, but haven't been able to. JobFunnel seems to work fine on my Windows 10 installation. Could you give us more details on how you installed Python and pip on your Windows 10 installation? If you don't mind, could you also share any command line arguments you passed when running JobFunnel?
Yea, no problem.
> funnel -s ..\mel\settings.yaml --log_level debug
> python3 --version
Python 3.8.3
> pip3 --version
pip 19.2.3 from c:\users\phcre\appdata\local\programs\python\python38\lib\site-packages\pip (python 3.8)
# All paths are relative to this file.
# Paths.
# place the search right next to this file
output_path: './'
# Providers from which to search (case insensitive)
providers:
- 'Indeed'
- 'Monster'
- 'GlassDoor' # This used to take ~10x longer to run than the other providers
# Filters.
search_terms:
  region:
    province: 'TX'
    city: 'Allen'
    domain: 'com'
    radius: 30
  keywords:
    - 'Advertising'
    - 'Marketing'
    - 'Coordinator'
    - 'Account'
    - 'Agency'
black_list:
- 'Sales'
- 'Media'
- 'Digital'
- 'Social'
# Logging level options are: critical, error, warning, info, debug, notset
log_level: 'info'
# Saves duplicates removed by tfidf filter to duplicate_list.csv
save_duplicates: False
# Turn on or off delaying
# set_delay: True
# Delaying algorithm configuration
delay_config:
  # Functions used for delaying algorithm, options are: constant, linear, sigmoid
  function: 'linear'
  # Maximum delay/upper bound for converging random delay
  delay: 30
  # Minimum delay/lower bound for random delay
  min_delay: 15
  # Random delay
  random: True
  # Converging random delay, only used if 'random' is set to True
  converge: True
Awesome! Thanks so much for all these details! Will investigate and get back to you as soon as I can (likely towards the evening).
As a quick and dirty fix, could you comment out the GlassDoor scraper from the settings file?
You can do that by adding a # to the GlassDoor scraper string.
So basically make this line
- 'GlassDoor' # This used to take ~10x longer to run than the other providers
in the settings file look like this:
#- 'GlassDoor' # This used to take ~10x longer to run than the other providers
Does that fix it?
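If editing the file by hand is awkward, the same workaround can be sketched in a few lines of Python. This is only an illustration, not part of JobFunnel: the helper name `comment_out_provider` is hypothetical, and it works on the settings text directly rather than through JobFunnel's own config loader.

```python
import re


def comment_out_provider(settings_text: str, provider: str) -> str:
    """Prefix '#' to the YAML list item that names `provider`.

    Hypothetical helper, not part of JobFunnel. Matching is
    case-insensitive and only hits the exact provider name, so
    'GlassDoor' does not match 'GlassDoorStatic'.
    """
    pattern = re.compile(
        r"^([ \t]*)(-[ \t]*['\"]?" + re.escape(provider) + r"['\"]?(?=\s|$).*)$",
        re.IGNORECASE | re.MULTILINE,
    )
    # Keep the original indentation (group 1) and put '#' before the dash.
    return pattern.sub(r"\1#\2", settings_text)


example = """providers:
- 'Indeed'
- 'Monster'
- 'GlassDoor' # This used to take ~10x longer to run than the other providers
"""

print(comment_out_provider(example, "GlassDoor"))
```

Lines that are already commented out are left alone, because the pattern requires the dash to be the first non-blank character on the line.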
I was able to reproduce! Can confirm this bug only affects the GlassDoorDynamic scraper, and not GlassDoorStatic.
@phcreery Follow these steps to fix your problem for good:
Type the following into the command line/PowerShell:
pip3 install --upgrade JobFunnel
pip3 show JobFunnel
There are now two GlassDoor scrapers: GlassDoorDynamic and GlassDoorStatic. Both of these scrape GlassDoor in different ways, but do the same thing. Because of this, the way you'll specify GlassDoor in the settings.yaml changes a little bit. Instead of having:
providers:
- 'Indeed'
- 'Monster'
- 'GlassDoor' # This used to take ~10x longer to run than the other providers
You will change the GlassDoor setting, and the providers part of your settings.yaml should look like this now:
providers:
- 'Indeed'
- 'Monster'
- 'GlassDoorStatic'
# - 'GlassDoorDynamic'
Now your entire settings.yaml file should look like this:
# All paths are relative to this file.
# Paths.
# place the search right next to this file
output_path: './'
# Providers from which to search (case insensitive)
providers:
- 'Indeed'
- 'Monster'
- 'GlassDoorStatic'
# - 'GlassDoorDynamic'
# Filters.
search_terms:
  region:
    province: 'TX'
    city: 'Allen'
    domain: 'com'
    radius: 30
  keywords:
    - 'Advertising'
    - 'Marketing'
    - 'Coordinator'
    - 'Account'
    - 'Agency'
black_list:
- 'Sales'
- 'Media'
- 'Digital'
- 'Social'
# Logging level options are: critical, error, warning, info, debug, notset
log_level: 'info'
# Saves duplicates removed by tfidf filter to duplicate_list.csv
save_duplicates: False
# Turn on or off delaying
# set_delay: True
# Delaying algorithm configuration
delay_config:
  # Functions used for delaying algorithm, options are: constant, linear, sigmoid
  function: 'linear'
  # Maximum delay/upper bound for converging random delay
  delay: 30
  # Minimum delay/lower bound for random delay
  min_delay: 15
  # Random delay
  random: True
  # Converging random delay, only used if 'random' is set to True
  converge: True
Following these steps should fix your problem.
It is unlikely that the GlassDoorStatic scraper will fail, but if it does, you can always just comment it out of the settings.yaml like this:
# - 'GlassDoorStatic'
Sorry we don't have clear documentation on these changes. Will make sure to update the readme on the next PR to make this clear to users.
Hope this works! Cheers!
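For anyone with several settings files to update, the provider rename described above could be scripted. The following is a minimal sketch under stated assumptions: `migrate_glassdoor_provider` is a hypothetical name, not a JobFunnel function, and it rewrites the settings text with a regular expression instead of parsing the YAML.

```python
import re

# Matches an active (uncommented) list item naming exactly 'GlassDoor',
# with or without quotes. Group 2 captures the quote character, if any,
# so the same quoting style is kept after the rename.
_LEGACY_ITEM = re.compile(
    r"^([ \t]*-[ \t]*)(['\"]?)GlassDoor\2(?=\s|$)",
    re.IGNORECASE | re.MULTILINE,
)


def migrate_glassdoor_provider(settings_text: str) -> str:
    """Rename the legacy 'GlassDoor' provider entry to 'GlassDoorStatic'.

    Hypothetical helper for illustration only. Entries that already say
    'GlassDoorStatic' or 'GlassDoorDynamic', and commented-out lines,
    are left untouched.
    """
    return _LEGACY_ITEM.sub(r"\1\2GlassDoorStatic\2", settings_text)


old_settings = """providers:
- 'Indeed'
- 'Monster'
- 'GlassDoor' # This used to take ~10x longer to run than the other providers
"""

print(migrate_glassdoor_provider(old_settings))
```

Running the function a second time on its own output changes nothing, so it is safe to apply to files that have already been updated.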
Awesome! I will try this as soon as possible.
Issue Template
Description
Standard search produces web scrape error
Steps to Reproduce
Standard search with
Expected behavior
Results of query
Actual behavior
No loglevel
query_words is empty therefore cannot be fit_transform by vectorizer
Debug Loglevel
webdriver manager returning 404 errors?
Variable Contents
prev_dict
cur_dict.values()
query_ids
query_words
Environment
beautifulsoup4>=4.6.3 (4.9.1)
lxml>=4.2.4 (4.5.1)
requests>=2.19.1 (2.23.0)
python-dateutil>=2.8.0 (2.8.1)
PyYAML>=5.1 (5.3.1)
scikit-learn>=0.21.2 (0.23.1)
nltk>=3.4.1 (3.5)
scipy>=1.4.1 (1.4.1)
selenium>=3.141.0 (3.141.0)
webdriver-manager>=2.4.0 (3.1.0)
soupsieve>1.2 (2.0.1)
certifi>=2017.4.17 (2020.4.5.2)
urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 (1.25.9)
chardet<4,>=3.0.2 (3.0.4)
idna<3,>=2.5 (2.9)
six>=1.5 (1.15.0)
threadpoolctl>=2.0.0 (2.1.0)
joblib>=0.11 (0.15.1)
numpy>=1.13.3 (1.18.5)
click (7.1.2)
tqdm (4.46.1)
atomicwrites>=1.0; (1.4.0)
packaging (20.4)
pluggy<1.0,>=0.12 (0.13.1)