DannyCork closed this issue 4 years ago
Looks like the Indeed scraper needs updating. I will get on this ASAP.
OK, I need a bit more information.
Can you show me your settings.yaml?
Thanks, it is the same settings.yaml file:
# This is the default settings file. Do not edit.
# All paths are relative to this file.
# Paths.
output_path: 'search'
# Providers from which to search (case insensitive)
providers:
  - 'Indeed'
  - 'Monster'
  - 'GlassDoor' # This takes ~10x longer to run than the other providers
# Filters.
search_terms:
  region:
    province: ''
    city: 'xxxx'
    domain: 'ie'
    radius: 25
  keywords:
    - 'security'
# Black-listed company names
black_list:
  - 'yyyyyyyyyy'
# Logging level options are: critical, error, warning, info, debug, notset
log_level: 'debug'
I believe this is because of your search_terms.
These are the terms that are inserted into the URL. I believe we could improve this software by adding some verification of the search_terms field.
That said, it is expected that the software did not work with these settings.
I copied your generated URL into a web browser and got the following.
It should be possible to verify geographic locations against a country and province/state list and raise an error before scraping if a location is not found. Perhaps this can be added to the to-do list?
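A minimal sketch of that kind of pre-scrape check, for illustration only. The KNOWN_LOCATIONS table and the validate_region function are hypothetical names, not part of JobFunnel, and a real list would have to come from a maintained dataset:

```python
# Hypothetical lookup table: domain -> set of known city names (lowercase).
KNOWN_LOCATIONS = {
    "ie": {"dublin", "cork", "galway"},
    "ca": {"waterloo", "toronto"},
}


def validate_region(region):
    """Raise ValueError before scraping if the configured city is unknown."""
    domain = (region.get("domain") or "").lower()
    city = (region.get("city") or "").strip().lower()
    if domain not in KNOWN_LOCATIONS:
        raise ValueError(f"Unsupported domain: {domain!r}")
    if city not in KNOWN_LOCATIONS[domain]:
        raise ValueError(f"Unknown city {city!r} for domain {domain!r}")


# The placeholder city from the settings above is rejected up front.
try:
    validate_region({"province": "", "city": "xxxx", "domain": "ie"})
except ValueError as err:
    print(err)
```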
Yes indeed, Brad. So let's pick Dublin as the city:
search_terms:
  region:
    province: ''
    city: 'Dublin'
    domain: 'ie'
    radius: 25
This generates https://ie.indeed.com/jobs?q=security&l=dublin,+None&radius=25&limit=50&filter=0
Note the +None; I believe this is due to the province being empty ('').
The URL works fine without +None: https://ie.indeed.com/jobs?q=security&l=dublin&radius=25&limit=50&filter=0
I think logic could be added that leaves a field out of the query string when its setting is empty.
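One way that logic could look, as a sketch only. build_indeed_url is a hypothetical helper, not JobFunnel's actual code; the parameter names simply mirror the URL quoted above:

```python
from urllib.parse import urlencode


def build_indeed_url(domain, keywords, city, province=None, radius=25):
    """Build a search URL, leaving out location parts that are empty or None."""
    # Join only the non-empty parts, so an empty province never becomes "None".
    location = ", ".join(part for part in (city, province) if part)
    params = {
        "q": " ".join(keywords),
        "l": location,
        "radius": radius,
        "limit": 50,
        "filter": 0,
    }
    return f"https://{domain}.indeed.com/jobs?{urlencode(params)}"


print(build_indeed_url("ie", ["security"], "dublin", province=""))
```

With an empty province this yields the working URL from above; with a province set, the two parts are joined with a comma and URL-encoded.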
Thanks for the investigation. It looks like we need to handle internationalization for areas without provinces.
Hello there. First of all, thanks a lot for this project!
I just hit the same issue. I worked around it and tested (quick and dirty) only for Indeed in French.
It seems to me that the spaces on indeed.fr are not regular spaces, so matching ' ' in the date regular expressions does not work; I replaced it with '\s'. The expressions are also in French (hour = heure, day = jour, month = mois, year = année, ...). So in tools.py (lines 21 to 26) the regular expressions for French become:
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:heure|hr)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:jour|j)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?mois'),
re.compile(r'(\d+)(?:[\s+]{1,3})?année'),
re.compile(r'[aA]ujourd\'hui|[pP]ubliée à l\'instant'),
re.compile(r'[hH]ier')
Maybe use a bigger date_regex table with an offset depending on the locale? Or internationalize the regexes with more alternatives, as in:
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:hour|heure|hr)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:day|d|jour|j)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:month|mois)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:year|année)'),
re.compile(r'[tT]oday|[jJ]ust [pP]osted|[aA]ujourd\'hui|[pP]ubliée à l\'instant'),
re.compile(r'[yY]esterday|[hH]ier')
For now I have only worked around it with a bigger date_regex table and an offset, quick and dirty.
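A sketch of the locale-keyed alternative, for illustration. AGE_PATTERNS and age_in_days are hypothetical names rather than anything in tools.py, and the month and year lengths are rough approximations. In Python 3, `\s` in a str pattern also matches non-breaking spaces, which is why it handles the indeed.fr spacing:

```python
import re

# Hypothetical table: locale -> list of (pattern, days per unit).
AGE_PATTERNS = {
    "en": [
        (re.compile(r"(\d+)\+?\s*(?:hours?|hr)"), 1 / 24),
        (re.compile(r"(\d+)\+?\s*days?"), 1),
        (re.compile(r"(\d+)\+?\s*months?"), 30),
        (re.compile(r"(\d+)\+?\s*years?"), 365),
    ],
    "fr": [
        (re.compile(r"(\d+)\+?\s*(?:heures?|hr)"), 1 / 24),
        (re.compile(r"(\d+)\+?\s*jours?"), 1),
        (re.compile(r"(\d+)\+?\s*mois"), 30),
        (re.compile(r"(\d+)\+?\s*années?"), 365),
    ],
}


def age_in_days(text, locale):
    """Return the approximate posting age in days, or None if nothing matches."""
    for pattern, days_per_unit in AGE_PATTERNS[locale]:
        match = pattern.search(text)
        if match:
            return int(match.group(1)) * days_per_unit
    return None


# "\xa0" is a non-breaking space, as an assumption about what the site emits.
print(age_in_days("il y a 3\xa0jours", "fr"))
```

Keeping one table per locale avoids accidental cross-language matches and makes adding a country a matter of adding one entry.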
Also, in indeed.py, line 133, the count of jobs is failing. I think this is the root cause of the 'NoneType' error:
num_res = int(re.findall(r'f (\d+) ', num_res.replace(',', ''))[0])
At first I also thought it was the province, but an empty province works.
The issue for French is that the thousands separator is a space, not a comma.
For now I have worked around it, still quick and dirty, with a special case depending on the locale:
num_res = int(re.findall(r'(\d+)', num_res.replace('\s', ''))[1])
(I do not remember why the [1] is different in my workaround.)
Maybe a better solution would be to use re.sub instead of replace?
re.sub(r'\s|,','',num_res)
instead of
num_res.replace(',', '')
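A sketch of the re.sub idea, under assumptions. parse_result_count is a hypothetical helper, and the sample strings are guesses at the shape of the result-count text, not captured from the site. Note that str.replace treats '\s' as a literal backslash followed by "s", so only re.sub actually strips whitespace:

```python
import re


def parse_result_count(text):
    """Return the last number in text, ignoring thousands separators."""
    # Remove a comma or any whitespace only when it sits between two digits,
    # so "1,234" and "1 234" both become "1234" while other words stay apart.
    cleaned = re.sub(r"(?<=\d)[\s,](?=\d)", "", text)
    return int(re.findall(r"\d+", cleaned)[-1])


# Assumed sample strings; "\xa0" stands for a non-breaking space.
print(parse_result_count("Page 1 of 1,234 jobs"))
print(parse_result_count("Page 1 de 1\xa0234 emplois"))
```

Taking the last number rather than a fixed index avoids depending on the word "of", which is what the original pattern keys on.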
My 2 cents on this issue. Again, thanks for this project.
Hello all, I just wanted to know what progress has been made on this issue. Is there a fix, or is something going to be done about it?
Thank you in advance, Thomas
Short answer: no.
Long answer: no, because job listing websites such as Glassdoor, Monster, etc. typically have slightly different websites depending on the country. These small differences break JobFunnel, since we scrape the job listings using tags which are language-dependent.
The solution starts with writing an abstract base that developers can inherit from to write the web scraper for a particular country. Ideally this is done in a way that makes it accessible to the many developers whose country is not yet supported. We are working on this, but remember that it is a difficult issue, since it requires us to find common patterns across all countries.
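To make the idea concrete, here is a rough sketch of what such an abstract base could look like. All class and attribute names are hypothetical and this is not JobFunnel's actual design; the fr subdomain is also an assumption:

```python
import re
from abc import ABC, abstractmethod
from urllib.parse import urlencode


class BaseIndeedScraper(ABC):
    """Shared flow lives here; country-specific details live in subclasses."""

    @property
    @abstractmethod
    def domain(self):
        """Country subdomain, e.g. 'ie'."""

    @property
    @abstractmethod
    def separator_pattern(self):
        """Regex matching the thousands separator used in result counts."""

    def search_url(self, keyword, city):
        query = urlencode({"q": keyword, "l": city})
        return f"https://{self.domain}.indeed.com/jobs?{query}"

    def parse_count(self, text):
        # Strip the locale's separator only between digits, then take the last number.
        pattern = rf"(?<=\d)(?:{self.separator_pattern})(?=\d)"
        return int(re.findall(r"\d+", re.sub(pattern, "", text))[-1])


class IndeedIreland(BaseIndeedScraper):
    domain = "ie"
    separator_pattern = r","


class IndeedFrance(BaseIndeedScraper):
    domain = "fr"  # assumed subdomain
    separator_pattern = r"\s"


print(IndeedIreland().search_url("security", "dublin"))
```

Supporting a new country then means adding one small subclass, while instantiating the base class directly raises a TypeError because its abstract members are not filled in.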
Ran $ funnel -s /home/danny/JobFunnel/jobfunnel/config/settings.yaml
and got