PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.

failed to scrape Indeed: 'NoneType' object has no attribute 'contents' #37

Closed DannyCork closed 4 years ago

DannyCork commented 4 years ago

Ran $ funnel -s /home/danny/JobFunnel/jobfunnel/config/settings.yaml

and got

jobfunnel initialized at 2020-01-05
jobfunnel indeed to pickle running @ 2020-01-05
Starting new HTTP connection (1): www.indeed.ie
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+None&radius=25&limit=50&filter=0 HTTP/1.1" 301 366
Starting new HTTP connection (1): ie.indeed.com
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+None&radius=25&limit=50&filter=0 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,+None&radius=25&limit=50&filter=0 HTTP/1.1" 301 0
Starting new HTTPS connection (1): ie.indeed.com
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,+None&radius=25&limit=50&filter=0 HTTP/1.1" 200 None
failed to scrape Indeed: 'NoneType' object has no attribute 'contents'
....
PaulMcInnis commented 4 years ago

Looks like the Indeed scraper needs updating; will get on this ASAP.

PaulMcInnis commented 4 years ago

OK, I need a bit more information.

Can you show me your settings.yaml?

DannyCork commented 4 years ago

Thanks. It's the same settings.yaml file as the default:


# This is the default settings file. Do not edit.

# All paths are relative to this file.

# Paths.
output_path: 'search'

# Providers from which to search (case insensitive)
providers:
  - 'Indeed'
  - 'Monster'
  - 'GlassDoor' # This takes ~10x longer to run than the other providers

# Filters.
search_terms:
  region:
    province: ''
    city:     'xxxx'
    domain:   'ie'
    radius:   25

  keywords:
    - 'security'

# Black-listed company names
black_list:
  - 'yyyyyyyyyy'

# Logging level options are: critical, error, warning, info, debug, notset
log_level: 'debug'

studentbrad commented 4 years ago

I believe this is because of your search_terms; those values are inserted directly into the URL. We could improve this software by adding a verification step for the search_terms field. That said, it is no accident that the software did not work here.

I copied your generated URL into a web browser and got the following: [screenshot from 2020-01-07 00-39-01]

It would be possible to verify geographic locations against a country and province/state list and raise an error before scraping if no match is found, as in the sketch below. Perhaps this can be added to the list of things to do?
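
As a rough illustration, a pre-scrape validation step might look something like this (an untested sketch; the KNOWN_REGIONS table and validate_region helper are hypothetical and abbreviated, not part of JobFunnel):

# Sketch: validate the configured region against a known table of
# domains and provinces/states before scraping (hypothetical data).
KNOWN_REGIONS = {
    'ie': {'', 'Leinster', 'Munster', 'Connacht', 'Ulster'},
    'ca': {'ON', 'QC', 'BC', 'AB'},  # abbreviated example
}

def validate_region(domain, province):
    if domain not in KNOWN_REGIONS:
        raise ValueError(f'unsupported domain: {domain!r}')
    if province not in KNOWN_REGIONS[domain]:
        raise ValueError(f'unknown province {province!r} for domain {domain!r}')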

DannyCork commented 4 years ago

Yes indeed, Brad. Let's pick Dublin as the city:

search_terms:
  region:
    province: ''
    city:     'Dublin'
    domain:   'ie'
    radius:   25

This generates https://ie.indeed.com/jobs?q=security&l=dublin,+None&radius=25&limit=50&filter=0

Note the +None; I believe this is due to the province being empty ('').

The URL works fine without the +None: https://ie.indeed.com/jobs?q=security&l=dublin&radius=25&limit=50&filter=0

I think logic could be added that omits query parameters whose settings are empty; see the sketch below.
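
Something like this sketch, assuming a hypothetical build_indeed_url helper (not the actual JobFunnel code), where the province is only included when it is non-empty:

from urllib.parse import urlencode

def build_indeed_url(keywords, city, province='', domain='ie',
                     radius=25, limit=50):
    # Only include the province in the location when it is set, so we
    # never generate 'dublin, None' style locations.
    location = f'{city}, {province}' if province else city
    params = {'q': ' '.join(keywords), 'l': location,
              'radius': radius, 'limit': limit, 'filter': 0}
    return f'https://{domain}.indeed.com/jobs?' + urlencode(params)

# build_indeed_url(['security'], 'Dublin') ->
#   https://ie.indeed.com/jobs?q=security&l=Dublin&radius=25&limit=50&filter=0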

PaulMcInnis commented 4 years ago

Thanks for the investigation. It looks like we need to handle internationalization for areas without provinces.

remidubroca commented 4 years ago

Hello there! First of all, thanks a lot for this project!

I just hit the same issue. I put together a (dirty) workaround, tested only for Indeed in French.

It seems to me that the spaces on indeed.fr are not plain regular spaces, so using ' ' in the date regular expressions does not match; replacing it with '\s' works. The expressions are also in French (hour=heure, day=jour, month=mois, year=année, ...). So in tools.py (lines 21 to 26) the regular expressions for French become:

re.compile(r'(\d+)(?:[\s+]{1,3})?(?:heure|hr)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:jour|j)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?mois'),
re.compile(r'(\d+)(?:[\s+]{1,3})?année'),
re.compile(r'[aA]ujourd\'hui|[pP]ubliée à l\'instant'),
re.compile(r'[hH]ier')

Maybe use a bigger date_regex table with an offset depending on the locale? Or internationalize the regexes with more alternatives, like:

re.compile(r'(\d+)(?:[\s+]{1,3})?(?:hour|heure|hr)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:day|d|jour|j)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:month|mois)'),
re.compile(r'(\d+)(?:[\s+]{1,3})?(?:year|année)'),
re.compile(r'[tT]oday|[jJ]ust [pP]osted|[aA]ujourd\'hui|[pP]ubliée à l\'instant'),
re.compile(r'[yY]esterday|[hH]ier')

For now I have only worked around it with a bigger date_regex table and a locale offset; quick and dirty...
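
The locale-offset idea could look roughly like this (a sketch only; DATE_REGEXES and parse_post_age are hypothetical names, and only the hour and day patterns are shown for brevity):

import re

# Sketch: select the date regex table by locale instead of hard-coding one.
DATE_REGEXES = {
    'en': [re.compile(r'(\d+)(?:[\s+]{1,3})?(?:hour|hr)'),
           re.compile(r'(\d+)(?:[\s+]{1,3})?(?:day|d)')],
    'fr': [re.compile(r'(\d+)(?:[\s+]{1,3})?(?:heure|hr)'),
           re.compile(r'(\d+)(?:[\s+]{1,3})?(?:jour|j)')],
}

def parse_post_age(text, locale='en'):
    # Return the first numeric match, or None if no pattern applies.
    for regex in DATE_REGEXES[locale]:
        match = regex.search(text)
        if match:
            return int(match.group(1))
    return None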

Also, in indeed.py, line 133, the job count parsing is failing; I think this is the actual root cause of the 'NoneType' error:

num_res = int(re.findall(r'f (\d+) ', num_res.replace(',', ''))[0])

At first I also thought it was the province, but an empty province works. The issue for French is that the thousands separator is a space, not a comma. For now I have worked around it, still quick and dirty and locale-specific, with

num_res = int(re.findall(r'(\d+)', num_res.replace('\s', ''))[1])

(I do not remember why the index is [1] in my workaround.)

Maybe a better solution would be to use re.sub instead of replace: re.sub(r'\s|,', '', num_res) instead of num_res.replace(',', '').
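
Note that num_res.replace('\s', '') in the workaround above replaces the literal two-character string '\s' rather than whitespace, since str.replace does no regex matching, so re.sub is indeed the right tool. A sketch handling both separators (parse_num_results is a hypothetical helper; this would also explain the [1], as the first number on the page is presumably the page number and the last is the count):

import re

def parse_num_results(num_res):
    # Drop comma (en) and whitespace (fr, incl. non-breaking space)
    # thousands separators that sit between two digits; in Python 3,
    # \s already matches Unicode whitespace such as '\xa0'.
    cleaned = re.sub(r'(?<=\d)[\s,](?=\d)', '', num_res)
    numbers = re.findall(r'\d+', cleaned)
    # 'Page 1 of 1,024 jobs' / 'Page 1 de 1 024 emplois' -> 1024
    return int(numbers[-1]) if numbers else 0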

My 2 cents on this issue. Again, thanks for this project!

tgdn commented 4 years ago

Hello all, I just wanted to ask what the status of this issue is. Is there a fix, or is something going to be done about it?

Thank you in advance, Thomas

markkvdb commented 4 years ago

Short answer: no.

Long answer: no, because the problem is caused by the fact that job listing websites such as Glassdoor, Monster, etc. typically differ slightly from country to country. These small changes break JobFunnel, since we scrape the job listings using tags that are language-dependent.

The solution to this problem starts with writing an abstract formulation that developers can inherit from to write the web scraper for a particular country. Ideally this is done in a way that is accessible to the many developers whose countries are not yet supported. We are working on this, but keep in mind that this is a difficult problem, since it requires finding common patterns across all countries.
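
For illustration only, a minimal sketch of what such an abstraction might look like (hypothetical class and method names, not the actual JobFunnel design):

import re
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    """Shared scraping flow; locale-dependent parsing lives in subclasses."""

    domain: str  # e.g. 'fr' for fr.indeed.com

    @abstractmethod
    def parse_num_results(self, header_text):
        """Extract the total job count from the results-page header."""

    def search_url(self, keywords, city):
        # The URL layout is the locale-independent part of the flow.
        return (f'https://{self.domain}.indeed.com/jobs'
                f'?q={"+".join(keywords)}&l={city}')

class FrenchIndeedScraper(BaseScraper):
    domain = 'fr'

    def parse_num_results(self, header_text):
        # French pages use a space as the thousands separator.
        cleaned = re.sub(r'(?<=\d)\s(?=\d)', '', header_text)
        numbers = re.findall(r'\d+', cleaned)
        return int(numbers[-1]) if numbers else 0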