PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License
1.85k stars 215 forks

failed to scrape Indeed: invalid literal for int() with base 10: 'Just poste' #35

Closed DannyCork closed 4 years ago

DannyCork commented 4 years ago

I get an error when running funnel: an exception gets caught when scraping Indeed, and the run then moves on to Monster...

 $ funnel -s JobFunnel/jobfunnel/config/settings.yaml

jobfunnel initialized at 2020-01-05
jobfunnel indeed to pickle running @ 2020-01-05
Starting new HTTP connection (1): www.indeed.ie
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0 HTTP/1.1" 301 362
Starting new HTTP connection (1): ie.indeed.com
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0 HTTP/1.1" 301 0
Starting new HTTPS connection (1): ie.indeed.com
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0 HTTP/1.1" 200 None
Found 291 indeed results for query=security
getting indeed page 0 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=0
getting indeed page 1 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=50
Starting new HTTP connection (1): www.indeed.ie
Starting new HTTP connection (1): www.indeed.ie
getting indeed page 2 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=100
Starting new HTTP connection (1): www.indeed.ie
getting indeed page 3 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=150
Starting new HTTP connection (1): www.indeed.ie
getting indeed page 4 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=200
Starting new HTTP connection (1): www.indeed.ie
getting indeed page 5 : http://www.indeed.ie/jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=250
Starting new HTTP connection (1): www.indeed.ie
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=50 HTTP/1.1" 301 375
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=0 HTTP/1.1" 301 374
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=200 HTTP/1.1" 301 376
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=150 HTTP/1.1" 301 376
Starting new HTTP connection (1): ie.indeed.com
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=100 HTTP/1.1" 301 376
Starting new HTTP connection (1): ie.indeed.com
Starting new HTTP connection (1): ie.indeed.com
Starting new HTTP connection (1): ie.indeed.com
http://www.indeed.ie:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=250 HTTP/1.1" 301 376
Starting new HTTP connection (1): ie.indeed.com
Starting new HTTP connection (1): ie.indeed.com
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=0 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=250 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=50 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=150 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=100 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx%2C+&radius=25&limit=50&filter=0&start=200 HTTP/1.1" 301 None
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=0 HTTP/1.1" 301 0
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=250 HTTP/1.1" 301 0
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=50 HTTP/1.1" 301 0
Starting new HTTPS connection (1): ie.indeed.com
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=100 HTTP/1.1" 301 0
Starting new HTTPS connection (1): ie.indeed.com
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=150 HTTP/1.1" 301 0
http://ie.indeed.com:80 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=200 HTTP/1.1" 301 0
Starting new HTTPS connection (1): ie.indeed.com
Starting new HTTPS connection (1): ie.indeed.com
Starting new HTTPS connection (1): ie.indeed.com
Starting new HTTPS connection (1): ie.indeed.com
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=250 HTTP/1.1" 200 None
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=0 HTTP/1.1" 200 None
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=100 HTTP/1.1" 200 None
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=150 HTTP/1.1" 200 None
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=50 HTTP/1.1" 200 None
https://ie.indeed.com:443 "GET /jobs?q=security&l=xxxx,&radius=25&limit=50&filter=0&start=200 HTTP/1.1" 200 None
failed to scrape Indeed: invalid literal for int() with base 10: 'Just poste'
jobfunnel monster to pickle running @ : 2020-01-05
Starting new HTTPS connection (1): www.monster.ie
https://www.monster.ie:443 "GET /jobs/search/?q=security&whe

Note that I've changed the location to xxxx for posting purposes.

The settings file looks like this:

# This is the default settings file. Do not edit.

# All paths are relative to this file.

# Paths.
output_path: 'search'

# Providers from which to search (case insensitive)
providers:
  - 'Indeed'
  - 'Monster'
  - 'GlassDoor' # This takes ~10x longer to run than the other providers

# Filters.
search_terms:
  region:
    province: ''
    city:     'xxxx'
    domain:   'ie'
    radius:   25

  keywords:
    - 'security'

# Black-listed company names
black_list:
  - 'yyyyyyyyyy'

# Logging level options are: critical, error, warning, info, debug, notset
log_level: 'debug'
PaulMcInnis commented 4 years ago

Ah yeah looks like the scraper needs updating to handle this case.

Will get on this.

In the meantime, can you try with this fork, which has improved regexes? https://github.com/bunsenmurder/JobFunnel
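
For illustration only: the truncated value 'Just poste' in the error suggests that a date label such as "Just posted" was sliced and handed to int(). The sketch below shows one way a scraper could tolerate non-numeric labels. It is not JobFunnel's actual code; the function name parse_post_age_days, the set of labels, and the days-per-unit table are all assumptions made for this example.

```python
import re
from typing import Optional

# Hypothetical helper, not taken from JobFunnel.
# Matches labels such as "3 days ago", "30+ days ago", "2 weeks ago".
_AGE_RE = re.compile(r"(\d+)\+?\s*(hour|day|week|month)s?\s+ago", re.IGNORECASE)

# Rough conversion to whole days (assumed values; hours count as "today").
_DAYS_PER_UNIT = {"hour": 0, "day": 1, "week": 7, "month": 30}


def parse_post_age_days(label: str) -> Optional[int]:
    """Return the approximate age of a posting in days, or None if unknown."""
    text = label.strip().lower()
    # Labels with no number in them are handled before any int() call.
    if text in ("just posted", "today"):
        return 0
    if text == "yesterday":
        return 1
    match = _AGE_RE.search(text)
    if match is None:
        # Unknown format: report "unknown" instead of raising ValueError.
        return None
    count = int(match.group(1))
    unit = match.group(2).lower()
    return count * _DAYS_PER_UNIT[unit]
```

Returning None for an unrecognised label lets the caller keep the job and leave its date blank, rather than abandoning the whole provider as in the log above.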

studentbrad commented 4 years ago

The fork has now been merged with master as version 2.0.0.

studentbrad commented 4 years ago

May be a duplicate. Similar to #37; it likely fails for the same reason.

studentbrad commented 4 years ago

Closing this issue due to inactivity. However, it has not been forgotten. In the meantime, be careful what you pass to JobFunnel in your settings.yaml. Ideally, we would verify the config before passing it as args to JobFunnel. Stay tuned.
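
As a rough idea of what such verification could look like, here is a minimal sketch that checks an already-loaded settings dictionary with the same keys as the settings file posted above. The function name validate_settings and the specific rules (non-empty city and domain, non-negative integer radius) are assumptions for illustration, not JobFunnel's implementation; the provider names and log levels are the ones listed in that file.

```python
from typing import Any, Dict, List

# Values taken from the settings file shown in this issue.
KNOWN_PROVIDERS = {"indeed", "monster", "glassdoor"}
LOG_LEVELS = {"critical", "error", "warning", "info", "debug", "notset"}


def validate_settings(settings: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    errors: List[str] = []

    providers = settings.get("providers")
    if not isinstance(providers, list) or not providers:
        errors.append("providers must be a non-empty list")
    else:
        for name in providers:
            # Providers are case insensitive, per the settings file comment.
            if str(name).lower() not in KNOWN_PROVIDERS:
                errors.append(f"unknown provider: {name!r}")

    search = settings.get("search_terms")
    if not isinstance(search, dict):
        search = {}
    region = search.get("region")
    if not isinstance(region, dict):
        region = {}

    # Assumed rules: city and domain are required, radius is an int >= 0.
    if not region.get("city"):
        errors.append("search_terms.region.city must not be empty")
    if not region.get("domain"):
        errors.append("search_terms.region.domain must not be empty")
    radius = region.get("radius")
    if isinstance(radius, bool) or not isinstance(radius, int) or radius < 0:
        errors.append("search_terms.region.radius must be a non-negative integer")

    keywords = search.get("keywords")
    if not isinstance(keywords, list) or not keywords:
        errors.append("search_terms.keywords must be a non-empty list")

    if str(settings.get("log_level", "")).lower() not in LOG_LEVELS:
        errors.append("log_level must be one of: " + ", ".join(sorted(LOG_LEVELS)))

    return errors
```

Collecting every problem into a list, instead of raising on the first one, lets the tool report all config mistakes in a single run before any scraping starts.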