PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License

Released version 2.1.8 failed on March-28-2021 #140

Closed Nllii closed 3 years ago

Nllii commented 3 years ago

Description

I installed with:

pip3 install git+https://github.com/PaulMcInnis/JobFunnel.git@2.1.8

I have been using this for a couple of years now. For some reason, it failed this time. What I have done so far:

  1. Deleted all the data files in search (master_list.csv, jobfunnel.log, jobs_2021-03-22.pkl, jobs_2021-03-28.pkl, filter_list.json).
  2. Disabled adblocker.

Error

admin@Admins-MacBook-Pro ~ % bash job.sh                                                        
finding you jobs
jobfunnel initialized at 2021-03-28
no master-list, filter-list was not updated
jobfunnel indeed to pickle running @ 2021-03-28
failed to scrape Indeed: 'NoneType' object has no attribute 'contents'
jobfunnel monster to pickle running @ 2021-03-28
failed to scrape Monster: 'NoneType' object has no attribute 'text'
jobfunnel glassdoor to pickle running @ 2021-03-28
failed to scrape GlassDoor: 'NoneType' object has no attribute 'text'
Traceback (most recent call last):
  File "/usr/local/bin/funnel", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/jobfunnel/__main__.py", line 48, in main
    jp.update_masterlist()
  File "/usr/local/lib/python3.8/site-packages/jobfunnel/jobfunnel.py", line 358, in update_masterlist
    raise ValueError("No scraped jobs, cannot update masterlist")
ValueError: No scraped jobs, cannot update masterlist
DONE
admin@Admins-MacBook-Pro ~ % 

Apart from Google and YouTube insisting on a captcha every 3 hours for my IP, the tool itself has become unusable. The traffic coming from my machine is from this code running.
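For context, the "'NoneType' object has no attribute 'text'" messages in the log are what Python raises when a lookup returns None and the code then reads an attribute from the result. That commonly happens when a site changes its page layout. The sketch below only illustrates the failure mode and one possible guard. FakeTag, find_title and text_or_default are hypothetical names invented for this example, not JobFunnel code.

```python
from typing import Optional


class FakeTag:
    """Hypothetical stand-in for a parsed HTML element with a .text attribute."""

    def __init__(self, text: str) -> None:
        self.text = text


def find_title(page: dict) -> Optional[FakeTag]:
    """Hypothetical lookup: returns None when the expected element is missing."""
    return page.get("title")


def text_or_default(tag: Optional[FakeTag], default: str = "") -> str:
    """Read tag.text, falling back to a default when the lookup returned None."""
    return tag.text.strip() if tag is not None else default


old_layout = {"title": FakeTag(" Phlebotomist ")}
new_layout = {}  # assumption: the site changed and the element is no longer found

# Unguarded attribute access reproduces the kind of error shown in the log.
try:
    find_title(new_layout).text
    message = ""
except AttributeError as err:
    message = str(err)

print(message)
print(text_or_default(find_title(old_layout)))
print(repr(text_or_default(find_title(new_layout))))
```

With a guard like this a scraper can skip or log a listing with a missing field instead of aborting the whole provider.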

PaulMcInnis commented 3 years ago

We are working on a release with a number of changes which may help.

Are you able to test the current master of this repository? This is best done by installing in place; you should back up any masterlist and filterlists first.
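One way to take that backup is sketched below. It assumes the file names mentioned earlier in this thread (master_list.csv and filter_list.json under a search folder). The function name backup_search_files and the dated backup folder layout are assumptions made for this example, not part of JobFunnel.

```python
import shutil
import tempfile
from datetime import date
from pathlib import Path


def backup_search_files(search_dir: Path, backup_root: Path) -> list:
    """Copy master/filter list files found under search_dir into a dated folder."""
    names = ("master_list.csv", "filter_list.json")
    target = backup_root / f"backup_{date.today().isoformat()}"
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for name in names:
        # rglob also finds files in subfolders such as search/data/
        for src in search_dir.rglob(name):
            dst = target / src.relative_to(search_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dst)  # copy2 keeps timestamps where possible
            copied.append(dst)
    return copied


# Demo on a throwaway directory so no real search results are touched.
with tempfile.TemporaryDirectory() as tmp:
    search = Path(tmp) / "search"
    (search / "data").mkdir(parents=True)
    (search / "master_list.csv").write_text("status,title\n")
    (search / "data" / "filter_list.json").write_text("{}")
    copied = backup_search_files(search, Path(tmp) / "backups")
    copied_names = sorted(p.name for p in copied)
    print(copied_names)
```

Keep the backup folder outside the search folder so that repeated runs do not pick up earlier copies.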

PaulMcInnis commented 3 years ago

Seems we are having issues with scraping due to a regex.

PaulMcInnis commented 3 years ago

OK just merged a PR that may fix this, but you should try using current master

Nllii commented 3 years ago

"Are you able to test the current master of this repository?"

Yes, I checked out the master repo last year, but I had to revert to 2.1.8. Version 2.1.8 was faster and straightforward, nothing fancy.

I don't know if this helps, but if the end user already has a copy of 2.1.8 on Kaggle and re-runs it, this is the outcome: https://www.kaggle.com/bellphegor/job-search

  1. The first run filters the jobs with --max_listing_days 2, finds jobs on Indeed, and adds them to the CSV file after filtering.
  2. It then fails when the end user runs it again.
  3. Why does it fail on the second run? I will try to get the current masterlist and filterlists from Kaggle to reproduce the outcome.

jobfunnel indeed to pickle running @ 2021-03-29
Found 4 indeed results for query=phlebotomist
getting indeed page 0 : http://www.indeed.com/jobs?q=phlebotomist&l=HOUSTON%2C+TX&radius=25&limit=50&filter=0&start=0
getting indeed page 1 : http://www.indeed.com/jobs?q=phlebotomist&l=HOUSTON%2C+TX&radius=25&limit=50&filter=0&start=50
getting indeed page 2 : http://www.indeed.com/jobs?q=phlebotomist&l=HOUSTON%2C+TX&radius=25&limit=50&filter=0&start=100
getting indeed page 3 : http://www.indeed.com/jobs?q=phlebotomist&l=HOUSTON%2C+TX&radius=25&limit=50&filter=0&start=150
date_filter running

delay of 10.00s, getting indeed search: http://www.indeed.com/viewjob?jk=ac44060dadbe32b3
delay of 10.00s, getting indeed search: http://www.indeed.com/viewjob?jk=a028d791865bb433
delay of 10.00s, getting indeed search: http://www.indeed.com/viewjob?jk=db0271a737679ea2
indeed scrape job took 206.649s
jobfunnel monster to pickle running @ 2021-03-29
failed to scrape Monster: 'NoneType' object has no attribute 'text'
no jobs filtered, missing search/data/filter_list.json
removed 0 jobs in blacklist from master-list
Found and removed 6 re-posts/duplicates via TFIDF cosine similarity!
no masterlist detected, added 5 jobs to search/master_list.csv
done. see un-archived jobs in search/master_list.csv
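The page URLs in the log above follow a simple pattern: the start parameter advances by the limit (50) for each page. A minimal sketch that rebuilds those URLs is shown here. The function name indeed_page_urls is hypothetical, and the parameter set is copied from the logged URLs rather than from JobFunnel's source or any documented Indeed API.

```python
from urllib.parse import urlencode


def indeed_page_urls(query, location, n_pages, per_page=50, radius=25):
    """Rebuild search-page URLs in the shape seen in the log (assumed pattern)."""
    base = "http://www.indeed.com/jobs"
    urls = []
    for page in range(n_pages):
        params = {
            "q": query,
            "l": location,
            "radius": radius,
            "limit": per_page,
            "filter": 0,
            "start": page * per_page,  # offset of the first result on this page
        }
        # urlencode escapes the comma and space in the location string
        urls.append(f"{base}?{urlencode(params)}")
    return urls


urls = indeed_page_urls("phlebotomist", "HOUSTON, TX", 4)
for url in urls:
    print(url)
```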

Nllii commented 3 years ago

OK just merged a PR that may fix this, but you should try using current master

Awesome thanks, I will update the module.

PaulMcInnis commented 3 years ago

I just cut a release as well, so you should be able to simply try out 3.0.2.

I hear you on the complexity increase as well, flexibility definitely has a cost.

Given that we don't really have any upgrade or versioning plan currently, I would maintain a backup of all my search results as much as possible.

The older code has flaky match and update logic, which the newer versions improve on with TFIDF and ID matching.
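To illustrate the general idea of TFIDF cosine similarity for spotting re-posts, here is a small standard-library sketch. It is not JobFunnel's implementation: the whitespace tokenizer, the smoothed IDF formula, the 0.75 threshold, and the function names are all assumptions chosen for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # smoothed IDF keeps a non-zero weight for terms found in every document
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def find_duplicates(docs, threshold=0.75):
    """Return index pairs whose descriptions are at least `threshold` similar."""
    vecs = tfidf_vectors(docs)
    return [
        (i, j)
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if cosine(vecs[i], vecs[j]) >= threshold
    ]


descriptions = [
    "Phlebotomist needed for busy Houston clinic, full time",
    "Phlebotomist needed for busy Houston clinic, part time",
    "Software engineer building data pipelines in Python",
]
duplicates = find_duplicates(descriptions)
print(duplicates)
```

The first two descriptions differ by one word and are flagged as a pair, while the third shares no terms with them. Matching on a stable job ID, where a provider exposes one, avoids relying on text similarity alone.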

Nllii commented 3 years ago

Awesome, thanks. The released version 3.0.2 works.

bash job.sh       
finding you jobs
[2021-03-29 19:31:31,209] [INFO] JobFunnel: Scraping local providers with: ['IndeedScraperUSAEng']
[2021-03-29 19:31:40,281] [INFO] IndeedScraperUSAEng: Found 4 pages of search results for query=Phlebotomist
[2021-03-29 19:31:48,047] [INFO] IndeedScraperUSAEng: Scraped 188 job listings from search results pages
100%|#######################################################| 188/188 [02:18<00:00,  1.35it/s]
[2021-03-29 19:34:07,240] [INFO] JobFunnel: Completed all scraping, found 188 new jobs.
[2021-03-29 19:34:07,394] [INFO] JobFunnel: Done. View your current jobs in demo_job_search_results/demo_search.csv
DONE