PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License
1.81k stars, 212 forks

JobFunnel 3.0 with localization, ABC and improved scraping #90

Closed PaulMcInnis closed 3 years ago

PaulMcInnis commented 4 years ago

Description

This is version 3.0 of JobFunnel with numerous improvements including:

This will affect anyone currently developing off of the old branch, as a rebase will be untenable. I may need to squash this down a lot more.

If you are reading this, please give this branch a go. I find the easiest non-disruptive way is to clone this repo as ABCJobFunnel and run:

cd ABCJobFunnel
python3 -m jobfunnel

A good place to start is

Issues affected:

Context of change

Type of change

I have updated all documentation.

Existing master CSV files can be ported by adding the missing columns, but starting fresh is recommended. Existing cache files and block lists are not compatible. Block lists could be made compatible, however, and that may be worth pursuing.
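As a rough illustration of porting an old master CSV by adding missing columns, here is a minimal sketch. The column names and the `port_master_csv` helper are hypothetical; the actual 3.0 column set is not listed in this thread, so substitute the real header before using anything like this.

```python
import csv
import io

# Hypothetical column names: the real 3.0 schema is not given in this thread.
NEW_COLUMNS = ["status", "title", "company", "location", "date", "link", "tags"]


def port_master_csv(src, dst, new_columns=NEW_COLUMNS, fill=""):
    """Copy rows from src to dst, appending any column the old file lacks."""
    reader = csv.DictReader(src)
    old_columns = list(reader.fieldnames or [])
    # Keep the old columns in their original order, then add the missing ones.
    fieldnames = old_columns + [c for c in new_columns if c not in old_columns]
    writer = csv.DictWriter(
        dst,
        fieldnames=fieldnames,
        restval=fill,            # value written for columns the old rows lack
        extrasaction="ignore",   # skip stray cells beyond the old header
    )
    writer.writeheader()
    for row in reader:
        writer.writerow(row)
    return fieldnames


if __name__ == "__main__":
    old = io.StringIO("title,link\nDev,http://example.com/1\n")
    new = io.StringIO()
    port_master_csv(old, new, new_columns=["status", "title", "link"])
    print(new.getvalue())
```

The same function works on real files by passing handles opened with `newline=""`. Added columns are left blank, so any value the new version expects in them still has to be filled in by hand.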

How Has This Been Tested?

General monkey testing, but I need to raise test coverage to be truly confident in the code quality. I would appreciate it if anyone reading this would try running it and try breaking it. Respond here with any bugs you find.

Checklist:

Additional TBD:

PaulMcInnis commented 4 years ago

FYI I've put this up before I've re-upped the coverage / fixed the pyenv to make it accessible. Fixing the coverage will take some time, but I don't anticipate making any further large changes to the structure of the codebase.

bunsenmurder commented 4 years ago

I reviewed as much as I could but ended up stopping partway, as this update seems to break debugging in PyCharm. I was able to run JF normally, but debugging would cause it to stall indefinitely. The issue seems to stem from the use of properties in this new version; more details about the issue can be found in this thread on JetBrains' support forum.

PaulMcInnis commented 4 years ago

Thanks for taking a look, guys. I'll be fixing the CLI issues tomorrow. I might need to add some functional testing as well to make sure I've smoke tested this a bit better (in lieu of complete unit testing).

Additionally, it seems that pyenv sync doesn't work with the jobfunnel dependency; not sure what's up with that yet.

PaulMcInnis commented 4 years ago

It would also seem that the USA_ENGLISH locale is broken for the default settings.yaml; I need to look into this.

codecov-commenter commented 4 years ago

Codecov Report

Merging #90 into master will decrease coverage by 21.50%. The diff coverage is 36.94%.


@@             Coverage Diff             @@
##           master      #90       +/-   ##
===========================================
- Coverage   58.34%   36.83%   -21.51%     
===========================================
  Files          13       22        +9     
  Lines        1150     1341      +191     
===========================================
- Hits          671      494      -177     
- Misses        479      847      +368     
Impacted Files                             Coverage Δ
jobfunnel/__main__.py                      0.00% <0.00%> (-35.90%) :arrow_down:
jobfunnel/backend/jobfunnel.py             0.00% <0.00%> (ø)
jobfunnel/backend/tools/delay.py           21.15% <21.15%> (ø)
jobfunnel/backend/tools/filters.py         21.27% <21.27%> (ø)
jobfunnel/backend/job.py                   26.47% <26.47%> (ø)
jobfunnel/backend/scrapers/monster.py      28.35% <28.35%> (ø)
jobfunnel/config/manager.py                29.78% <29.78%> (ø)
jobfunnel/backend/tools/tools.py           29.87% <29.87%> (ø)
jobfunnel/backend/scrapers/glassdoor.py    30.14% <30.14%> (ø)
jobfunnel/backend/scrapers/indeed.py       30.90% <30.90%> (ø)
... and 33 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 5275820...cbbd917.

PaulMcInnis commented 4 years ago

Verified that USA_ENGLISH functions for Indeed and Monster, added a locale scrape for USA_ENGLISH to round things out a bit more on Travis.

PaulMcInnis commented 4 years ago

I'm still having a bit of a time with the CLI vs YAML vs defaults. I made some progress just now and just need to add the defaults injection.

I was having a hard time writing tests for the methods I made, so I've broken it down a bit further. Going to take another crack at it later.

I guess I can see now why most programs allow just YAML or just CLI.
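For reference, the layering being wrestled with here (CLI overrides YAML, YAML overrides defaults) can be sketched as below. This is not the code in the branch; `DEFAULTS`, its keys, and `resolve_settings` are hypothetical names used only to show the precedence order.

```python
from collections import ChainMap

# Hypothetical defaults; the real default settings live in the repo, not here.
DEFAULTS = {"log_level": "INFO", "delay": 10, "locale": None}


def resolve_settings(cli_args, yaml_settings, defaults=DEFAULTS):
    """Merge settings with precedence: CLI > YAML > defaults."""
    # argparse-style namespaces use None for "flag not given", so drop those
    # entries, otherwise an unset flag would mask the YAML value.
    cli = {key: value for key, value in cli_args.items() if value is not None}
    # ChainMap looks keys up left to right, so the first mapping wins.
    return dict(ChainMap(cli, yaml_settings, defaults))


if __name__ == "__main__":
    merged = resolve_settings(
        cli_args={"delay": None, "locale": "USA_ENGLISH"},
        yaml_settings={"delay": 5},
    )
    print(merged)
```

Even this small version shows where the test burden comes from: every key has three possible sources, and validation has to run on the merged result rather than on any single input.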

PaulMcInnis commented 4 years ago

The fact that this is so hard to test indicates to me that perhaps we shouldn't let users mix YAML, CLI arguments, and default values. It gets bogged down in invalid cases and combinatorial cases...

Perhaps we can make the YAML and CLI mutually exclusive?
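One way to make the two input styles mutually exclusive is to give each its own argparse subcommand, so a single invocation can only use one of them. The sketch below assumes hypothetical subcommand and flag names (`load`, `inline`, `--settings`, `--keywords`, `--locale`); it shows the idea and is not the CLI this PR ships.

```python
import argparse


def build_parser():
    """Build a parser where YAML and inline CLI settings cannot be mixed."""
    parser = argparse.ArgumentParser(prog="funnel-sketch")
    # required=True (Python 3.7+) forces the user to pick exactly one mode.
    modes = parser.add_subparsers(dest="mode", required=True)

    # Mode 1: everything comes from a YAML file.
    load = modes.add_parser("load", help="read all settings from a YAML file")
    load.add_argument("-s", "--settings", required=True, help="path to YAML")

    # Mode 2: everything comes from command-line flags.
    inline = modes.add_parser("inline", help="pass all settings as flags")
    inline.add_argument("--keywords", nargs="+", required=True)
    inline.add_argument("--locale", default="USA_ENGLISH")

    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(["load", "-s", "settings.yaml"])
    print(args.mode, args.settings)
```

Because each subparser only knows its own flags, passing `--keywords` after `load` is rejected by argparse itself, which removes the mixed cases from the test matrix.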

PaulMcInnis commented 3 years ago

OK, I'm just working on getting a few final things in, but separating the CLI out made things a lot easier. I'm finally moving past that mess and have added some simple tests to verify it actually works.

PaulMcInnis commented 3 years ago

OK, I've tested this enough for now.

Master is pretty broken compared to this, so I'm going to merge and fix bugs as they come in from now on.

Still TODO:
[ ] Inter-scrape duplicates by TFIDF
[ ] GlassDoor scraper (webdriven)
[ ] more testing
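To illustrate the first TODO item, here is a minimal, standard-library-only sketch of flagging near-duplicate job descriptions with TF-IDF weights and cosine similarity. The thread does not describe how JobFunnel would implement this, so the function names, the whitespace tokenizer, the smoothed IDF formula, and the 0.75 threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (naive whitespace tokens)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF (an assumption) keeps shared terms at nonzero weight.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(
        sum(w * w for w in b.values())
    )
    return dot / norm if norm else 0.0


def find_duplicates(docs, threshold=0.75):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    vectors = tfidf_vectors(docs)
    return [
        (i, j)
        for i in range(len(docs))
        for j in range(i + 1, len(docs))
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


if __name__ == "__main__":
    descriptions = [
        "Senior Python developer wanted for backend services team",
        "Senior Python developer wanted for backend services team",
        "Registered nurse needed for night shifts at city hospital",
    ]
    print(find_duplicates(descriptions))
```

The pairwise comparison is quadratic in the number of descriptions, which is fine for a sketch but would likely need a vectorized library for large scrapes.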