PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License
1.81k stars 212 forks source link

Indeed Tests, Refactoring and Docker #82

Closed thebigG closed 4 years ago

thebigG commented 4 years ago

Indeed Tests

Hello everyone, Hope you are all doing well!

Description

I have added unit tests to jobfunnel/indeed.py. These tests were challenging because my goal was to isolate every piece of functionality of the indeed scraper so that if something fails in the future, we can easily find the point of failure. The unit tests were even more challenging because I aimed to make these tests as reliable as one can make them to be, given that they do depend on the network. I did not implement any flaky tests because flaky tests are not as reliable, given that they are flaky tests. Rather the approach I took was to, in some cases, have loose assertions. What I mean by this is that in cases where there was something that was not crucial to the job listing, such as the blurb, the only thing I ensure is that some(at least 1) job listing is returning a blurb of some kind. The data that I considered essential was the job id and the job link, since without these two, we don't have a job to apply to. The job id test test_get_id is the most strict test because without a valid job id we do not have a job at all. By most strict I mean if any one job posting fails to yield an id, the entire pytest suite fails. Note that I don't test tags at all because they are added "bonus" data we give to users, and many sites don't even yield any tags whatsoever. Hopefully this is clear in the tests on indeed.py.

Another challenge I came across was the fact that the way we were fetching data such as date, id, title, etc was not very well suited for testing in the sense that did not have these functionality in functions. Because of this I found myself copy pasting this code for all of the tests, and that's not a very efficient or maintainable way to write our tests. So I took the liberty of refactoring all of this functionality and writing wrapper functions for it and making them part of jobfunnel.py and indeed.py. You can see the details of this refactoring change on this commit b33c18c329e55894a35313651586128cef87b89e. Hopefully we can implement this type of refactoring across all of the scrapers and make that code a little easier to maintain.

readme updates and Docker

I have updated the readme and added some information about our new GlassDoor dynamic and static scheme. Hopefully that is clear enough for users.

I was wondering what your thoughts are on adding a Ubuntu Xenial Docker image pre-built for JobFunneel where developers can pull from and have a clean environment to do testing on? I do testing like this all the time to make sure my development environment does not affect JobFunnel testing and I think other people could benefit from having an image on Docker hub where they don't have to install all of the python dependencies manually and can just do docker pull or something like that.

Hope these changes make sense, Cheers!

Context of change

Please add options that are relevant and mark any boxes that apply.

Type of change

Please mark any boxes that apply.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

Checklist:

Please mark any boxes that have been completed.

codecov-commenter commented 4 years ago

Codecov Report

Merging #82 into master will increase coverage by 12.56%. The diff coverage is 80.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##           master      #82       +/-   ##
===========================================
+ Coverage   45.80%   58.37%   +12.56%     
===========================================
  Files          13       13               
  Lines        1109     1146       +37     
===========================================
+ Hits          508      669      +161     
+ Misses        601      477      -124     
Impacted Files Coverage Δ
jobfunnel/jobfunnel.py 47.68% <50.00%> (+15.18%) :arrow_up:
jobfunnel/indeed.py 87.32% <94.11%> (+59.22%) :arrow_up:
jobfunnel/glassdoor_dynamic.py 16.52% <0.00%> (+16.52%) :arrow_up:
jobfunnel/__main__.py 35.89% <0.00%> (+35.89%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update df8b8a8...8684f9d. Read the comment docs.

thebigG commented 4 years ago

looking good!

only nitpick is we should wrap lines at 100 chars, but I dont want that to hold this up.

Yes! Just changed editors(using geany now) recently 😅 Will be sure to enforce this in the future.

bunsenmurder commented 4 years ago

Hey @thebigG really appreciate your addition of testing to the code-base, definitely makes the project seem more professional. Although the pull has already been merged, I left a comment regarding a change to the Regex code in indeed.py which could have a performance impact.

thebigG commented 4 years ago

Although the pull has already been merged, I left a comment regarding a change to the Regex code in indeed.py which could have a performance impact.

I saw. Thanks a lot for pointing this out! I don't have time right now to look at these. I will have a new schedule/routine so need some time to re-adjust and find a new time slot for JobFunnel.