PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License
1.81k stars, 212 forks

Blurbs are still being retrieved for filtered out jobs #83

Closed: bunsenmurder closed this issue 3 years ago

bunsenmurder commented 4 years ago

Description

Currently the scraper is still retrieving blurbs for jobs that have been filtered out by the _prefilter method.

Steps to Reproduce

  1. Run JobFunnel with any query, and make sure the results are saved to a directory that does not contain a master_list.csv or duplicate_list.csv file.
  2. Run the scraper again and take note of the number of unique jobs found by _prefilter, then count the number of individual jobs that are actually scraped. You should notice that the two numbers don't match.

Expected behavior

The scraper should remove the jobs identified by _prefilter, and only obtain blurbs for the remaining jobs.

Actual behavior

The scraper retrieves blurbs for all jobs whether they were filtered out or not.

To fix the issue, the order of the creation of the _scrapelist and the call to the _prefilter method would have to be switched. The attached screenshot highlights the issue in the code and the debugger output: [image]

Although this could have been fixed in a pull request, making this change would break _datefilter, which is called by the _prefilter method in the main JobFunnel class.
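
As a rough illustration of the ordering described above, here is a minimal sketch. It is not JobFunnel's actual code: prefilter, scrape_blurbs, fetch_blurb, known_ids and the job dictionaries are all hypothetical stand-ins, and the _datefilter interaction mentioned above is not modelled. The only point is that filtering runs before the scrape list is built, so blurbs are fetched for the surviving jobs only.

```python
from typing import Callable, Dict, List, Set

Job = Dict[str, str]


def prefilter(jobs: List[Job], known_ids: Set[str]) -> List[Job]:
    """Hypothetical stand-in for _prefilter: drop jobs whose id is already known."""
    return [job for job in jobs if job["id"] not in known_ids]


def scrape_blurbs(
    jobs: List[Job],
    known_ids: Set[str],
    fetch_blurb: Callable[[Job], str],
) -> List[Job]:
    """Filter first, then build the scrape list from the survivors only."""
    scrape_list = prefilter(jobs, known_ids)
    for job in scrape_list:
        # The (expensive) blurb request only happens for jobs that passed the filter.
        job["blurb"] = fetch_blurb(job)
    return scrape_list


# Usage with a fake fetcher that records which jobs were requested.
fetched: List[str] = []


def fake_fetch(job: Job) -> str:
    fetched.append(job["id"])
    return f"blurb for {job['id']}"


jobs = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
result = scrape_blurbs(jobs, known_ids={"b"}, fetch_blurb=fake_fetch)
```

With this ordering, the number of blurb requests equals the number of jobs that pass the filter, which is the match the reproduction steps check for.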

PaulMcInnis commented 4 years ago

Thank you for the detailed write-up!

(looks like it's time to do some more thorough code review in the codebase)

PaulMcInnis commented 4 years ago

ah oops should have done this before I drafted a release just now. Need to fix this and some other behaviour issues and up the sub-rev.

bunsenmurder commented 4 years ago

Perfect timing actually, I was gonna make a pull request with some fixes I made.

PaulMcInnis commented 4 years ago

ah nice! glad to hear it!

Feel free to up the rev to 2.1.9