PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License

misuse of abstract base classes + monolithic JobFunnel class + schema validation + localisation #85

Closed PaulMcInnis closed 3 years ago

PaulMcInnis commented 4 years ago

Description

Currently we are using the JobFunnel class for too much; I want to break it down into the following:

from abc import ABC, abstractmethod
from typing import List
import datetime


class Job(object):
    def __init__(self, title: str, company: str, location: str,
                 tags: List[str], post_date: datetime.date,
                 key_id: str, url: str) -> None:
        ...


class Scraper(ABC):

    @abstractmethod
    def scrape(self) -> List[Job]:
        pass


def main():
    # instantiate scrapers
    # run filter on list of Job
    # dump pickle
    # write out CSV
    ...

Note: if I get to it, I'd also like our filters to be an ABC.
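A filter ABC in the same spirit might look like this (a rough sketch; `Filter`, `BlocklistFilter`, and the method names are assumptions for illustration, not settled API):

```python
from abc import ABC, abstractmethod
from typing import List


class Filter(ABC):
    """Base class for anything that removes Jobs from the pipeline."""

    @abstractmethod
    def filter(self, jobs: List["Job"]) -> List["Job"]:
        """Return only the jobs that pass this filter."""


class BlocklistFilter(Filter):
    """Illustrative filter: drop jobs from blocked companies."""

    def __init__(self, blocked_companies: List[str]) -> None:
        self.blocked = {c.lower() for c in blocked_companies}

    def filter(self, jobs):
        return [j for j in jobs if j.company.lower() not in self.blocked]
```

Concrete filters (deduplication, blocklists, date cutoffs) would then be interchangeable in `main()`.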

Steps to Reproduce

This is a structural technical debt issue. (n/a)

Expected behavior

The abstract base class should not be only halfway abstract; we need separation between JobFunnel, main(), and the inherited scrapers.

Actual behavior

Because JobFunnel is monolithic and half-abstract, we have ended up with three script-like scrapers that share too many methods, and no actual Job object.

Environment

n/a


Current Status:


Future work:

PaulMcInnis commented 4 years ago

Since I have a long weekend and some beers ahead of me, I am going to take this on.

If you would like some design input, please speak up ASAP.

The result should be easier to work with, but without creating breaking changes.

:beers:

PaulMcInnis commented 4 years ago

I am working on the branch ABCJobFunnel

thebigG commented 4 years ago

This is awesome, Paul! Much needed improvement. When I was doing Indeed testing, I had similar thoughts about our current model in regards to, as you say, the misused abstractions + monolithic model. What I was working on, which is similar, is here --> https://github.com/thebigG/JobFunnel/commit/dc6b31714f12e881100dae3753471ae361a948f4

Not sure if it will help because your ABC implementation looks well thought out and polished.

I'm really excited to see this change being merged into master! I don't know how long this implementation will take to finish but I'll hold off testing, at least for the scrapers, until these architecture changes come through so we don't end up having tests we'll have to refactor entirely within the month :sweat_smile: .

Really excited for these changes :rocket:

thebigG commented 4 years ago

I'm curious about something: you want to have a class per scraper per country? Meaning if scraper X supports countries A and B, with this new model we would have class BaseXScraper, class XScraperA, and class XScraperB? Am I understanding the thinking with this model correctly? I ask because, at least to me, this model makes a lot of sense. I like it a lot because it handles internationalization very well, at least at the ABC level. And this might be the way to go, because (just like we've talked about in the past regarding this issue) it's hard to assume that job sites will play nicely across different countries when scraping.

Hope this question makes sense.

Cheers!

PaulMcInnis commented 4 years ago

Yeah, defo check out the code as it is rn in the scrapers dir for details, but I've ended up with a BaseIndeedScraper(Scraper) containing all the logic, and then, to define the headers and locales, very simple inherited classes for Canadian and American Indeed, though in this case I think they are the same.

Defo recommend holding off on making changes for a bit; if you have some time we can work together later to re-up any missing coverage. It should be easier to test a lot of this.
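The per-locale pattern described above could be sketched roughly like this (class names, attributes, and domains are illustrative guesses, not the actual branch code; the shared request/parse logic is omitted):

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class Scraper(ABC):
    @abstractmethod
    def scrape(self) -> List[dict]:
        ...


class BaseIndeedScraper(Scraper):
    """All the shared Indeed scraping logic lives here; subclasses
    only override locale-specific details."""
    domain: str = ""
    headers: Dict[str, str] = {}

    def scrape(self) -> List[dict]:
        # real logic would build requests from self.domain / self.headers
        return []


class IndeedScraperCA(BaseIndeedScraper):
    """Canadian Indeed: only the locale details differ."""
    domain = "ca.indeed.com"
    headers = {"Accept-Language": "en-CA"}


class IndeedScraperUS(BaseIndeedScraper):
    """American Indeed."""
    domain = "www.indeed.com"
    headers = {"Accept-Language": "en-US"}
```

Adding a new locale then means a few class attributes rather than copying a whole scraper.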

thebigG commented 4 years ago

Yeah, definitely! I have a better grip on my schedule now, so I should be available throughout the week. I'll be merging the ABC branch with my testing branch (https://github.com/thebigG/JobFunnel/tree/testing) and updating tests accordingly with your changes :+1: .

PaulMcInnis commented 4 years ago

OK, I'm just about happy with it, leaving FIXMEs as I go. I've got all the functionality working for Indeed right now.

I've built it so that there are no changes to the CSV header, so as not to surprise people upgrading. We can always do a major upgrade later, but we ought to provide an upgrade script.

Unfortunately it's a pretty hefty change rn; it will make diffs complex for any open branches.

thebigG commented 4 years ago

I saw that you removed config_factory and change_nested_dict, which makes a lot of sense because those functions are only used by the testing suite. If you don't mind, I'll be placing those functions inside conftest.py so the testing suite can use them, since all of our tests depend on them. This all assumes that the way the command-line arguments are parsed won't change (for the most part), aside from some code re-organization(?).

PaulMcInnis commented 4 years ago

Bringing Monster up to speed; opted to keep the existing CLI args and YAML for now.

In the future we will want to have our schema validation using the localization somehow.

I want the settings to be such that you specify a country and language, so that we choose the scraper (if it exists), versus specifying the scraper directly with a convoluted name. Might require some registries or something.

Still work to do! I will definitely appreciate help getting testing back up to speed once the PR is in a better state.
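The country/language-to-scraper lookup floated above could work as a simple registry (a sketch under assumed names; the actual mechanism was still undecided at this point):

```python
from typing import Dict, Tuple, Type

# maps (COUNTRY, language) -> scraper class
SCRAPER_REGISTRY: Dict[Tuple[str, str], Type] = {}


def register_scraper(country: str, language: str):
    """Class decorator that files a scraper under (country, language)."""
    def wrap(cls):
        SCRAPER_REGISTRY[(country.upper(), language.lower())] = cls
        return cls
    return wrap


@register_scraper("CA", "en")
class IndeedScraperCAEng:  # hypothetical scraper class for the sketch
    pass


def get_scraper(country: str, language: str):
    """Resolve a scraper from settings, or fail with a clear message."""
    try:
        return SCRAPER_REGISTRY[(country.upper(), language.lower())]
    except KeyError:
        raise ValueError(f"No scraper available for {country}/{language}")
```

The settings YAML would then only need `country` and `language` keys, and unsupported combinations fail early with a readable error.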

thebigG commented 4 years ago

I'll be changing the testing suite little by little so that, hopefully, when the ABC re-modeling is done, the testing for the new code base is up to speed with the changes.

PaulMcInnis commented 4 years ago

👌 I'm taking a little longer than I wanted; still redesigning a bit, trying to get the right level of automation and flexibility.

Enjoying working on jobfunnel right now though, this is a good challenge.

PaulMcInnis commented 4 years ago

OK, since I need to add locale to the settings YAML, I'm also going to make some improvements to how we handle defaults and validate them.
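Defaults handling and validation for the settings YAML could look roughly like this (plain-dict sketch; the key names such as `locale` and `delay` are assumptions about the eventual schema, not the shipped one):

```python
# assumed default schema, for illustration only
DEFAULT_SETTINGS = {
    "locale": "CANADA_ENGLISH",
    "delay": {"max_duration": 5.0, "random": False},
}


def apply_defaults(user: dict, defaults: dict = DEFAULT_SETTINGS) -> dict:
    """Recursively overlay user-provided settings onto the defaults,
    so a partial YAML file still yields a complete config."""
    merged = dict(defaults)
    for key, value in user.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_defaults(value, merged[key])
        else:
            merged[key] = value
    return merged


def validate(settings: dict) -> None:
    """Reject obviously bad values before any scraping starts."""
    if settings["delay"]["max_duration"] <= 0:
        raise ValueError("delay.max_duration must be positive")
```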

PaulMcInnis commented 4 years ago

Ok, I am now happy with the BaseScraper and IndeedScraper.X design.

I recommend checking those out for a clearer idea of how we can test with them; it's pretty flexible, and it makes writing the scrapers pretty quick, with loads of validation.

Big TODOs rn are to get Glassdoor and Monster going, and to update all the documentation.

Open to feedback ofc.

It'll be a new major revision, because I have redesigned the YAML and CLI to better reflect the localisation that we now support.

thebigG commented 4 years ago

Awesome. Will be updating the testing suite ASAP.

thebigG commented 4 years ago

Looking through the code, there is a lot going on; a lot has changed, so it'll definitely take me some time to catch up :sweat_smile:. I suspect that testing is going to change drastically with all these changes, so when I'm all caught up I'll update the testing suite to adhere to the new OOP model. It's looking very good, by the way! I really like the encapsulation; it'll definitely allow us to extend and maintain JobFunnel with far more ease going forward.

PaulMcInnis commented 4 years ago

Yeah, it's a big one. Really I should have separated it out a bit more, but I didn't wanna compromise.

Would super appreciate testing help. Indeed and Monster are now in a good state, but take your time. I'll fix the build and implement some basic testing before I open the PR.

This will give us internationalization finally too 👍

PaulMcInnis commented 4 years ago

Status: (moved checklist to main desc for visibility)

PaulMcInnis commented 4 years ago

Making progress, just a bit slow because I need to fix the static Glassdoor scraping, which seems to yield a high number of duplicates. Still investigating, but I added wage scraping while I was working on it.

Getting faster at using Chrome's inspect-element with Beautiful Soup 4 to make short work of issues.

Also added a remote field, since with COVID remote work is becoming the new norm and we should support scraping it.

thebigG commented 4 years ago

The new code base is making a lot of sense. I will very soon start testing the new parser, which is on cli.py. Yeah, the remote field makes a lot of sense.

PaulMcInnis commented 4 years ago

RE: the dynamic and static Glassdoor implementations: it makes more sense for us to use the webdriver only when we get CAPTCHA'd, to bypass the CAPTCHA, or to deal with issue #87 . Due to that issue I've also disabled Glassdoor for now.

I've also added REMOTE and WAGE job fields, as this information is available on Indeed and Monster.
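The static-first, webdriver-on-CAPTCHA idea amounts to a fallback check along these lines (a sketch: the CAPTCHA markers are guesses, and the fetch callables are injected so this stays framework-agnostic rather than depending on any particular webdriver API):

```python
# substrings that suggest the static GET hit a CAPTCHA wall (illustrative)
CAPTCHA_MARKERS = ("g-recaptcha", "captcha-delivery", "are you a robot")


def needs_webdriver(html: str) -> bool:
    """Heuristic: did we get a CAPTCHA page instead of search results?"""
    lowered = html.lower()
    return any(marker in lowered for marker in CAPTCHA_MARKERS)


def fetch_page(url: str, static_get, webdriver_get) -> str:
    """Try the cheap static fetch first; fall back to the (slow, heavy)
    webdriver only when the response looks like a CAPTCHA."""
    html = static_get(url)
    if needs_webdriver(html):
        html = webdriver_get(url)
    return html
```

This keeps the common path fast and only pays the webdriver cost when a scrape is actually blocked.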

PaulMcInnis commented 4 years ago

@thebigG

> The new code base is making a lot of sense. I will very soon start testing the new parser, which is on cli.py. Yeah, the remote field makes a lot of sense.

Glad to hear it! I think this will make it alot easier to maintain and we can see about getting more internationalized scrapers once this goes in.

FYI, I just pushed some further fixes for cli.py which improve the schema defaults and CLI interaction.

PaulMcInnis commented 4 years ago

Ended up spending some time with the TFIDF vectorizer again; need to streamline filtering in general. I'm also improving the error handling in this method.

Ideally we want to be able to preempt individual jobs before we are getting/setting any delayed stuff. This is achievable, but warrants a class-based TFIDF tool.

I also went through my todo list and captured everything. This is getting really big, but I am aiming to open a PR by the end of this weekend, as most of the heavy lifting is now done.

Review the TODO for an idea of what will still be changing.
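A class-based duplicate detector in the direction described above might be shaped like this. It is a self-contained sketch using hand-rolled term-frequency cosine similarity (a simplification of TF-IDF, with no IDF weighting and no external library); the class name and threshold are assumptions:

```python
import math
from collections import Counter
from typing import Dict, List


class TFIDFDuplicateFilter:
    """Flags job descriptions whose cosine similarity to any
    previously-seen description exceeds a threshold."""

    def __init__(self, threshold: float = 0.75) -> None:
        self.threshold = threshold
        self.seen: List[Dict[str, float]] = []  # unit vectors of past jobs

    @staticmethod
    def _vectorize(text: str) -> Dict[str, float]:
        """Bag-of-words term counts, normalised to a unit vector."""
        counts = Counter(text.lower().split())
        norm = math.sqrt(sum(c * c for c in counts.values()))
        return {w: c / norm for w, c in counts.items()}

    @staticmethod
    def _cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
        return sum(v * b.get(w, 0.0) for w, v in a.items())

    def is_duplicate(self, description: str) -> bool:
        """True if this description matches one we've already seen;
        otherwise remember it and let it through."""
        vec = self._vectorize(description)
        if any(self._cosine(vec, s) >= self.threshold for s in self.seen):
            return True
        self.seen.append(vec)
        return False
```

Holding the vectors in a class instance is what allows preempting individual jobs one at a time, instead of batch-vectorizing everything at the end.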

PaulMcInnis commented 4 years ago

Big update to filtering: we now emit lists of DuplicateJob, which allows us to handle updates to our master CSV when we see new duplicates.
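The emitted records could be as simple as the following (a sketch; the field names are guesses at what a `DuplicateJob` needs in order to update the matching master-CSV row):

```python
from dataclasses import dataclass


@dataclass
class DuplicateJob:
    """Pairs a freshly-scraped duplicate with the master-CSV job it
    matches, so the master row can be refreshed (newer post date,
    updated URL, etc.)."""
    original_key_id: str   # key of the job already in the master CSV
    duplicate_key_id: str  # key of the freshly scraped duplicate
    match_type: str        # e.g. "key_id" (exact) or "description" (TFIDF)
```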

PaulMcInnis commented 4 years ago

Reviewed the current open issues; this also addresses / makes it possible to fix:

#83 - added preemption for job get/set when an attribute fails a check.

#12 - added warnings and handling for empty search results.

#60, #45, #37 - added support for localised scrapers with custom URLs and customisation of scraping without needing to copy/re-write lots of code (via inheritance).

PaulMcInnis commented 4 years ago

Identified a major bug: we need to manage getting and setting job fields which have dependencies.

i.e. worker 1 needs Job.KEY_ID to be populated so it can add a Job.DESCRIPTION, but worker 2 is getting or setting other things.

I'm going to try to resolve this using asyncio, so that workers can perform non-blocking waits for the fields they need to be populated.

Ideally we would have clear dependencies configured so we could calculate a dependency map and work leaves-first, but I think await will suffice.
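The non-blocking wait described above maps naturally onto an `asyncio.Event` per field. A minimal sketch (the field names follow the KEY_ID/DESCRIPTION example in the comment; everything else is illustrative, not the branch code):

```python
import asyncio


class AsyncJob:
    """Job whose fields can be awaited: a worker that needs KEY_ID
    parks itself (without blocking the event loop) until another
    worker has set it."""

    def __init__(self) -> None:
        self._values = {}
        self._events = {}

    def _event(self, field: str) -> asyncio.Event:
        return self._events.setdefault(field, asyncio.Event())

    def set_field(self, field: str, value) -> None:
        self._values[field] = value
        self._event(field).set()  # wake anyone awaiting this field

    async def get_field(self, field: str):
        await self._event(field).wait()  # non-blocking wait
        return self._values[field]


async def demo() -> str:
    job = AsyncJob()

    async def worker_2():
        await asyncio.sleep(0.01)        # busy with other get/sets first
        job.set_field("KEY_ID", "abc123")

    async def worker_1():
        key = await job.get_field("KEY_ID")  # waits for worker_2
        return f"description for {key}"

    results = await asyncio.gather(worker_1(), worker_2())
    return results[0]
```

With `await`, no explicit dependency map is needed: each worker simply suspends until its prerequisite field exists.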

PaulMcInnis commented 4 years ago

In resolving this synchronisation issue, I will also need to perform on-the-fly delaying, plus a queue for GET requests so that they can be handled in one place. This has the advantage of not requiring the scraper to define which get/set operations require delaying, and should let us run at maximum speed (i.e. delays are exactly the minimum).

The remaining issue is how to share a Job amongst the workers so that, when we are setting one job field using another, we don't need to re-calculate the prerequisite. So far the only solutions I can think of are some kind of queue or event with callbacks for emitting worker results, or a job-wise cache with access guarded by a mutex.
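The mutex-guarded cache option could be sketched like so (a threading version, purely illustrative; which approach the branch actually took was still open at this point):

```python
import threading
from typing import Callable, Dict


class JobFieldCache:
    """Per-job cache of computed fields, guarded by a lock so that two
    workers never recompute the same prerequisite field."""

    def __init__(self) -> None:
        self._fields: Dict[str, object] = {}
        self._lock = threading.Lock()
        self.computations = 0  # exposed here just to make the sketch observable

    def get_or_compute(self, field: str, compute: Callable[[], object]):
        """Return the cached value for `field`, computing it exactly
        once even under concurrent access."""
        with self._lock:
            if field not in self._fields:
                self._fields[field] = compute()
                self.computations += 1
            return self._fields[field]
```

The trade-off versus the callback/queue option is simplicity: a lock-guarded dict is easy to reason about, but it serialises workers whenever a computation is in flight.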

thebigG commented 4 years ago

I have already started testing. Like I said before, I'm starting with cli.py and will take it from there.

Love the new JobFunnel architecture by the way, great job :rocket:

PaulMcInnis commented 4 years ago

@thebigG FYI, I've just fixed a bug with the CLI parser where YAML paths were not being respected.

PaulMcInnis commented 4 years ago

Almost there, just need to do some testing to discover all the bugs I don't know about yet.

Feel free to try it out anyone, if you find a bug post about it here.

PaulMcInnis commented 4 years ago

Fixed another bug I found where non-existent config keys were being accessed; pushed already @thebigG

Starting to wrap up the work now, but I need to add a Session class that will handle respectful delaying so that workers cannot perform staggered delays.

PaulMcInnis commented 4 years ago

OK, looks like there are still CLI parsing bugs... going to need to write some actual unit tests to fix this properly. Will spend some time on this in the coming week so we can prepare to merge.

Interested in getting this out within the next 2 weeks so that we can fix the broken scrapers on the current release.