PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License
1.81k stars 212 forks source link

Glassdoor Dynamic/Static Switching #77

Closed Arax1 closed 4 years ago

Arax1 commented 4 years ago

Glassdoor Dynamic/Static Switching

Hello everyone! Here's my code for letting the user choose between Dynamic Scraping and Static Scraping for Glassdoor

Description

Basically, the user can now input a new command line argument that tells the Glassdoor scraper which scraping process to actually go with (the options being dynamic and static).

To choose between dynamic or static, the user can input "-g" into the command line while running Jobfunnel and input either "dynamic" or "static" like below

funnel -g dynamic

Major Code Changes

The main change I made to the base code was changing the original glassdoor.py into a generic glassdoor_base.py, that both the dynamic (GlassDoorDynamic) and the static (GlassDoorStatic) scraper can inherit from. GlassDoorDynamic is largely the same original glassdoor scraper, while GlassDoorStatic is an older version of the scraper that I took from commit c1fb87940a20a3cc4479352740579b23a7fa9539.

After changing the validation options and parser to allow a new -g argument, I went into main and added a specific GLASSDOOR list for the dynamic and static child classes....

GLASSDOOR = {
    "static": GlassDoorStatic,
    "dynamic": GlassDoorDynamic
}

Then checked to see if the program was parsing the glassdoor option, and switched to the glassdoor specific providers....

 for p in config["providers"]:
        if "glassdoor" in p:
            provider: Union[GlassDoorDynamic,
                GlassDoorStatic] = GLASSDOOR[config["glassdoorsetting"]](config)

       else:
           provider: Union[GlassDoorBase, Monster,
                           Indeed] = PROVIDERS[p](config)

If the user does not input any -g option, then by default the static option for scraping glassdoor is chosen.

As of right now, this is specifically a fix that lets the user only choose weather they want a dynamic or static scraper for glassdoor, hence why I chose -g as the settings options. If it comes down the line that more providers have a dynamic/static option, then in the future I can easily update this code to have a list of dynamic providers and static providers to choose from. For now, hopefully this is a good solution to let the user switch between Dynamic and Static Glassdoor scraping.

Context of change

Please add options that are relevant and mark any boxes that apply.

Type of change

Please mark any boxes that apply.

How Has This Been Tested?

Please describe the tests that you ran to verify your changes. Provide instructions so we can reproduce. Please also list any relevant details for your test configuration.

Checklist:

Please mark any boxes that have been completed.

codecov-commenter commented 4 years ago

Codecov Report

Merging #77 into master will decrease coverage by 1.24%. The diff coverage is 33.33%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #77      +/-   ##
==========================================
- Coverage   46.96%   45.72%   -1.25%     
==========================================
  Files          11       13       +2     
  Lines         990     1111     +121     
==========================================
+ Hits          465      508      +43     
- Misses        525      603      +78     
Impacted Files Coverage Δ
jobfunnel/__main__.py 0.00% <0.00%> (ø)
jobfunnel/glassdoor_dynamic.py 0.00% <0.00%> (ø)
jobfunnel/glassdoor_static.py 27.88% <27.88%> (ø)
jobfunnel/glassdoor_base.py 58.62% <58.62%> (ø)
jobfunnel/config/valid_options.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update cb9f152...461aca3. Read the comment docs.

Arax1 commented 4 years ago

Hi @Arax1! Could you update the tests that use "GlassDoor" to GlassDoorStatic so that they pass :)?

Something that I do wonder about this is that if the approach of having a flag that is specific to the scraper might just lead to more complexity in the future. In the sense that, if we were to implement this switch for Monster, Indeed or any other scrapers, this seems to me like you would have to have an option for each scraper, which might increase code/complexity of the parser.

I wonder if we might be able to prevent this further complexity by making it something like funnel -dy GlassDoor("dy" is short for dynamic) where this option can be used for any scraper. I know you alluded to this, but having static as a choice to me feels redundant because we're only doing dynamic when we have to, which is only when static scraping breaks for some reason. So I guess what I'm trying to say is that I wonder if making the choices argument of the parser for this option all of the available scrapers, something like [Monster, Glassdoor, Indeed] will prevent this option from basically being duplicated for each scraper.

These are just my thoughts when it comes to this.

Excited about this change and what the community's view is on this!

I fixed the issue and changed it to use the glassdoor_static scraper to pass the tests.

thebigG commented 4 years ago

I think maybe I am not fully understanding the difference between the dynamic and static scrapers, but from discussion it sounds like the Static one works but the Dynamic one is still a beta feature that we are intending to improve (@thebigG confirm?)

Can confirm both(dynamic and static) GlassDoor scrapers work. Tested them on my machine and an isolated container(Ubuntu 16.04).

PaulMcInnis commented 4 years ago

Ok, perhaps we can keep both and phase out the static one perhaps.

If its for testing I think its fine to have a flag —experimental which runs any experimental scrapers, like the ones we were discussing in the other open PR.

Since dynamic scraper is good we can default to it.

On Wed, Jun 3, 2020 at 6:51 PM Lorenzo B Gomez notifications@github.com wrote:

I think maybe I am not fully understanding the difference between the dynamic and static scrapers, but from discussion it sounds like the Static one works but the Dynamic one is still a beta feature that we are intending to improve (@thebigG https://github.com/thebigG confirm?)

Can confirm both(dynamic and static) GlassDoor scrapers work. Tested them on my machine and an isolated container(Ubuntu 14.04).

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/PaulMcInnis/JobFunnel/pull/77#issuecomment-638501683, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAYKY2O4U2QGAXKYCTYOWCDRU3HWNANCNFSM4NPO4AHQ .

--

  • Paul McInnis