PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License

Dynamic web scraping and static web scraping #75

Closed thebigG closed 4 years ago

thebigG commented 4 years ago

Dynamic web scraping and static web scraping

Hello everyone! Hope you are all doing fantastic!

Description

So after changing to dynamic web scraping, it looks like the way the date of each job is displayed on Glassdoor changes. This was breaking the date_filter, but it's working now after some digging around in Glassdoor's HTML source. I also streamlined some of the dynamic scraping workflow so that scraping simply proceeds (instead of holding the program with an input() call) when nothing is blocking it, such as a CAPTCHA requirement.

I went back and tried the static way of scraping Glassdoor; it works sometimes, other times it doesn't. This led me to think about this dynamic/static scraping issue, and I'm wondering whether you think we should enable JobFunnel to support both dynamic and static web scraping. After some thought, I imagined a couple of ways of implementing this:

  1. Outsource to a non-scraping entity (such as tools.py) the ability to tell whoever is building the scraper whether the site currently supports static web scraping. Hopefully an example illustrates this better:

def does_static_work_for_this_site(provider: str) -> bool:
    """Return True if static scraping currently works for this provider."""
    # ...some logic to figure this out
    # return True if it works, False if it doesn't
    ...

From the perspective of a scraper such as glassdoor.py, the implementation might look like this:

def scrape(self):
    if does_static_work_for_this_site("glassdoor"):
        ...  # some static scraping logic
    else:
        ...  # some dynamic scraping logic
    ...  # some more logic

The nice thing about this approach is that anybody writing a new web scraper can add their own implementation to does_static_work_for_this_site, and it is decoupled from the actual scraping, whether that scraping is dynamic or static. The downside is that scrapers will be littered with if...else logic like the above all over the place, which might sacrifice some readability. On the other hand, abstracting this away is hard, as has been mentioned before in regards to the internationalization implementations.

  2. Let the user decide whether to do a dynamic scrape or a static one. If one were to take this approach, one could decouple the dynamic and static implementations entirely from each other. One option that comes to mind is to have a different file for each implementation, for example glassdoor_static.py and glassdoor_dynamic.py.

The upside of this is that the dynamic and static approaches are completely separate from each other, not intertwined in one monolithic file littered with if...else logic. The downside is that the codebase gets more complex in its own way.

In the end, both approaches make the whole codebase more complex. But the reason I propose this is that I wonder whether other sites might behave the way Glassdoor does; who knows if Indeed or Monster might start behaving the same way. I also don't think it's very efficient to have somebody jump in whenever the current approach (whether static or dynamic) breaks and re-write one implementation or the other again. I wonder what you think about this as a way of being prepared for the future, in case we need to add dynamic support for other sites, or in case newly added sites behave much like Glassdoor: sometimes dynamic works, other times static is the way to go.

Sorry for making this long! Just wanted to put thoughts down to see what y'all think! I also wonder if maybe there are better ways of implementing this dynamic/static support, or should it even be implemented at all?

Cheers!


codecov-io commented 4 years ago

Codecov Report

Merging #75 into master will increase coverage by 0.04%. The diff coverage is 50.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #75      +/-   ##
==========================================
+ Coverage   46.96%   47.01%   +0.04%     
==========================================
  Files          11       11              
  Lines         990      989       -1     
==========================================
  Hits          465      465              
+ Misses        525      524       -1     
Impacted Files              Coverage Δ
jobfunnel/glassdoor.py      12.57% <0.00%> (+0.07%)
jobfunnel/tools/tools.py    80.20% <ø> (ø)
jobfunnel/tools/filters.py  93.10% <100.00%> (ø)


studentbrad commented 4 years ago

If you check the JobFunnel Wiki, I've actually done some work on improving methods of scraping. I'm not sure whether we should continue with our current method or move to another method entirely. I used cosine similarity to do dynamic web scraping based on keywords. However, Glassdoor has proved to be the most challenging even for dynamic scraping, because they do not label their tags according to their contents the way Indeed or Monster do.

Check out my branch if you want to see what I believe dynamic scraping should look like: https://github.com/PaulMcInnis/JobFunnel/tree/studentbrad/tfidf_scrape

The other option I have started to look at is to not rely on HTML tags, but to use word vectorizers to process raw text in HTML files.
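The core idea here can be sketched without any HTML-specific logic at all: treat each candidate text fragment pulled from the page as a bag of words and pick the one most similar to a keyword query. The following is a minimal stdlib-only illustration of cosine similarity over word counts; the function names are mine, not JobFunnel's or the tfidf_scrape branch's API, and a real implementation would use TF-IDF weighting rather than raw counts.

```python
import math
from collections import Counter

def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def best_match(fragments: list, keywords: list) -> str:
    """Return the text fragment most similar to the keyword query."""
    query = Counter(w.lower() for w in keywords)
    scored = [
        (cosine_similarity(Counter(f.lower().split()), query), f)
        for f in fragments
    ]
    return max(scored)[1]

# e.g. picking the job-title-like fragment out of raw page text:
fragments = [
    "software engineer toronto",
    "privacy policy and terms of use",
    "sign in to apply",
]
match = best_match(fragments, ["software", "engineer"])  # → "software engineer toronto"
```

The appeal, as noted above, is that this never inspects tag names or classes, so it is insulated from Glassdoor reshuffling its markup.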

thebigG commented 4 years ago

The other option I have started to look at is to not rely on HTML tags, but to use word vectorizers to process raw text in HTML files.

This is really neat! It's a very clever way of doing this, as it's completely decoupled from the scrapers themselves, whether they are Glassdoor or Monster. At the end of the day we just get a dump of HTML; it doesn't matter whether it came from Selenium or static scraping. If you ask me, this seems to be the way to go. If I understand your approach correctly, this is as abstracted as it gets.

PaulMcInnis commented 4 years ago

One thing about JobFunnel is that it relies on the community updating the scrapers to keep working, but I've always had in mind that we should switch to something with more abstract scrapers.

It would be very cool to use the raw HTML and scrape what we need from it without having to navigate the structure of the data (which is changing)

I get the motivation for the static and dynamic scraping. I think it is perhaps best to allow it to be enabled/disabled; this way we can continue to develop and use it without relying on it entirely.

I think it makes the most sense to implement this in the abstract Scraper class as you mention, with an attribute/method that we can implement; perhaps it can default to False, since the existing scrapers are static.
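The suggested flag might look something like the sketch below. All names here are hypothetical (the thread notes the real base class is currently JobFunnel, not BaseScraper); this only illustrates the "attribute defaulting to False" idea.

```python
from abc import ABC, abstractmethod

class BaseScraper(ABC):
    """Hypothetical base class; illustrative only, not JobFunnel's real API."""

    # Defaults to False because all existing scrapers are static.
    use_dynamic: bool = False

    @abstractmethod
    def scrape(self) -> str:
        ...

class IndeedScraper(BaseScraper):
    # Inherits use_dynamic = False: plain HTTP request path.
    def scrape(self) -> str:
        return "static scrape"

class GlassdoorScraper(BaseScraper):
    # Opts in to browser-driven (e.g. Selenium) scraping.
    use_dynamic = True

    def scrape(self) -> str:
        return "dynamic scrape" if self.use_dynamic else "static scrape"
```

A user-facing setting could simply override use_dynamic per scraper, which keeps the dynamic path opt-in while it matures.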

PaulMcInnis commented 4 years ago

excited to see the scraping being made more flexible!

markkvdb commented 4 years ago

I just read the Wiki about the suggested method for scraping websites using the Cosine Similarity Method. This is definitely an innovative way to find the relevant fields on a website, but I am not sure whether it will correctly classify the right fields all the time. Furthermore, supporting more countries will become even harder (I think).

That said, it could definitely be a cool R&D project to see how well this actually works in practice. We could abstract the actual parser of a job listing and simply define two objects: a naive static attribute parser and the novel dynamic similarity parser.

PaulMcInnis commented 4 years ago

yeah we need something more abstract ideally, but realistically we should keep our hard-earned scraping logic for now, unless we can prove that a bag-of-HTML approach is robust enough.

Looks like the discussion of doing scraping this way is more food for thought and less related to this PR, though, which does a more flexible HTML search; definitely an upgrade.

thebigG commented 4 years ago

Furthermore, the added burden of supporting more countries will become even harder (I think).

Indeed. Internationalization could make the vectorizer approach messy, but part of me is really hopeful and optimistic that we could make this approach work for different countries/languages. Just like @markkvdb said, though, it might be best to keep it as a research strategy while we're not 100% sure of its robustness.

If I understood Paul's view on static/dynamic approach correctly:

PaulMcInnis commented 4 years ago

perhaps we can just add an attribute to our base scraper class and have it default to False. I'm not sure it needs to be a method unless it will be based on some logic; if so, it can be a method instead.

Right now our base class is JobFunnel, but in the future it would make more sense to build a BaseScraper, since inheriting from JobFunnel pulls in a lot.

thebigG commented 4 years ago

I'm not sure it needs to be a method unless it will be based on some logic?

I think this might be the crux of the issue here. Initially my thinking was to have some logic by which the scraper would figure out whether static scraping was available. But after reconsidering, I think that approach might make the codebase unnecessarily messy and complex, and it would be much simpler to implement a plain bool attribute like Paul suggested. A simple bool (or even a multi-value type like an enum) attribute is a win-win for both programmers and users: users may choose dynamic scraping over static, or vice versa, while programmers have the flexibility of implementing each scraper how they see fit.

Hopefully the last question I have about this: given that we implement a simple attribute to know whether scraping should be dynamic or static, would it be sensible, in the case of Glassdoor, to have two files, glassdoor_static.py and glassdoor_dynamic.py, one for each implementation?

I lean towards this because it might be easier to maintain as both implementations are clearly separate from each other. On second thought, it might also lead to a good deal of code duplication.

codecov-commenter commented 4 years ago

Codecov Report

Merging #75 into master will increase coverage by 0.09%. The diff coverage is 50.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master      #75      +/-   ##
==========================================
+ Coverage   46.96%   47.06%   +0.09%     
==========================================
  Files          11       11              
  Lines         990      988       -2     
==========================================
  Hits          465      465              
+ Misses        525      523       -2     
Impacted Files              Coverage Δ
jobfunnel/glassdoor.py      12.65% <0.00%> (+0.15%)
jobfunnel/tools/tools.py    80.20% <0.00%> (ø)
jobfunnel/tools/filters.py  93.10% <100.00%> (ø)


PaulMcInnis commented 4 years ago

I think it's ok to have them separate as you describe while we develop the feature. They are essentially different scrapers.

The only consideration I think won't be properly addressed if we split them into entirely separate files is that any shared logic will have to be updated in two places in future PRs.

Perhaps we can just make a GlassdoorScraperBase(ABC) with children GlassdoorStaticScraper(GlassdoorScraperBase) and GlassdoorDynamicScraper(GlassdoorScraperBase),

where GlassdoorScraperBase has all the logic excluding def scrape(...), which is to be implemented by the child classes.
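A rough sketch of that layout might look like the following; the class names come from the suggestion above, but all method bodies are stubs of my own for illustration, not Glassdoor-specific logic from the PR.

```python
from abc import ABC, abstractmethod

class GlassdoorScraperBase(ABC):
    """Holds all shared Glassdoor logic; only scrape() differs per mode."""

    def parse_listing(self, html: str) -> dict:
        # Shared parsing logic both children reuse (stubbed out here).
        return {"provider": "glassdoor", "size": len(html)}

    @abstractmethod
    def scrape(self) -> list:
        """Implemented by the static and dynamic children."""

class GlassdoorStaticScraper(GlassdoorScraperBase):
    def scrape(self) -> list:
        # Would fetch pages with plain HTTP requests here.
        return [self.parse_listing("<html>static page</html>")]

class GlassdoorDynamicScraper(GlassdoorScraperBase):
    def scrape(self) -> list:
        # Would drive a real browser (e.g. via Selenium) here.
        return [self.parse_listing("<html>dynamic page</html>")]
```

With this shape, a future fix to shared parsing lands once in the base class instead of twice across glassdoor_static.py and glassdoor_dynamic.py.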

Looking forward to seeing us take on building more generic/abstract scrapers in the future. It's a challenge for sure, so it makes sense to keep it on the back-burner.

thebigG commented 4 years ago

Perhaps we can just make GlassdoorScraperBase(ABC) with children GlassdoorStaticScraper(GlassdoorScraperBase) GlassdoorDynamicScraper(GlassdoorScraperBase)

This makes a lot of sense. Just wanted to make sure what the future of Glassdoor might be, given how volatile it can be, before writing any tests for it. I think it might actually be better to start implementing the GlassDoorScraperBase classes before moving on with any further testing.

PaulMcInnis commented 4 years ago

Another thought: don't worry too much about getting the test coverage up for the dynamic scraper, given that it's more experimental.

thebigG commented 4 years ago

Just wanted to let you guys know that a good friend of mine has started working on a possible solution to this dynamic/static scraping problem: https://github.com/Arax1/JobFunnel

PaulMcInnis commented 4 years ago

OK, what's the plan for merging this and #77? It looks like the other one needs to go in first and this one gets rebased afterwards.

thebigG commented 4 years ago

It looks like the other one needs to go in first and this one gets rebased afterwards.

Not a problem! I will rebase it and push as soon as #77 gets merged :+1:

thebigG commented 4 years ago

Do you know if this resolves issue #79 ?

I haven't been able to reproduce this issue on Windows 10 or Linux. I will be investigating #79 further as I develop more unit tests for the scrapers; I suspect it is some kind of edge case. I have been testing JobFunnel on a clean Windows 10 machine and it seems to be working OK.

PaulMcInnis commented 4 years ago

Same, I also was unable to reproduce the issue.
