PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License

added summarization to blurb #57

Closed studentbrad closed 4 years ago

studentbrad commented 4 years ago

Blurb Summarization :1st_place_medal:

It became obvious that we needed some way of formatting and shortening the blurb. For this I added summarization from gensim. gensim is a solid library for producing quick extractive summaries with a designated number of words.
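For readers unfamiliar with how this kind of summarization works: gensim's `summarize()` uses TextRank under the hood, but the basic idea of extractive summarization (score sentences, keep the best within a word budget) can be sketched with the stdlib alone. The function name and scoring here are illustrative, not JobFunnel's actual code:

```python
import re
from collections import Counter

def summarize_blurb(text: str, word_count: int = 50) -> str:
    """Rough extractive summary: keep the highest-scoring sentences
    (scored by average word frequency) until the word budget is hit.
    gensim's summarize() uses TextRank instead; this is only a sketch."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))

    def score(sentence: str) -> float:
        toks = re.findall(r'\w+', sentence.lower())
        return sum(freq[t] for t in toks) / len(toks) if toks else 0.0

    ranked = sorted(sentences, key=score, reverse=True)
    summary, used = [], 0
    for sent in ranked:
        n = len(sent.split())
        if used + n > word_count and summary:
            break
        summary.append(sent)
        used += n
    summary.sort(key=sentences.index)  # restore original sentence order
    return ' '.join(summary)
```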


bunsenmurder commented 4 years ago

I really like this idea, but we should definitely be doing exploratory data analysis and sharing the results before implementing features that use machine learning like this one. Summarizing the blurb would significantly reduce our duplicate-filter accuracy. We would need to save all of our full blurbs in a file separate from the master list for this to work without affecting duplicate filtering.

PaulMcInnis commented 4 years ago

love this feature @studentbrad !

I agree with @bunsenmurder , we should retain the complete text somewhere so that the data can be used for other things including the similarity filter.

Perhaps we can just have blurb be the shortened text and add a new column to store the complete scraped text?

This greatly improves usability when reading through job postings; we can always just hide the column of raw/scraped text, though storing it elsewhere would be a cleaner option.

studentbrad commented 4 years ago

@bunsenmurder @PaulMcInnis I have taken your advice, but this may be harder to implement than I thought. I want to preserve what is now called the description (aka job['description']), but I do not want it in the .csv; I only want the blurb there. This contradicts how we use the .csv: it is loaded back into a job dictionary, so any column not included in the .csv is left blank (the description). I am unsure how to handle this, because we need to parse the .csv to update statuses. It can be done using a combination of .csv and pickle parsing, but that complicates the project. Thoughts?

markkvdb commented 4 years ago

Maybe something like a relational database: give every job posting an ID and store the original text keyed by that ID, linked back to the job.

bunsenmurder commented 4 years ago

The idea @markkvdb had could work: we could save every job description, keyed by ID, in a master 'database' file.

Then, before running our similarity filter, we match jobs in the master list against our database and replace the blurb with the full description in our dictionary object. After the similarity filter runs, we just apply the summarizer to the final product.
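That flow can be sketched in a few lines. The names `tfidf_filter` and `summarize_blurb` are placeholders for the real duplicate filter and summarizer, and `descriptions` is the hypothetical per-ID 'database':

```python
def filter_then_summarize(masterlist, descriptions, tfidf_filter, summarize_blurb):
    """Sketch: run duplicate detection on full text, summarize afterwards."""
    # 1. swap in the full description so the filter sees all of the text
    for job in masterlist.values():
        job["blurb"] = descriptions.get(job["id"], job["blurb"])
    # 2. run the similarity/duplicate filter on the full text
    masterlist = tfidf_filter(masterlist)
    # 3. summarize only the surviving jobs for the final spreadsheet
    for job in masterlist.values():
        job["blurb"] = summarize_blurb(job["blurb"])
    return masterlist
```

This ordering matters: summarizing first would throw away exactly the text the duplicate filter needs, which was @bunsenmurder's original concern.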

studentbrad commented 4 years ago

I can maintain backward compatibility. The description will be stored elsewhere, and the blurb in masterlist.csv will become a summarised version of it. Sometimes the description cannot be summarised; in that case the description becomes the blurb, aka job['blurb'] = job['description']. Maintaining backward compatibility then works with the opposite logic: if we read masterlist.csv and find no description stored elsewhere for that job ID, the blurb becomes the description, aka job['description'] = job['blurb'].
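Both directions of that logic can be sketched as two small helpers. These are illustrative names, not JobFunnel's actual functions; the `ValueError` fallback assumes a summarizer that raises when the text is too short to summarize (as gensim's does):

```python
def summarize_or_fallback(job: dict, summarize) -> dict:
    """On save: blurb = summary of the description, falling back to the
    full description when summarization fails or returns nothing."""
    try:
        job["blurb"] = summarize(job["description"]) or job["description"]
    except ValueError:
        job["blurb"] = job["description"]
    return job

def restore_description(job: dict, descriptions: dict) -> dict:
    """On load (backward compatibility): if no description was stored for
    this job ID, e.g. an old masterlist.csv, the blurb becomes the description."""
    job["description"] = descriptions.get(job["id"], job["blurb"])
    return job
```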

studentbrad commented 4 years ago

I will close this PR for now until I have found a solution. Anyone is free to make suggestions in the meantime.