PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License

added summarization to blurb #57

Closed studentbrad closed 4 years ago

studentbrad commented 4 years ago

Blurb Summarization :1st_place_medal:

It became obvious that we needed some way of formatting and shortening the blurb. For this I added summarization from gensim. gensim is a solid library for producing quick extractive summaries with a designated number of words.
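For readers unfamiliar with how this kind of summarization works: gensim's `summarize()` uses TextRank under the hood, but the basic idea of extractive summarization (score sentences, keep the best within a word budget) can be sketched with the stdlib alone. The function name and scoring here are illustrative, not JobFunnel's actual code:

```python
import re
from collections import Counter

def summarize_blurb(text: str, word_count: int = 50) -> str:
    """Rough extractive summary: keep the highest-scoring sentences
    (scored by average word frequency) until the word budget is hit.
    gensim's summarize() uses TextRank instead; this is only a sketch."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    freq = Counter(re.findall(r'\w+', text.lower()))

    def score(sentence: str) -> float:
        toks = re.findall(r'\w+', sentence.lower())
        return sum(freq[t] for t in toks) / len(toks) if toks else 0.0

    ranked = sorted(sentences, key=score, reverse=True)
    summary, used = [], 0
    for sent in ranked:
        n = len(sent.split())
        if used + n > word_count and summary:
            break
        summary.append(sent)
        used += n
    summary.sort(key=sentences.index)  # restore original sentence order
    return ' '.join(summary)
```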


bunsenmurder commented 4 years ago

I really like this idea, but we should definitely be doing exploratory data analysis and sharing the results before implementing features that use machine learning like this one. Summarizing the blurb would significantly reduce our duplicate-filter accuracy. We would need to save all of our full blurbs in a file separate from the master list for this to work without affecting duplicate filtering.

PaulMcInnis commented 4 years ago

love this feature @studentbrad !

I agree with @bunsenmurder , we should retain the complete text somewhere so that the data can be used for other things including the similarity filter.

Perhaps we can just have blurb be the shortened text and add a new column to store the complete scraped text?

This greatly improves usability when reading through job postings; we can always just hide the column of raw/scraped text, though storing it elsewhere would be a cleaner option.

studentbrad commented 4 years ago

@bunsenmurder @PaulMcInnis I have taken your advice, but this may be harder to implement than I thought. I want to preserve what is now called the description (aka job['description']), but I do not want it in the .csv; I only want the blurb there. This contradicts how we use the .csv: it is loaded back into a job dictionary, so any column not included in the .csv is left blank (the description). I am unsure how to handle this, because we need to parse the .csv to update statuses. It can be done using a combination of .csv and pickle parsing, but that complicates the project. Thoughts?

markkvdb commented 4 years ago

Maybe something like a relational database: give every job posting an ID and store the original text keyed by that ID, linked back to the job.

bunsenmurder commented 4 years ago

The idea @markkvdb had could work: we could save every job description, keyed by ID, in a master 'database' file.

Then, before running our similarity filter, we match jobs in the master list against our database and replace the blurb with the full description in our dictionary object. After the similarity filter runs, we just apply the summarizer to the final product.
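That flow can be sketched in a few lines. The names `tfidf_filter` and `summarize_blurb` are placeholders for the real duplicate filter and summarizer, and `descriptions` is the hypothetical per-ID 'database':

```python
def filter_then_summarize(masterlist, descriptions, tfidf_filter, summarize_blurb):
    """Sketch: run duplicate detection on full text, summarize afterwards."""
    # 1. swap in the full description so the filter sees all of the text
    for job in masterlist.values():
        job["blurb"] = descriptions.get(job["id"], job["blurb"])
    # 2. run the similarity/duplicate filter on the full text
    masterlist = tfidf_filter(masterlist)
    # 3. summarize only the surviving jobs for the final spreadsheet
    for job in masterlist.values():
        job["blurb"] = summarize_blurb(job["blurb"])
    return masterlist
```

This ordering matters: summarizing first would throw away exactly the text the duplicate filter needs, which was @bunsenmurder's original concern.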

studentbrad commented 4 years ago

I can maintain backward compatibility. The description will be stored elsewhere, and the blurb in masterlist.csv will become a summarised version of it. Sometimes the description cannot be summarised; in that case the description becomes the blurb, aka job['blurb'] = job['description']. Maintaining backward compatibility then works with the opposite logic: if we read masterlist.csv and find no description stored elsewhere for that job ID, the blurb becomes the description, aka job['description'] = job['blurb'].
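Both directions of that logic can be sketched as two small helpers. These are illustrative names, not JobFunnel's actual functions; the `ValueError` fallback assumes a summarizer that raises when the text is too short to summarize (as gensim's does):

```python
def summarize_or_fallback(job: dict, summarize) -> dict:
    """On save: blurb = summary of the description, falling back to the
    full description when summarization fails or returns nothing."""
    try:
        job["blurb"] = summarize(job["description"]) or job["description"]
    except ValueError:
        job["blurb"] = job["description"]
    return job

def restore_description(job: dict, descriptions: dict) -> dict:
    """On load (backward compatibility): if no description was stored for
    this job ID, e.g. an old masterlist.csv, the blurb becomes the description."""
    job["description"] = descriptions.get(job["id"], job["blurb"])
    return job
```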

studentbrad commented 4 years ago

I will close this PR for now until I have found a solution. Anyone is free to make suggestions in the meantime.