PaulMcInnis / JobFunnel

Scrape job websites into a single spreadsheet with no duplicates.
MIT License

add TFIDF cosine similarity #19

Closed PaulMcInnis closed 5 years ago

PaulMcInnis commented 5 years ago

This adds a tool for detecting and removing duplicate job postings.

itSeez commented 5 years ago

fixes #16

itSeez commented 5 years ago

Don't forget to update the version number ;)

studentbrad commented 5 years ago

This is great 💯 I am just curious if there is a case where two job postings may be flagged as similar but are in fact different. Am I understanding TFIDF cosine similarity correctly?

PaulMcInnis commented 5 years ago

Yeah, it's possible: if a posting contained 10 words and 8 of those words were shared with an existing posting, the new posting would be considered similar. That said, blurbs are usually longer. I can add a minimum comparison size or something similar?
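To make the 10-word case concrete, here is a minimal sketch in plain Python. It is not the code in this PR: the whitespace tokenizer, the smoothed IDF formula, the `threshold` of 0.6, and the `min_words` guard of 20 are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {word: tf-idf weight} dict per document (illustrative only)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    # Smoothed IDF (an assumed formula, not necessarily what the PR uses).
    idf = {word: math.log((1 + n_docs) / (1 + count)) + 1.0
           for word, count in df.items()}
    return [{word: tf * idf[word] for word, tf in Counter(tokens).items()}
            for tokens in tokenized]


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * vec_b.get(word, 0.0) for word, weight in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_duplicate(blurb_a, blurb_b, threshold=0.6, min_words=20):
    """Flag two blurbs as duplicates, skipping blurbs shorter than min_words.

    threshold and min_words are hypothetical defaults for this sketch.
    """
    if min(len(blurb_a.split()), len(blurb_b.split())) < min_words:
        return False  # too short to compare reliably
    vec_a, vec_b = tfidf_vectors([blurb_a, blurb_b])
    return cosine_similarity(vec_a, vec_b) >= threshold


# Two different jobs whose 10-word blurbs share 8 words.
job_a = "junior python developer wanted for our toronto office apply today"
job_b = "senior python developer wanted for our ottawa office apply today"

vec_a, vec_b = tfidf_vectors([job_a, job_b])
score = cosine_similarity(vec_a, vec_b)
```

With these assumptions the two blurbs score about 0.67, so `is_duplicate(job_a, job_b, min_words=0)` flags them as similar even though they are different jobs, while the default `min_words=20` guard skips the comparison and returns `False`.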

studentbrad commented 5 years ago

@PaulMcInnis I assume you know what you're doing. I just wanted to see if something like that is possible but I guess the likelihood of that is next to none.