dataworks / smarthire

SmartHire - An open source applicant prioritization system powered by machine learning
Apache License 2.0

Near duplicate detection #94

Closed by davidmezzetti 8 years ago

davidmezzetti commented 8 years ago

Cryptographic hashes (MD5, SHA-1, SHA-256) and password hashes such as bcrypt are designed so that even a small change in the input produces a completely different output. Other hash functions, known as locality-sensitive hashes (MinHash is one example), do the opposite: slightly different inputs produce similar outputs. Comparing those outputs can be used for near-duplicate detection.

Example libraries that support this: https://github.com/codelibs/elasticsearch-minhash https://github.com/codelibs/minhash
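
To make the idea concrete, here is a minimal MinHash sketch in Python. It is not the API of either library linked above and not anything SmartHire implements. The function names (`shingles`, `minhash_signature`, `estimated_similarity`), the 3-word shingle size, the 128 hash functions, and the sample texts are all assumptions chosen for illustration. SHA-1 with a seed prefix is used here only as a convenient source of independent hash functions.

```python
import hashlib


def shingles(text, k=3):
    """Return the set of k-word shingles in text (lowercased)."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _seeded_hash(seed, item):
    """Hash an item under a given seed, returning a 64-bit integer."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum value over all items."""
    return [min(_seeded_hash(seed, item) for item in items)
            for seed in range(num_hashes)]


def estimated_similarity(sig_a, sig_b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical sample texts, for illustration only.
original = "experienced software engineer with ten years of java and python development in large distributed systems"
near_dup = "experienced software engineer with ten years of scala and python development in large distributed systems"
unrelated = "registered nurse seeking a part time position at a regional hospital with flexible weekend shifts available"

sig_original = minhash_signature(shingles(original))
sig_near_dup = minhash_signature(shingles(near_dup))
sig_unrelated = minhash_signature(shingles(unrelated))

print(estimated_similarity(sig_original, sig_near_dup))
print(estimated_similarity(sig_original, sig_unrelated))
```

The near-duplicate pair should score well above the unrelated pair, which is the property that makes this kind of hash usable for flagging near-duplicate documents. A threshold for what counts as a duplicate would have to be tuned on real data.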

davidmezzetti commented 8 years ago

Ran out of time for this one.