目标和参考资料 - Githubissues

guonaihong commented 4 years ago

目标

用go实现字符串相似度lib
处理中文准确度较高(目前很多老外写的库处理中文效果不佳)
集成多种相似度算法(编辑距离,汉明编码，骰子系数)

莱文斯坦-编辑距离(Levenshtein)

https://zhuanlan.zhihu.com/p/91667128
https://www.jianshu.com/p/a617d20162cf (以上两份参考资料都是创建矩阵，看完算法之后感悟，没有必要创建矩阵，只要缓存x坐标+对角线一个值就行，实现效果一样)
http://richardminerich.com/tag/damerau-levenshtein-distance/ (补充)
Hamming
https://baike.baidu.com/item/%E6%B1%89%E6%98%8E%E8%B7%9D%E7%A6%BB/475174?fr=aladdin

Dice's coefficient

https://blog.csdn.net/gjk0223/article/details/2314844 n个字符算集合一个元素，这点容易忽略，n是可以配置的，很多开源项目都忽略这点。原论文公式是 2 *(a 和b的交集) /(len(a) + len(b))，默认选择2，但是2对中文不太友好

Jaro

TODO

Damerau-Levenshtein - distance & normalized
Jaro and Jaro-Winkler - this implementation of Jaro-Winkler does not limit the common prefix length

补充

https://help.highbond.com/helpdocs/analytics/13/scripting-guide/zh-cn/Content/lang_ref/functions/r_dicecoefficient.htm

参考API设计(取名)

https://github.com/hbakhtiyor/strsim
参考选用了哪些算法名字
https://github.com/dguo/strsim-rs

SummerSec commented 2 years ago

余弦相似度算法考虑实现吗？

guonaihong commented 2 years ago

余弦相似度算法考虑实现吗？

可以, 不过要排https://github.com/guonaihong/gstl 这个项目发布之后了.

SummerSec commented 2 years ago

余弦相似度算法考虑实现吗？

可以, 不过要排https://github.com/guonaihong/gstl 这个项目发布之后了.

OK，我看看能不能学会，康康能不能提交一个pr，simhash也是相似度算法。

guonaihong commented 2 years ago

欢迎pr😊

---原始邮件--- 发件人: @.> 发送时间: 2022年5月24日(周二) 晚上9:12 收件人: @.>; 抄送: @.**@.>; 主题: Re: [antlabs/strsim] 目标和参考资料 (#1)

余弦相似度算法考虑实现吗？

可以, 不过要排https://github.com/guonaihong/gstl 这个项目发布之后了.

OK，我看看能不能学会，康康能不能提交一个pr，simhash也是相似度算法。

— Reply to this email directly, view it on GitHub, or unsubscribe. You are receiving this because you authored the thread.Message ID: @.***>

SummerSec commented 2 years ago

我通过简单实验发现如果将文本进行base64编码之后，相似度结果会更加精确一些。示例代码：

package main

import (
    "encoding/base64"
    "fmt"
    "github.com/antlabs/strsim"
)

func main() {
    en := base64.NewEncoding("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/")
    s1 := "<!DOCTYPE html>\n<html>\n<head>\n<title>Error</title>\n<style>\n                                                       \n    body {\n        width: 35em;\n        margin: 0 auto;\n        font-family: Tahoma, Verdana, Arial, sans-serif;\n    }\n                                                     \n</style>\n</head>\n<body>\n<h1><center>An error occurred.</center></h1>\n<p><center>An error occurred.</center></p>\n<p><center>                                                                                                        </center></p>\n</body>\n</html>"
    s2 := "<html>\n<head><title>404 Not Found</title></head>\n<body>\n<center><h1>404 Not Found</h1></center>\n<hr><center>openresty</center>\n</body>\n</html>\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n<!-- a padding to disable MSIE and Chrome friendly error page -->\n"
    str1 := en.EncodeToString([]byte(s1))
    str2 := en.EncodeToString([]byte(s2))
    // 默认算法的相似度 莱文斯坦-编辑距离
    a1 := strsim.Compare(str1, str2)
    b1 := strsim.Compare(s1, s2)
    fmt.Printf("Defalut Levenshtein  Similarity base64 %f original %f\n", a1, b1)
    // Hamming算法计算出的相似度
    a2 := strsim.Compare(str1, str2, strsim.Hamming())
    b2 := strsim.Compare(s1, s2, strsim.Hamming())
    fmt.Printf("Hamming Similarity base64 %f original %f\n", a2, b2)

    // Dice's coefficient 算法计算出的相似度
    a3 := strsim.Compare(str1, str2, strsim.DiceCoefficient())
    b3 := strsim.Compare(s1, s2, strsim.DiceCoefficient())
    fmt.Printf("Dice's coefficient Similarity  base64 %f original %f\n", a3, b3)

    //jaro 算法计算出的相似度
    a4 := strsim.Compare(str1, str2, strsim.Jaro())
    b4 := strsim.Compare(str1, str2, strsim.Jaro())
    fmt.Printf("jaro Similarity  base64 %f original %f\n", a4, b4)

}

运行测试结果，我想将这个发现提交pr，我觉得是很有必要的。

Defalut Levenshtein Similarity base64 0.111264 original 0.156250 Hamming Similarity base64 0.067308 original 0.084559 Dice's coefficient Similarity base64 0.229599 original 0.286772 jaro Similarity base64 0.521006 original 0.521006

SummerSec commented 2 years ago

另外，我想加入你们组织，不知是否可以？

SummerSec commented 2 years ago

我大致简单说明一下我加入的余弦相似度算法和Jaro-Winkler算法首先是Jaro-Winkler算法，我在看相关文档发现你提出Jaro-Winkler - this implementation of Jaro-Winkler does not limit the common prefix lengt问题的解决方法，相同的前缀最大值为4，也就是p的影响因子的取值范围为[0,4]是一个闭区间，如果大于4取4。在jaro-and-jaro-winkler-similarity有着对应的代码

SummerSec commented 2 years ago

余弦相似度，相关原理就是高中学过的空间向量定理，两个向量夹角的角度。但余弦相似度得计算词出现频率，如果用分词的效率不高，并且计算量，内存开销都会很大。于是我想到采用base64编码的方式，因为base64编码的字符串同一个字符是相同的，也等价的计算词语出现的频率。然后将base64标准字符串进行余弦计算。

package main

import (
    "fmt"
    "github.com/antlabs/strsim"
)

func main() {
    Levenshtein := strsim.Compare("abc", "ab")
    fmt.Printf("Levenshtein Algorithm: %f\n", Levenshtein)
    dice := strsim.Compare("abc", "ab", strsim.DiceCoefficient())
    fmt.Printf("Dice Coefficient Algorithm: %f\n", dice)
    jaro := strsim.Compare("abc", "ab", strsim.Jaro())

    fmt.Printf("Jaro Algorithm: %f\n", jaro)
    jarow := strsim.Compare("abc", "ab", strsim.JaroWinkler())
    fmt.Printf("Jaro Winkler Algorithm: %f\n", jarow)
    ham := strsim.Compare("abc", "ab", strsim.Hamming())
    fmt.Printf("Hamming Algorithm: %f\n", ham)

    consine := strsim.Compare("abc", "ab", strsim.Cosine())
    fmt.Printf("Consine Similarity Algorithm : %f\n", consine)
}

最终计算出来的结果也是可观

Levenshtein Algorithm: 0.666667 Dice Coefficient Algorithm: 0.666667 Jaro Algorithm: 0.888889 Jaro Winkler Algorithm: 0.888889 Hamming Algorithm: 0.666667 Consine Similarity Algorithm : 0.577350

SummerSec commented 2 years ago

simhash的难点感觉是在分词和加权，分词处理之后，加权操作目前没有很好的解决方法。分词的话，如果是中英文的都有的情况也很难处理，比例说网页的源代码。对于分词这部分，我目前简单将文本进行base64编码，然后将粗暴的以四个字符分为一组。对于加权这块，我是统计每组出现的概率进行加权。

    Levenshtein := strsim.Compare("abc", "ab")
    fmt.Printf("Levenshtein Algorithm: %f\n", Levenshtein)
    dice := strsim.Compare("abc", "ab", strsim.DiceCoefficient())
    fmt.Printf("Dice Coefficient Algorithm: %f\n", dice)
    jaro := strsim.Compare("abc", "ab", strsim.Jaro())

    fmt.Printf("Jaro Algorithm: %f\n", jaro)
    jarow := strsim.Compare("abc", "ab", strsim.JaroWinkler())
    fmt.Printf("Jaro Winkler Algorithm: %f\n", jarow)
    ham := strsim.Compare("abc", "ab", strsim.Hamming())
    fmt.Printf("Hamming Algorithm: %f\n", ham)

    consine := strsim.Compare("abc", "ab", strsim.Cosine())
    fmt.Printf("Consine Similarity Algorithm : %f\n", consine)

    simhash := strsim.Compare("abc", "ab", strsim.Simhash())
    fmt.Printf("Simhash Algorithm: %f \n", simhash)

目前没有经过大量测试，简单结果仅限参考。

Levenshtein Algorithm: 0.666667 Dice Coefficient Algorithm: 0.666667 Jaro Algorithm: 0.888889 Jaro Winkler Algorithm: 0.888889 Hamming Algorithm: 0.666667 Consine Similarity Algorithm : 0.577350 Simhash Algorithm: 0.500000

kooksee commented 1 year ago

https://en.wikipedia.org/wiki/MinHash 这个也可以集成下，我们之前用于文档的相识度，用这个比较的

guonaihong commented 1 year ago

https://en.wikipedia.org/wiki/MinHash 这个也可以集成下，我们之前用于文档的相识度，用这个比较的

可以。

zzZfans commented 1 year ago

那种算法的表现最好？

antlabs / strsim

目标和参考资料 #1

目标

莱文斯坦-编辑距离(Levenshtein)

Hamming

Dice's coefficient

Jaro

TODO

补充

参考API设计(取名)

参考选用了哪些算法名字