-
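So I compute the simhash value of each piece of content this way and then compare them? My site is written in PHP, and it is already installed on the server. How do I calculate the similarity between two values? Is there PHP code for this calculation?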
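For reference, the usual way to compare two simhash fingerprints is the Hamming distance: XOR the two values and count the bits that differ. Below is a minimal sketch in Python rather than PHP. The function names and the 64-bit fingerprint width are assumptions for illustration and are not taken from the library discussed here.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def simhash_similarity(a: int, b: int, bits: int = 64) -> float:
    """Map the Hamming distance to a score in [0, 1].

    `bits` is the assumed fingerprint width; adjust it to match
    the width your simhash implementation actually produces.
    """
    return 1.0 - hamming_distance(a, b) / bits
```

A common approach is to treat two documents as near-duplicates when the distance falls below a small threshold, which has to be tuned for the data. The same XOR-and-count idea should carry over to PHP, provided the fingerprints fit in the platform's integer type.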
-
This was a desired feature initially to make sure the underlying disassembly is good,
but makes tweaking / improving / adding new features difficult without breaking the
tests.
Particular culpri…
-
Before starting work, we need to research existing tools/libraries for removing fuzzy duplicate strings that someone has already written. It would also be useful to read up right away on the kinds of hashes (MinHash, S…
-
Near deduplication #7 only operates at the file level. It is also possible for a file to be
1. a substring of another file, while the minhash/simhash fingerprints are wildly different
2. composed o…
-
We currently have a bunch of known blockpages that should be added to the pipeline (see: `fingerprintdb` label).
We should have tooling and documentation on how to add a blockpage fingerprint to th…
-
Traceback (most recent call last):
  File "src/simhash_imp.py", line 191, in
    feature_vec = [(int(item.split(':')[0]),float(item.split(':')[1])) for item in feature_vec]
ValueError: could not conv…
cxzhp updated
5 years ago
-
Hi! Thank you for this code, I've been studying it thoroughly and it is a very useful and helpful companion to the theory and algorithm sketches found in the MMD book. I have a few questions about som…
-
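I found this by setting a breakpoint in the middle of the similarity calculation: the simHash computed for each character is at most 42 bits wide, so only the first 42 bits of the vector calculation are effective. Perhaps the hash algorithm needs to be replaced?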
-
Thanks for this project! Is it time for a PyPI release? The currently published version isn't compatible with Python 3, but the GitHub version is working for me (tested `compute()` running on Python 3.7…
-
The manpage says:
> The algorithm used by simhash is Manassas' "shingleprinting" algorithm (see BIBLIOGRAPHY below): take a hash of every m-byte subsequence of the file, and retain the n of these h…