Open varuy322 opened 1 day ago
I think for Chinese text, which unlike English has no spaces between words, it is safe to use a smaller n-gram size such as 2 or 3. (Modify the n-gram code if necessary; for example, you could segment words with jieba instead of using character n-grams.)
With the following test code:
def jaccard_similarity_ngrams(str1, str2, n):
    """
    Calculate Jaccard similarity between two strings based on n-grams.

    Args:
        str1 (str): First input string
        str2 (str): Second input string
        n (int): Size of n-grams

    Returns:
        float: Jaccard similarity score between 0 and 1
    """
    # Convert strings to lowercase and remove non-alphanumeric characters
    str1 = [char.lower() for char in str1 if char.isalnum()]
    str2 = [char.lower() for char in str2 if char.isalnum()]
    # Generate n-grams for both strings
    ngrams1 = set(''.join(str1[i:i+n]) for i in range(len(str1) - n + 1))
    ngrams2 = set(''.join(str2[i:i+n]) for i in range(len(str2) - n + 1))
    print(ngrams1)
    print(ngrams2)
    # Calculate intersection and union of n-grams
    intersection = ngrams1.intersection(ngrams2)
    union = ngrams1.union(ngrams2)
    # Calculate Jaccard similarity
    if len(union) == 0:
        return 0.0  # Handle empty sets
    similarity = len(intersection) / len(union)
    return similarity

# Example usage
text_1 = "世界经济史是一部基于假象和谎言的连续剧。要获得财富,做法就是认清其假象,投入其中,然后在假象被公众认识之前退出游戏。\nby某个名字都不能说的人。"
text_2 = "世界经济史是一部基于假象和谎言的连续剧。要想获得财富,做法就是认清其假象,投入其中,然后在假象被公众认识之前退出游戏。——索罗斯"
n = 2 # Using bigrams
similarity = jaccard_similarity_ngrams(text_1, text_2, n)
print(f"Jaccard similarity: {similarity:.4f}")
This gives 0.7313 with n = 2, and 0.7101 with n = 3. Short text has always been tricky to process, so it is better to tune the parameters and settings on a sample set before running de-duplication on the entire dataset.
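To make the "tune on a sample set" advice concrete, here is a minimal sketch of a parameter sweep. It is only an illustration: the helper names (`char_ngrams`, `jaccard`, `sweep`) and the two labeled pairs are made up for this example, and a real sample should be much larger and drawn from your own data. It uses exact Jaccard rather than MinHash, so it only shows which n-gram size and threshold separate your pairs before approximation error is added.

```python
def char_ngrams(text, n):
    """Character n-grams over lowercased alphanumeric characters."""
    chars = [c.lower() for c in text if c.isalnum()]
    return {"".join(chars[i:i + n]) for i in range(len(chars) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sweep(pairs, sizes, thresholds):
    """Return {(n, threshold): accuracy} over hand-labeled pairs.

    Each pair is (text_a, text_b, is_duplicate).
    """
    results = {}
    for n in sizes:
        scores = [
            (jaccard(char_ngrams(a, n), char_ngrams(b, n)), is_dup)
            for a, b, is_dup in pairs
        ]
        for t in thresholds:
            correct = sum((score >= t) == is_dup for score, is_dup in scores)
            results[(n, t)] = correct / len(pairs)
    return results


# Hypothetical hand-labeled sample: one near-duplicate pair, one unrelated pair.
sample_pairs = [
    ("今天天气很好,我们去公园散步吧。", "今天天气很好,我们一起去公园散步吧。", True),
    ("今天天气很好,我们去公园散步吧。", "明天要下雨,记得带伞出门。", False),
]

results = sweep(sample_pairs, sizes=[2, 3, 5], thresholds=[0.5, 0.7])
for (n, t), accuracy in sorted(results.items()):
    print(f"n={n} threshold={t}: accuracy={accuracy:.2f}")
```

On this toy sample the near-duplicate pair clears a 0.7 threshold with bigrams but not with 5-grams, which is the same pattern as in the question: the longer the n-gram, the more n-grams a single small edit destroys in a short text.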
Hi there,
When I use MinHash with LSH, or SimHash, it's hard to remove near-duplicate short texts. Could anybody suggest a useful method to solve this problem? Thanks a ton!
Take the example below and walk through the process:
With ngram_size set to 5, the Jaccard similarity is 0.53; with an n-gram size of 13, the similarity is 0.21.
With a threshold of 0.7, both texts below will be kept.