ChenghaoMou / text-dedup

All-in-one text de-duplication
Apache License 2.0
604 stars 69 forks

how to dedup short text? #103

Open varuy322 opened 1 day ago

varuy322 commented 1 day ago

Hi there,

When I use MinHash with LSH, or SimHash, it is hard to remove near-duplicate short texts. Could anybody suggest a useful method to solve this problem? Thanks a ton!

Take the example below and walk through the process:

  1. With ngram_size set to 5, the Jaccard similarity is 0.53; with an n-gram size of 13, the similarity is 0.21.

  2. With a threshold of 0.7, both texts below are kept.

text_1 = "世界经济史是一部基于假象和谎言的连续剧。要获得财富,做法就是认清其假象,投入其中,然后在假象被公众认识之前退出游戏。\nby某个名字都不能说的人。"
text_2 = "世界经济史是一部基于假象和谎言的连续剧。要想获得财富,做法就是认清其假象,投入其中,然后在假象被公众认识之前退出游戏。——索罗斯"

ChenghaoMou commented 1 day ago

I think for Chinese text, which unlike English has no spaces separating words, it is safe to use a smaller n-gram size such as 2 or 3. (Modify the n-gram code if necessary; for example, you could tokenize with jieba instead.)
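To illustrate why a smaller n-gram size is more forgiving on short text, here is a minimal sketch. The helper name `char_ngram_jaccard` and the toy strings are made up for illustration and are not part of text-dedup. The idea is that a single inserted character changes up to n character n-grams, so the larger n is, the more of a short text's n-grams one small edit wipes out.

```python
def char_ngram_jaccard(a: str, b: str, n: int) -> float:
    """Jaccard similarity over sets of character n-grams (hypothetical helper)."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    union = grams_a | grams_b
    # Two texts shorter than n produce no n-grams at all; report 0.0 then.
    return len(grams_a & grams_b) / len(union) if union else 0.0


# Toy pair (made up for illustration): 20 distinct letters, and the same
# string with a single character inserted in the middle.
original = "abcdefghijklmnopqrst"
edited = "abcdefghijXklmnopqrst"

for n in (1, 2, 3, 5, 8):
    print(f"n={n}: {char_ngram_jaccard(original, edited, n):.3f}")
```

For this toy pair the score is 18/21 (about 0.857) at n = 2 but only 12/21 (about 0.571) at n = 5, even though the two strings differ by one character.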

With the following test code:

def jaccard_similarity_ngrams(str1, str2, n):
    """Calculate Jaccard similarity between two strings based on n-grams.

    Args:
        str1 (str): First input string
        str2 (str): Second input string
        n (int): Size of n-grams

    Returns:
        float: Jaccard similarity score between 0 and 1
    """
    # Convert strings to lowercase and remove non-alphanumeric characters
    str1 = [char.lower() for char in str1 if char.isalnum()]
    str2 = [char.lower() for char in str2 if char.isalnum()]

    # Generate n-grams for both strings
    ngrams1 = set(''.join(str1[i:i+n]) for i in range(len(str1) - n + 1))
    ngrams2 = set(''.join(str2[i:i+n]) for i in range(len(str2) - n + 1))

    # Calculate intersection and union of n-grams
    intersection = ngrams1.intersection(ngrams2)
    union = ngrams1.union(ngrams2)

    # Calculate Jaccard similarity
    if len(union) == 0:
        return 0.0  # Handle empty sets

    similarity = len(intersection) / len(union)
    return similarity

# Example usage
text_1 = "世界经济史是一部基于假象和谎言的连续剧。要获得财富,做法就是认清其假象,投入其中,然后在假象被公众认识之前退出游戏。\nby某个名字都不能说的人。"
text_2 = "世界经济史是一部基于假象和谎言的连续剧。要想获得财富,做法就是认清其假象,投入其中,然后在假象被公众认识之前退出游戏。——索罗斯"
n = 2  # Using bigrams

similarity = jaccard_similarity_ngrams(text_1, text_2, n)
print(f"Jaccard similarity: {similarity:.4f}")

This prints 0.7313 for n = 2, and you get 0.7101 with n = 3; both are above the 0.7 threshold. Short text has always been tricky to process, so it is better to tune the parameters and settings on a sample set before running the de-duplication on the entire dataset.
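One way to do that tuning is sketched below, assuming you can hand-label a small set of pairs as duplicate or not. The sample pairs, the helper names (`char_ngrams`, `jaccard`, `accuracy`), and the candidate grids are all hypothetical, and the sketch scores pairs with exact Jaccard rather than MinHash estimates, so treat the result as a starting point rather than a guarantee of how the full pipeline will behave.

```python
from itertools import product


def char_ngrams(text: str, n: int) -> set:
    """Lowercased character n-grams, ignoring non-alphanumeric characters."""
    chars = [c.lower() for c in text if c.isalnum()]
    return {"".join(chars[i:i + n]) for i in range(len(chars) - n + 1)}


def jaccard(a: str, b: str, n: int) -> float:
    """Exact Jaccard similarity between the n-gram sets of two texts."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 0.0


def accuracy(sample, n: int, threshold: float) -> float:
    """Fraction of labeled pairs classified correctly at this (n, threshold)."""
    hits = sum((jaccard(a, b, n) >= threshold) == label for a, b, label in sample)
    return hits / len(sample)


# Hypothetical hand-labeled sample: (text_a, text_b, is_duplicate).
# Replace with pairs drawn from your own dataset.
SAMPLE = [
    ("the quick brown fox jumps", "the quick brown fox jumped", True),
    ("hello world, good morning", "Hello world good morning!", True),
    ("the quick brown fox jumps", "a slow green turtle sleeps", False),
    ("order number 12345 shipped", "invoice number 98765 paid", False),
]

# Sweep a small grid and report how each setting does on the sample.
for n, threshold in product((2, 3, 5), (0.5, 0.7, 0.9)):
    score = accuracy(SAMPLE, n, threshold)
    print(f"n={n} threshold={threshold}: accuracy={score:.2f}")
```

With a real sample you would likely want precision and recall separately, since a wrongly removed text and a missed duplicate usually carry different costs.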