MartinoMensio / spacy-universal-sentence-encoder

Google USE (Universal Sentence Encoder) for spaCy
MIT License

Question on certain use case: Checklist Deduplication #20

Open BradKML opened 2 years ago

BradKML commented 2 years ago

Given a list of sentences and words that I want to deduplicate, what is the best way to automatically eliminate duplicate items (different wordings of the same item)?

BradKML commented 2 years ago
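import numpy as np
import spacy_universal_sentence_encoder
from sklearn.cluster import AgglomerativeClustering

nlp = spacy_universal_sentence_encoder.load_model('en_use_lg')

# Read the checklist, dropping blank lines and trailing newlines.
with open('file.txt') as f:
    lines = [line.strip() for line in f if line.strip()]

# One sentence embedding per checklist item.
vectors = np.array([nlp(line).vector for line in lines])

# Number of clusters; it cannot exceed the number of items.
k = min(256, len(lines))

# Ward linkage only works with Euclidean distance, so no metric argument is
# needed (newer scikit-learn releases replaced `affinity` with `metric`).
cluster = AgglomerativeClustering(n_clusters=k, linkage='ward')
labels = cluster.fit_predict(vectors)

# Group the items by cluster label.
groups = [[] for _ in range(k)]
for line, label in zip(lines, labels):
    groups[label].append(line)

# Print each cluster and write it to a file, one item per line,
# with a blank line between clusters.
with open('myfile.txt', 'w') as out:
    for group in groups:
        print(*group, sep='\n')
        print()
        out.write('\n'.join(group) + '\n\n')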
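
The clustering above needs a fixed k, which is hard to guess for a checklist of unknown redundancy. One possible alternative, sketched here under assumptions, is a greedy pass that keeps an item only when its cosine similarity to every already-kept item stays below a threshold. The names `cosine_similarity` and `deduplicate`, the threshold of 0.8, and the toy 2-D vectors are all hypothetical and not part of this library; in practice the vectors would come from something like `nlp(item).vector`, and the threshold would need tuning on real data.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # treat zero vectors as similar to nothing
    return dot / (norm_u * norm_v)


def deduplicate(items, vectors, threshold=0.8):
    """Greedy pass: keep an item only if it is below `threshold`
    similarity to every item kept so far.

    `threshold` is an assumed starting value, not a recommended one.
    """
    kept_items, kept_vectors = [], []
    for item, vec in zip(items, vectors):
        if all(cosine_similarity(vec, kv) < threshold for kv in kept_vectors):
            kept_items.append(item)
            kept_vectors.append(vec)
    return kept_items


# Toy 2-D vectors stand in for real sentence embeddings.
items = ["buy milk", "purchase milk", "call the bank"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
unique_items = deduplicate(items, vectors)
print(unique_items)
```

The first occurrence of each item wins, so the result depends on input order. The pass compares every item against all kept items, which is quadratic in the worst case and may be slow for very long lists.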