Closed reversingentropy closed 1 year ago
Hello, and sorry for the late reply. The fully capitalized similarity function names refer to Whoosh similarity functions and can only be chosen in the context of Progressive Entity Matching using the Whoosh algorithm. Those functions will be renamed in the next official release, or removed entirely, as the Whoosh util is deprecated. The only reason they are grouped with the conventional similarity functions is an argument-naming convention bound to the Progressive Workflow util. Some of that util's development code dependencies were included in the latest release, even though the util is not fully ready yet.
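Until the rename or removal lands, one possible workaround on the caller's side is to filter out the affected names before choosing a metric. The snippet below is only a sketch: the set of problematic names is taken from the report in this thread, and WHOOSH_ONLY_METRICS, split_metrics and check_metric are hypothetical helpers, not part of pyjedai.

```python
# Hypothetical helper, not part of pyjedai: keeps the metric names that this
# thread reports as failing away from a plain EntityMatching call.

# Names reported as problematic in this thread (assumed to be the Whoosh-only ones).
WHOOSH_ONLY_METRICS = {"PL2", "TF-IDF", "BM25F", "Frequency"}

# Metric names listed in the original report.
ALL_LISTED_METRICS = [
    "jaccard", "jaro", "edit_distance", "Frequency", "BM25F", "cosine",
    "TF-IDF", "overlap_coefficient", "generalized_jaccard", "dice", "PL2",
]


def split_metrics(metrics):
    """Split metric names into (conventional, whoosh_only), preserving order."""
    conventional = [m for m in metrics if m not in WHOOSH_ONLY_METRICS]
    whoosh_only = [m for m in metrics if m in WHOOSH_ONLY_METRICS]
    return conventional, whoosh_only


def check_metric(metric):
    """Return the metric unchanged, or raise if it is one of the reported names."""
    if metric in WHOOSH_ONLY_METRICS:
        raise ValueError(
            f"{metric!r} is reported to work only with Progressive Entity Matching"
        )
    return metric


conventional, whoosh_only = split_metrics(ALL_LISTED_METRICS)
print("usable with plain matching:", conventional)
print("reported as failing:", whoosh_only)
```

The value returned by check_metric could then be passed as the metric keyword argument mentioned in the report.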
Hi, thank you for the reply and the clarification!
Not sure if this is a helpful replacement (TF-IDF) in matching.py:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(entity1, entity2):
    entity1 = entity1.lower()
    entity2 = entity2.lower()
    # Initialize and fit the TfidfVectorizer
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([entity1, entity2])
    # Calculate cosine similarity between the two documents
    similarity_score = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0]
    return similarity_score

ent1 = "Linksys EtherFast 8-Port 10/100 Switch - EZXS88W Linksys EtherFast 8-Port 10/100 Switch - EZXS88W/ 10/100 Dual-Speed Per-Port/ Perfect For Optimizing 10BaseT And 100BaseTX Hardware On The Same Network/ Speeds Of Up To 200Mbps In Full Duplex Operation/ Eliminate Bandwidth Constraints And Clear Up Bottlenecks $44.00"
ent2 = "Linksys EtherFast EZXS88W Ethernet Switch - EZXS88W Linksys EtherFast 8-Port 10/100 Switch (New/Workgroup) LINKSYS"
similarity = calculate_similarity(ent1, ent2)
print("TF-IDF similarity between the entities:", similarity)
We are always open to new ideas and corrections to the already deployed code, and we will definitely have a look at this. Don't hesitate to set up your own branch and send us pull requests. We will review them and may include some of your solutions in the framework.
Hi, if I understand correctly, you are suggesting that we change the similarity method. We use pairwise_distances from sklearn, which supports jaccard, dice, and cosine, and as far as I tested it is the same implementation. Please clarify if you are proposing something different.
Thank you in any case!
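For anyone who wants to check the pairwise_distances point themselves, here is a small sketch. It is not the pyjedai code: the two example strings and the variable names are made up for illustration, and the only assumption is that a similarity is obtained as 1 minus the sklearn distance.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity

# Two short, made-up entity descriptions (4 tokens each, 3 shared).
docs = [
    "linksys etherfast workgroup switch",
    "linksys etherfast ethernet switch",
]

# Cosine: the direct similarity and 1 - cosine distance should agree.
tfidf = TfidfVectorizer().fit_transform(docs)
cos_direct = cosine_similarity(tfidf[0], tfidf[1])[0][0]
cos_via_distance = 1.0 - pairwise_distances(tfidf, metric="cosine")[0, 1]

# Jaccard and dice are boolean metrics, so use a dense boolean
# token-presence matrix instead of TF-IDF weights.
binary = CountVectorizer(binary=True).fit_transform(docs).toarray().astype(bool)
jaccard_sim = 1.0 - pairwise_distances(binary, metric="jaccard")[0, 1]
dice_sim = 1.0 - pairwise_distances(binary, metric="dice")[0, 1]

print("cosine (direct):      ", cos_direct)
print("cosine (1 - distance):", cos_via_distance)
print("jaccard similarity:   ", jaccard_sim)  # 3 shared / 5 distinct tokens, about 0.6
print("dice similarity:      ", dice_sim)     # 2*3 / (4 + 4), about 0.75
```

The two cosine values match up to floating-point error, which is consistent with the reply above: the TF-IDF snippet and the cosine option of pairwise_distances compute the same quantity when they are given the same vectors.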
Hi, there are issues with the entity matching portion. As seen from the tutorials...
As far as I understand, EntityMatching (imported via "from pyjedai.matching import EntityMatching") has a keyword argument metric. It accepts ['jaccard', 'jaro', 'edit_distance', 'Frequency', 'BM25F', 'cosine', 'TF-IDF', 'overlap_coefficient', 'generalized_jaccard', 'dice', 'PL2'], which are string matching algorithms.
The algorithms that have issues are ['PL2', 'TF-IDF', 'BM25F', 'Frequency']. This is the error I get if I type metric = 'PL2'.