AI-team-UoA / pyJedAI

An open-source library that leverages Python’s data science ecosystem to build powerful end-to-end Entity Resolution workflows.
https://pyjedai.readthedocs.io
Apache License 2.0

Entity Matching metrics get sim score error #11

Closed reversingentropy closed 1 year ago

reversingentropy commented 1 year ago

Hi, there are issues with the entity matching portion. As seen from the tutorials...

image

As far as I understand, EntityMatching (via "from pyjedai.matching import EntityMatching") takes a keyword argument metric. It accepts ['jaccard', 'jaro', 'edit_distance', 'Frequency', 'BM25F', 'cosine', 'TF-IDF', 'overlap_coefficient', 'generalized_jaccard', 'dice', 'PL2'], which are string matching algorithms.

The algorithms that have issues are ['PL2', 'TF-IDF', 'BM25F', 'Frequency']. This is the error I get if I set metric = 'PL2':

image

JacobMaciejewski commented 1 year ago

Hello, and sorry for the late reply. The fully capitalized similarity function names refer to Whoosh similarity functions and can only be chosen in the context of Progressive Entity Matching using the Whoosh algorithm. These functions will be renamed in the next official release, or removed entirely, as the Whoosh util is deprecated. The only reason they are grouped together with the conventional similarity functions is an argument naming convention tied to the Progressive Workflow util. Some of its development code dependencies were included in the latest release, even though the util is not fully ready yet.
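Until the Whoosh names are renamed or removed, a caller could guard against picking one of them by accident. The following is a minimal sketch, not part of pyJedAI: the helper `check_metric` and the constant names are hypothetical, and the metric lists are copied from the ones quoted earlier in this thread.

```python
# Hypothetical guard, not a pyJedAI API. Metric names are the ones
# listed in this thread for the `metric` keyword of EntityMatching.
ALL_METRICS = [
    "jaccard", "jaro", "edit_distance", "Frequency", "BM25F", "cosine",
    "TF-IDF", "overlap_coefficient", "generalized_jaccard", "dice", "PL2",
]

# Names reported above as Whoosh similarity functions, which only apply
# to Progressive Entity Matching with the Whoosh algorithm.
WHOOSH_ONLY_METRICS = {"PL2", "TF-IDF", "BM25F", "Frequency"}


def check_metric(metric: str) -> str:
    """Return `metric` unchanged if it is a conventional similarity function."""
    if metric in WHOOSH_ONLY_METRICS:
        raise ValueError(
            f"'{metric}' is a Whoosh similarity function and is not usable "
            "for regular entity matching"
        )
    if metric not in ALL_METRICS:
        raise ValueError(f"Unknown metric '{metric}'")
    return metric


if __name__ == "__main__":
    print(check_metric("jaccard"))
    try:
        check_metric("PL2")
    except ValueError as err:
        print("rejected:", err)
```

The checked name can then be passed on as the metric argument, so an unsupported choice fails early with a readable message.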

reversingentropy commented 1 year ago

Hi, thank you for replying and for the clarification!

Not sure if this would be a helpful replacement for TF-IDF in matching.py:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(entity1, entity2):
    # Convert entities to lowercase
    entity1 = entity1.lower()
    entity2 = entity2.lower()

    # Initialize and fit the TfidfVectorizer
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([entity1, entity2])

    # Calculate cosine similarity between the two documents
    similarity_score = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])[0][0]

    return similarity_score

ent1 = "Linksys EtherFast 8-Port 10/100 Switch - EZXS88W Linksys EtherFast 8-Port 10/100 Switch - EZXS88W/ 10/100 Dual-Speed Per-Port/ Perfect For Optimizing 10BaseT And 100BaseTX Hardware On The Same Network/ Speeds Of Up To 200Mbps In Full Duplex Operation/ Eliminate Bandwidth Constraints And Clear Up Bottlenecks $44.00"

ent2 = "Linksys EtherFast EZXS88W Ethernet Switch - EZXS88W Linksys EtherFast 8-Port 10/100 Switch (New/Workgroup) LINKSYS"

similarity = calculate_similarity(ent1, ent2)

print("TF-IDF similarity between the entities:", similarity)

Result: TF-IDF & cosine similarity between the entities: 0.46203393546758753

JacobMaciejewski commented 1 year ago

We are always open to new ideas and corrections on the already deployed code. We will definitely have a look at it. Don't hesitate to set up your own branch and send us pull requests. We will review it and may include some of your own solutions in the framework.

Nikoletos-K commented 1 year ago

Hi, if I understand correctly, you are suggesting that we change the similarity method. We use pairwise_distances from sklearn, which supports jaccard, dice and cosine, and as far as I tested it is the same implementation. Please clarify if you are proposing something different.

Thank you in any case!
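The equivalence mentioned above can be checked with a small script. This is only a sketch under stated assumptions: the two toy strings and the boolean vectors are made up for illustration, and it exercises scikit-learn directly rather than pyJedAI's own code path. It shows that `1 - pairwise_distances(..., metric="cosine")` gives the same value as `cosine_similarity` on TF-IDF vectors, and that the same function also accepts `jaccard` and `dice` on boolean vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import pairwise_distances
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents, invented for this illustration.
docs = [
    "linksys etherfast 8-port switch",
    "linksys etherfast ethernet switch workgroup",
]
tfidf = TfidfVectorizer().fit_transform(docs)

# Similarity computed directly, as in the snippet proposed in this thread.
sim_direct = cosine_similarity(tfidf[0], tfidf[1])[0][0]

# Similarity derived from the cosine distance matrix.
cosine_dist = pairwise_distances(tfidf, metric="cosine")
sim_from_distance = 1.0 - cosine_dist[0, 1]

# Jaccard and dice distances on dense boolean vectors (toy data).
bool_vectors = np.array([[1, 1, 0, 1], [1, 0, 1, 1]], dtype=bool)
jaccard_dist = pairwise_distances(bool_vectors, metric="jaccard")
dice_dist = pairwise_distances(bool_vectors, metric="dice")

if __name__ == "__main__":
    print("cosine_similarity:       ", sim_direct)
    print("1 - pairwise_distances:  ", sim_from_distance)
    print("same value:", np.isclose(sim_direct, sim_from_distance))
    print("jaccard distance:", jaccard_dist[0, 1])
    print("dice distance:   ", dice_dist[0, 1])
```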