Cosine Similarity Algorithm
Overview
Introduces a new implementation of the cosine similarity algorithm in the Cosine_Similarity class. Cosine similarity is a widely used metric in natural language processing and information retrieval to measure the similarity between two texts based on their vector representations.
Key Features
Vector Representation: Utilizes SpaCy's pre-trained word embeddings to convert text into vectors (see the sketch after this list).
Tokenization: Breaks the input text into lowercased tokens, excluding punctuation.
Vectorization: Converts tokens into their corresponding vectors using SpaCy's embeddings.
Mean Vector Calculation: Computes the mean vector of a set of word vectors to represent the overall text.
Cosine Similarity Calculation: Measures the cosine of the angle between two vectors, yielding a similarity score between -1 and 1.
Cosine Similarity Percentage: Reports the similarity score as a percentage for easier interpretation.
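A minimal sketch of how the tokenization, vectorization, and mean-vector steps might fit together with SpaCy. The en_core_web_md model name and the standalone helper functions are assumptions for illustration, not the exact implementation in the class:

```python
import numpy as np
import spacy

# Assumes a spaCy model with word vectors is installed:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

def tokenize(text: str) -> list[str]:
    """Lowercase tokens from the text, punctuation excluded."""
    return [token.text.lower() for token in nlp(text) if not token.is_punct]

def vectorize(tokens: list[str]) -> list[np.ndarray]:
    """Look up each token's pre-trained word vector."""
    return [nlp.vocab[token].vector for token in tokens]

def mean_vector(vectors: list[np.ndarray]) -> np.ndarray:
    """Average the word vectors to represent the whole text as one vector."""
    return np.mean(vectors, axis=0)

tokens = tokenize("Cosine similarity compares two texts.")
text_vector = mean_vector(vectorize(tokens))
```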
Mathematical Foundation
Dot Product: Measures the degree of alignment between two vectors.
Magnitude (Norm): Computes the length of a vector.
Cosine Similarity Formula:
Cosine Similarity = (Dot Product) / (Magnitude_1 * Magnitude_2)
where the result lies between -1 and 1, with 1 indicating vectors pointing in the same direction, 0 indicating orthogonal vectors, and -1 indicating exactly opposite vectors.
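As a concrete sketch of the formula above, computed with NumPy independently of the class (the function name here is illustrative):

```python
import numpy as np

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Cosine of the angle between v1 and v2: dot product over the product of magnitudes."""
    dot = np.dot(v1, v2)                                  # degree of alignment
    magnitudes = np.linalg.norm(v1) * np.linalg.norm(v2)  # product of vector lengths
    return float(dot / magnitudes)

# Same direction -> 1.0, orthogonal -> 0.0, opposite -> -1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))   # 1.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))   # 0.0
print(cosine_similarity(np.array([1.0, 0.0]), np.array([-1.0, 0.0])))  # -1.0
```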
Usage
The Cosine_Similarity class provides methods to Tokenize, Vectorize, and calculate the Cosine Similarity between two pieces of text. It includes:
Tokenize(text): Tokenizes the input text into lowercase tokens.
Vectorize(tokens): Converts tokens into vector representations.
Mean_Vector(vectors): Computes the average vector of a list of vectors.
Dot_Product(vector1, vector2): Calculates the dot product of two vectors.
Magnitude(vector): Computes the magnitude of a vector.
Cosine_Similarity(vector1, vector2): Computes the cosine similarity between two vectors.
Cosine_Similarity_Percentage(text1, text2): Calculates the similarity percentage between two texts.
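A possible call pattern, assuming the class exposes the methods exactly as listed above; the constructor arguments (if any) are not specified in this description, so the no-argument constructor below is an assumption:

```python
# Hypothetical usage; the exact constructor signature is an assumption.
similarity = Cosine_Similarity()

text1 = "Machine learning models learn patterns from data."
text2 = "Statistical models discover patterns in data."

percentage = similarity.Cosine_Similarity_Percentage(text1, text2)
print(f"Similarity: {percentage:.2f}%")
```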
Error Handling
Robust error handling is implemented for all operations to ensure reliability. Any issues encountered during tokenization, vectorization, or similarity calculation are logged and raised appropriately.
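One common way to structure such handling, shown here as a sketch rather than the exact code in the class:

```python
import logging

import spacy

logger = logging.getLogger(__name__)
nlp = spacy.load("en_core_web_md")  # assumed model with word vectors

def vectorize(tokens):
    """Convert tokens to word vectors; log and re-raise any failure."""
    try:
        return [nlp.vocab[token].vector for token in tokens]
    except Exception as error:
        logger.error("Vectorization failed: %s", error)
        raise
```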
Benefits
Provides an effective method for comparing textual content.
Leverages pre-trained embeddings for accurate and efficient similarity measurement.
Can be used in various applications including document similarity, search relevance, and recommendation systems.