Kalkwst / MicroLib

MIT License

:sparkles:Text: Add Cosine Similarity and Distance calculation algorithms #2

Closed Kalkwst closed 1 year ago

Kalkwst commented 1 year ago

Add the implementation of the Cosine Similarity and Distance algorithms.

Cosine similarity is a measure of how alike two strings are. It is calculated as the cosine of the angle between the two strings in a vector space, where each string is represented as a vector of term frequencies. Cosine distance is defined as 1 minus the cosine similarity, so a smaller cosine distance indicates a higher degree of similarity between the two strings.

Cosine distance is often used in natural language processing and information retrieval tasks to measure the similarity between documents or text strings. It is particularly useful for comparing strings that have different lengths, because it depends only on the direction of the term frequency vectors and not on their magnitude: what matters is the relative frequency of each term, not its absolute count.
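The length-insensitivity described above can be seen in a small sketch. The helper name `cosine_similarity` and the sample count vectors are illustrative assumptions, not part of MicroLib.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical term counts for a short text and for the same text repeated three times.
short_doc = [1, 2, 0]
long_doc = [3, 6, 0]

# Both vectors point in the same direction, so the similarity is 1 (up to rounding).
print(math.isclose(cosine_similarity(short_doc, long_doc), 1.0))  # True
```

Scaling every count by the same factor leaves the angle unchanged, which is why a long document and a short one with the same term proportions compare as equal.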

To calculate the cosine distance and similarity, the algorithm first converts the strings into vectors of term frequencies. This involves tokenizing each string into individual words and then counting the number of occurrences of each term. The resulting vectors can then be used to calculate the cosine distance using the following formula:

cosine distance = 1 - (A · B) / (||A|| ||B||)

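Here A and B are the term frequency vectors for the two strings, the numerator is their dot product, and ||A|| and ||B|| are the magnitudes of the vectors. Because term frequencies are never negative, the cosine distance always lies between 0 and 1. A value of 0 means the two strings contain the same terms in the same proportions, which covers identical strings but also strings that differ only in word order or repetition. A value of 1 means the two strings share no terms at all. The formula is undefined when either string contains no terms, since its vector then has zero magnitude.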
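The steps above (tokenize, count terms, apply the formula) could be sketched as follows. This is a minimal illustration under stated assumptions, not the MicroLib implementation: the function names are hypothetical, tokens are taken to be lowercase runs of word characters, and a string with no tokens is assigned a similarity of 0 to avoid dividing by zero.

```python
import math
import re
from collections import Counter

def term_frequencies(text):
    """Lowercase the text, split it into word tokens, and count each term."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(left, right):
    tf_left = term_frequencies(left)
    tf_right = term_frequencies(right)
    # Terms missing from the other string count as 0, so only shared terms add to the dot product.
    dot = sum(count * tf_right[term] for term, count in tf_left.items())
    norm_left = math.sqrt(sum(c * c for c in tf_left.values()))
    norm_right = math.sqrt(sum(c * c for c in tf_right.values()))
    if norm_left == 0 or norm_right == 0:
        # Assumption: a string without tokens is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_left * norm_right)

def cosine_distance(left, right):
    return 1.0 - cosine_similarity(left, right)

print(round(cosine_distance("the cat sat", "the dog ran"), 3))  # 0.667
print(cosine_distance("apple banana", "cherry date"))           # 1.0
print(round(cosine_distance("a b", "b a"), 3))                  # 0.0
```

The last call shows that word order is ignored: "a b" and "b a" produce the same term frequency vector, so their distance is 0 even though the strings differ.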