dleemiller / WordLlama

Things you can do with the token embeddings of an LLM
MIT License
973 stars 28 forks

First README example fails #9

Closed cpa closed 3 days ago

cpa commented 4 days ago

When running the first README example on a fresh venv, the code fails in two ways.

  1. First it complains that docs is undefined.
  2. If I add something like docs = ["hi there", "bonjour", "buongiorno", "ola"] to the code, I get the traceback below. The ValueError makes me believe that it's an internal problem with the lib, not with the way I'm using it, but I'm not totally sure.
/Users/cpa/Developer/wordsimilarity/.venv/lib/python3.11/site-packages/wordllama/algorithms/kmeans.py:30: RuntimeWarning: invalid value encountered in divide
  probabilities /= probabilities.sum()
Traceback (most recent call last):
  File "/Users/cpa/Developer/wordsimilarity/test.py", line 27, in <module>
    wl.cluster(docs, k=5, max_iterations=100, tolerance=1e-4) # labels using kmeans/kmeans++ init
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/cpa/Developer/wordsimilarity/.venv/lib/python3.11/site-packages/wordllama/inference.py", line 337, in cluster
    cluster_labels, loss = kmeans_clustering(
                           ^^^^^^^^^^^^^^^^^^
  File "/Users/cpa/Developer/wordsimilarity/.venv/lib/python3.11/site-packages/wordllama/algorithms/kmeans.py", line 84, in kmeans_clustering
    distances = compute_distances(embeddings, centroids)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "wordllama/algorithms/kmeans_helpers.pyx", line 16, in wordllama.algorithms.kmeans_helpers.compute_distances
ValueError: Buffer dtype mismatch, expected 'const double' but got 'float'
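
Both symptoms can be illustrated in plain NumPy (this is a sketch of the failure modes, not WordLlama's actual code): the RuntimeWarning comes from normalizing a probability vector whose sum is zero, and the ValueError is what happens when a buffer typed for float64 receives float32 data, which is the dtype embedding models typically emit.

```python
import numpy as np

# 1. The RuntimeWarning: dividing an all-zero probability vector by its
#    sum (0.0) produces NaNs. This can happen in kmeans++ seeding when
#    every remaining point has zero distance to the chosen centroids,
#    e.g. when k exceeds the number of distinct documents.
probabilities = np.zeros(4)
with np.errstate(invalid="ignore"):  # same "invalid value" condition as the warning
    probabilities /= probabilities.sum()
print(np.isnan(probabilities).all())  # True

# 2. The ValueError: a Cython function declared over 'const double'
#    buffers only accepts float64 arrays. float32 embeddings must be
#    cast before being passed in.
embeddings = np.random.rand(4, 64).astype(np.float32)
print(embeddings.dtype)                      # float32 -> rejected
print(embeddings.astype(np.float64).dtype)   # float64 -> accepted
```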

BTW, awesome project. I always wanted to start a "semantic equality" lib using basically the same methods, but never found the time.

For reference, the code snippet I'm talking about:

from wordllama import WordLlama

# Load the default WordLlama model
wl = WordLlama.load()

# Calculate similarity between two sentences
similarity_score = wl.similarity("i went to the car", "i went to the pawn shop")
print(similarity_score)  # Output: 0.06641249096796882

# Rank documents based on their similarity to a query
query = "i went to the car"
candidates = ["i went to the park", "i went to the shop", "i went to the truck", "i went to the vehicle"]
ranked_docs = wl.rank(query, candidates)
print(ranked_docs)
# Output:
# [
#   ('i went to the vehicle', 0.7441646856486314),
#   ('i went to the truck', 0.2832691551894259),
#   ('i went to the shop', 0.19732814982305436),
#   ('i went to the park', 0.15101404519322253)
# ]

# additional inference methods
wl.deduplicate(candidates, threshold=0.8) # fuzzy deduplication
wl.cluster(docs, k=5, max_iterations=100, tolerance=1e-4) # labels using kmeans/kmeans++ init
wl.filter(query, candidates, threshold=0.3) # filter candidates based on query
wl.topk(query, candidates, k=3) # return topk strings based on query
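
For context on the warning above: the zero-probability division typically originates in kmeans++ seeding. Below is a minimal, guarded kmeans++ initialization sketch in plain NumPy (illustrative only; it is not WordLlama's implementation, and the actual fix landed separately). The guard falls back to a uniform distribution when all selection weights are zero, and the float64 cast avoids the buffer dtype mismatch.

```python
import numpy as np

def kmeanspp_init(points, k, rng=None):
    """Pick k initial centroids with kmeans++-style D^2 weighting."""
    rng = np.random.default_rng(rng)
    # Cast up front so downstream double-typed code never sees float32.
    points = np.asarray(points, dtype=np.float64)
    centroids = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid.
        d2 = ((points[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(-1).min(axis=1)
        total = d2.sum()
        if total == 0.0:
            # All points coincide with chosen centroids; dividing by the
            # sum would yield NaNs, so fall back to uniform sampling.
            probs = np.full(len(points), 1.0 / len(points))
        else:
            probs = d2 / total
        centroids.append(points[rng.choice(len(points), p=probs)])
    return np.array(centroids)

centers = kmeanspp_init(np.random.rand(10, 3), k=3, rng=0)
print(centers.shape)  # (3, 3)
```

With the guard in place, degenerate inputs (fewer distinct points than k, or all-identical documents) seed without the invalid-divide warning instead of propagating NaNs into the distance computation.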
xrd commented 4 days ago

I tried switching docs in the wl.cluster example to candidates, and that did not work either. If I comment that line out, all the other examples work.

dleemiller commented 4 days ago

Thanks for reporting! I'll check it out later today when I have some time to work on it, and update.

dleemiller commented 3 days ago

This has been resolved by: https://github.com/dleemiller/WordLlama/pull/11

Thanks again for contributing an issue.