chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: Cosine distance greater than 1 #1214

Open gsakkis opened 1 year ago

gsakkis commented 1 year ago

What happened?

I've observed distances greater than one when using cosine distance with the default embedding function. Aren't the vectors normalized?

Versions

Chroma 0.4.13, Python 3.8, macOS 13.2

HammadB commented 1 year ago

Could you share the distances you are seeing, and perhaps a minimal repro?

The normalization occurs here - https://github.com/chroma-core/chroma/blob/e357ef397b3324edec4104756155a8836b03ed65/chromadb/utils/embedding_functions.py#L339

gsakkis commented 1 year ago

I'm afraid I can't share the documents, but here are the vectors: xy.tar.gz

Repro:

In [1]: from chromadb.utils.distance_functions import cosine

In [2]: import numpy as np

In [3]: x = np.load("x.npy")

In [4]: y = np.load("y.npy")

In [5]: cosine(x, y)
Out[5]: 1.046373932311302

gsakkis commented 1 year ago

I suspect it may be caused by the conversion of numpy arrays to lists:

(Pdb++) x1 = self._forward(texts)
(Pdb++) np.linalg.norm(x1)
1.0
(Pdb++) x2 = x1.tolist()
(Pdb++) np.linalg.norm(x2)
1.000000000800026

HammadB commented 1 year ago

Yes, are all your distances relatively close to one? This may be a numerical stability issue.

gsakkis commented 1 year ago

I haven't run a huge number of queries, but based on a small sample the largest distance I've seen was around 1.1 (1.098825). So it's close to 1, but the deviation is many orders of magnitude larger than the one in np.linalg.norm(x2) above.
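The size of the tolist() effect can be checked in isolation. The following is a minimal sketch, assuming the embeddings are float32 arrays normalized to unit length (the random vector and its dimension of 384 are made up for illustration, not taken from the attachment). Under that assumption, tolist() converts each value exactly, and the norm only looks different because it is then evaluated in float64 instead of float32.

```python
import numpy as np

# Assumption: embeddings are float32 arrays normalized to unit length.
# The vector below is random illustrative data, not the attached x.npy.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(384).astype(np.float32)
x1 /= np.linalg.norm(x1)

# tolist() turns each float32 into a Python float (float64) without changing its value.
x2 = x1.tolist()
values_unchanged = np.array_equal(
    np.asarray(x2, dtype=np.float64), x1.astype(np.float64)
)

norm_f32 = np.linalg.norm(x1)  # evaluated and rounded in float32
norm_f64 = np.linalg.norm(x2)  # same values, evaluated in float64

print("values unchanged by tolist():", values_unchanged)
print("norm dtype before tolist():", norm_f32.dtype)
print("deviation of float64 norm from 1:", abs(norm_f64 - 1.0))
```

If the assumption holds, the deviation stays near float32 rounding precision, which is far too small to account for a distance of 1.05 or 1.1.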

cryans commented 1 year ago

I don't think this is a bug in chromadb, but rather a numerical stability issue, as HammadB guesses (I'm guessing too).

I tried some different cosine functions.

import numpy as np
from scipy.spatial import distance

from chromadb.utils.distance_functions import cosine

# Vectors from the xy.tar.gz attachment above
x = np.load("x.npy")
y = np.load("y.npy")

# Three independent implementations of cosine distance (1 - cosine similarity)
print(f'chromadb cosine: {cosine(x, y)}')
print(f'scipy.distance cosine: {distance.cosine(x, y)}')
print(f'"raw np" cosine: {1 - np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))}')

Output:

chromadb cosine: 1.046373932311302
scipy.distance cosine: 1.046373932311302
"raw np" cosine: 1.046373932311302

I suggest closing the bug.

gsakkis commented 1 year ago

The above examples are irrelevant; the point is that chromadb should not generate vectors that have a dot product above 1. As I mentioned, the issue is probably caused by converting numpy arrays to Python lists, but it needs further investigation.

If you're ok with a 10% numerical error, feel free to close this.

cryans commented 1 year ago

I understand your point.

Cosine similarity should always be in [-1, 1] -- that follows from its mathematical definition.

There are three different approaches here (all wrong in terms of the cosine definition, but all agreeing numerically) -- can you see a way to fix it?
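For reference, here is a small sketch of the range involved, using the same expression as the "raw np" line above (1 minus cosine similarity). With similarity in [-1, 1], that expression falls in [0, 2], so a value slightly above 1 corresponds to a slightly negative similarity. The vectors and the helper name cosine_distance are made up for illustration; whether this accounts for the values reported in this issue is not established here.

```python
import numpy as np

def cosine_distance(a, b):
    # Hypothetical helper: same formula as the "raw np" expression, 1 - cosine similarity.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 0.0])

same_direction = cosine_distance(a, np.array([2.0, 0.0]))    # similarity +1
orthogonal = cosine_distance(a, np.array([0.0, 3.0]))        # similarity 0
opposite = cosine_distance(a, np.array([-1.0, 0.0]))         # similarity -1
slightly_negative = cosine_distance(a, np.array([-0.05, 1.0]))  # similarity just below 0

print(same_direction, orthogonal, opposite, slightly_negative)
```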

Another way to think about it: 1.05 is not a close cosine. So maybe "not close" is ok? Or what does L2 distance say here?
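On the L2 question: for unit-length vectors, the squared Euclidean distance equals twice the cosine distance, since ||x - y||^2 = 2 - 2(x . y). The sketch below checks that identity on random unit vectors (illustrative data only, not the attached x.npy and y.npy), so under the assumption that the embeddings are normalized, L2 would carry the same information.

```python
import numpy as np

# Random unit vectors for illustration; not the vectors from the attachment.
rng = np.random.default_rng(1)
x = rng.standard_normal(16)
y = rng.standard_normal(16)
x /= np.linalg.norm(x)
y /= np.linalg.norm(y)

cos_dist = 1.0 - np.dot(x, y)        # cosine distance for unit vectors
sq_l2 = np.sum((x - y) ** 2)         # squared Euclidean distance

# For unit vectors: ||x - y||^2 = 2 - 2 * (x . y) = 2 * cosine distance
print(cos_dist, sq_l2, np.isclose(sq_l2, 2.0 * cos_dist))
```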

btw, I'm not on the project, and I see you're quite active, so I fully respect your abilities. I just tried to understand a bug report.