Open gsakkis opened 1 year ago
Could you share the distances you are seeing / perhaps a minimal repro?
The normalization occurs here - https://github.com/chroma-core/chroma/blob/e357ef397b3324edec4104756155a8836b03ed65/chromadb/utils/embedding_functions.py#L339
I'm afraid I can't share the documents, but here are the vectors: xy.tar.gz
Repro:
In [1]: from chromadb.utils.distance_functions import cosine
In [2]: import numpy as np
In [3]: x = np.load("x.npy")
In [4]: y = np.load("y.npy")
In [5]: cosine(x, y)
Out[5]: 1.046373932311302
I suspect it may be caused by the conversion of numpy arrays to lists:
(Pdb++) x1 = self._forward(texts)
(Pdb++) np.linalg.norm(x1)
1.0
(Pdb++) x2 = x1.tolist()
(Pdb++) np.linalg.norm(x2)
1.000000000800026
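A small sketch of what the two norms above suggest, assuming the model output is float32 (an assumption; the dtype is not stated in this thread). The vector is random, not the one attached to the issue. `tolist()` widens each float32 to a Python float without changing its value, so the second norm is simply recomputed at higher precision and exposes the rounding left over from normalizing in float32:

```python
import numpy as np

rng = np.random.default_rng(0)  # made-up data, not the vectors from this issue

# Normalize in float32, as an embedding model might (assumption).
v = rng.standard_normal(384).astype(np.float32)
x1 = v / np.linalg.norm(v)

# tolist() yields Python floats (float64); the values are unchanged, only widened.
x2 = x1.tolist()

norm32 = np.linalg.norm(x1)  # result is returned as float32
norm64 = np.linalg.norm(x2)  # list is converted to float64, result is float64

print(type(norm32).__name__, norm32)
print(type(norm64).__name__, norm64)
print("deviation from 1:", abs(norm64 - 1.0))
```

In this sketch the deviation from 1 stays at the level of float32 rounding, which is far too small to account for a distance of 1.05.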
Yes. Are all your distances relatively close to one? This may be a numerical stability issue.
I haven't run a huge number of queries, but based on a small sample the largest distance I've seen was around 1.1 (1.098825). So it's close to 1, but the deviation is many orders of magnitude larger than the one in np.linalg.norm(x2) above.
I don't think this is a bug in chromadb, but a numerical stability issue as HammadB guesses (I'm guessing too)
I tried some different cosine functions.
import numpy as np
from scipy.spatial import distance
from chromadb.utils.distance_functions import cosine
x = np.load("x.npy")
y = np.load("y.npy")
print(f'chromadb cosine: {cosine(x,y)}')
print(f'scipy.distance cosine: {(distance.cosine(x,y))}')
print(f'"raw np" cosine: {(1 - np.dot(x,y)/(np.linalg.norm(x)*np.linalg.norm(y)))}')
Output:
chromadb cosine: 1.046373932311302
scipy.distance cosine: 1.046373932311302
"raw np" cosine: 1.046373932311302
Suggest you close the bug
The above examples are irrelevant; the point is that chromadb should not generate vectors that have a dot product above 1. As I mentioned, the issue is probably caused by converting numpy arrays to Python lists, but it needs further investigation.
If you're ok with a 10% numerical error, feel free to close this.
I understand your point.
cosine should always be in [-1, 1] -- that's mathematically defined
There are three different approaches here (all wrong in terms of the cosine definition, but all agreeing numerically) -- can you see a way to fix it?
Another way to think about it: 1.05 is not a close cosine. So maybe "not close" is ok? Or what does L2 distance say here?
btw, I'm not on the project, and I see you're quite active, so I fully respect your abilities. I just tried to understand a bug report.
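To make the ranges concrete, here is a minimal sketch with made-up 2-D unit vectors (not the ones attached to this issue). It assumes the usual definition of cosine distance as 1 minus cosine similarity, the same formula as the "raw np" line above. Under that definition the similarity lies in [-1, 1] while the distance lies in [0, 2], and for unit vectors the squared L2 distance equals twice the cosine distance:

```python
import numpy as np

def cosine_similarity(x, y):
    return float(np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y)))

def cosine_distance(x, y):
    # Usual definition: 1 - cosine similarity, so the range is [0, 2].
    return 1.0 - cosine_similarity(x, y)

# Hypothetical unit vectors at a slightly obtuse angle (about 92.7 degrees).
theta = np.arccos(-0.046373932311302)
x = np.array([1.0, 0.0])
y = np.array([np.cos(theta), np.sin(theta)])

sim = cosine_similarity(x, y)
dist = cosine_distance(x, y)
sq_l2 = float(np.sum((x - y) ** 2))

print(f"similarity: {sim:.6f}")
print(f"distance:   {dist:.6f}")
print(f"squared L2: {sq_l2:.6f}")
```

Under this definition, a distance of about 1.046 corresponds to a similarity of about -0.046, that is, two normalized vectors at an angle slightly wider than 90 degrees.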
What happened?
I've observed distances greater than one when using cosine distance with the default embedding function. Aren't the vectors normalized?
Versions
Chroma 0.4.13, Python 3.8, MacOS 13.2
Relevant log output
No response