chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0
15.01k stars 1.26k forks source link

[Bug]: Using Cosine distance for a new collection returns L2 (Euclidean) distances. #2668

Closed botchagalupeai closed 2 months ago

botchagalupeai commented 2 months ago

What happened?

I create a collection with the proper metadata for creating Cosine distances and when I do a query my output is always in L2 distances. I have tried both L2 and Cosine and the results are also the same. I also recreated the input data and the collection each time.

chroma_client = chromadb.Client()

# The default is Euclidean distance
#chroma_collection = chroma_client.create_collection("microsoft_annual_report_2022", embedding_function=embedding_function)
chroma_collection = chroma_client.create_collection("microsoft_annual_report_2022x", embedding_function=embedding_function,metadata={"hnsw:space": "cosine"})

ids = [str(i) for i in range(len(token_split_texts))]

chroma_collection.add(ids=ids, documents=token_split_texts)
chroma_collection.count()

Here's the query.

query = "What was the total revenue?"

results = chroma_collection.query(query_texts=[query], n_results=5)
retrieved_documents = results['documents'][0]

# Iterate through the dictionary and print each key with its associated value
for key, value in results.items():
    print(f"{key}:")

    # Check if the value is a list and print its elements
    if isinstance(value, list):
        for i, item in enumerate(value):
            print(f"  Item {i+1}: {item}")
    else:
        # Directly print the value if it's not a list
        print(f"  {value}")

    print()  # Add a newline for better readability

for document in retrieved_documents:
    print((document))
    print('\n')

I've also discussed this issue on Discord before I filled this issue.

discord

Versions

Python 3.10.12 Chroma 0.5.0 Same results in bot Conda 24.7.1 and Google Colab (Pro) ProductName: macOS ProductVersion: 14.5 BuildVersion: 23F79

Relevant log output

ids:
  Item 1: ['321', '293', '331', '319', '194']

distances:
  Item 1: [0.39571714401245117, 0.42433130741119385, 0.4562004804611206, 0.4578591585159302, 0.4580157995223999]

metadatas:
  Item 1: [None, None, None, None, None]

embeddings:
  None

documents:
  Item 1: ['revenue, classified by significant product and service offerings, was as follows : ( in millions ) year ended june 30, 2022 2021 2020 server products and cloud services $ 67, 321 $ 52, 589 $ 41, 379 office products and cloud services 44, 862 39, 872 35, 316 windows 24, 761 22, 488 21, 510 gaming 16, 230 15, 370 11, 575 linkedin 13, 816 10, 289 8, 077 search and news advertising 11, 591 9, 267 8, 524 enterprise services 7, 407 6, 943 6, 409 devices 6, 991 6, 791 6, 457 other 5, 291 4, 479 3, 768 total $ 198, 270 $ 168, 088 $ 143, 015 we have recast certain previously reported amounts in the table above to conform to the way we internally manage and monitor our business.', '74 note 13 — unearned revenue unearned revenue by segment was as follows : ( in millions ) june 30, 2022 2021 productivity and business processes $ 24, 558 $ 22, 120 intelligent cloud 19, 371 17, 710 more personal computing 4, 479 4, 311 total $ 48, 408 $ 44, 141 changes in unearned revenue were as follows : ( in millions ) year ended june 30, 2022 balance, beginning of period $ 44, 141 deferral of revenue 110, 455 recognition of unearned revenue ( 106, 188 ) balance, end of period $ 48, 408 revenue allocated to remaining performance obligations, which includes unearned revenue and amounts that will be invoiced and recognized as revenue in future periods, was $ 193 billion as of june 30, 2022, of which $ 189 billion is related to the commercial portion of revenue. we expect to recognize approximately 45 % of this revenue over the next 12', 'that are not sold separately. • we tested the mathematical accuracy of management ’ s calculations of revenue and the associated timing of revenue recognized in the financial statements.', '82 in addition, certain costs incurred at a corporate level that are identifiable and that benefit our segments are allocated to them. these allocated costs include legal, including settlements and fines, information technology, human resources, finance, excise taxes, field selling, shared facilities services, and customer service and support. each allocation is measured differently based on the specific facts and circumstances of the costs being allocated. segment revenue and operating income were as follows during the periods presented : ( in millions ) year ended june 30, 2022 2021 2020 revenue productivity and business processes $ 63, 364 $ 53, 915 $ 46, 398 intelligent cloud 75, 251 60, 080 48, 366 more personal computing 59, 655 54, 093 48, 251 total $ 198, 270 $ 168, 088 $ 143, 015 operating income', '47 financial statements and supplementary data income statements ( in millions, except per share amounts ) year ended june 30, 2022 2021 2020 revenue : product $ 72, 732 $ 71, 074 $ 68, 041 service and other 125, 538 97, 014 74, 974 total revenue 198, 270 168, 088 143, 015 cost of revenue : product 19, 064 18, 219 16, 017 service and other 43, 586 34, 013 30, 061 total cost of revenue 62, 650 52, 232 46, 078 gross margin 135, 620 115, 856 96, 937 research and development 24, 512 20, 716 19, 269 sales and marketing 21, 825 20, 117 19, 598 general and administrative 5, 900 5, 107 5, 111 operating income 83, 383 69, 916 52, 959 other income, net 333 1, 186 77 income before income taxes 83, 716 71, 102 53, 036 provision for income taxes 10, 978 9, 831 8, 755']

uris:
  None

data:
  None

included:
  Item 1: metadatas
  Item 2: documents
  Item 3: distances

revenue, classified by significant product and service offerings, was as follows : ( in millions ) year ended june 30, 2022 2021 2020 server products and cloud services $ 67, 321 $ 52, 589 $ 41, 379 office products and cloud services 44, 862 39, 872 35, 316 windows 24, 761 22, 488 21, 510 gaming 16, 230 15, 370 11, 575 linkedin 13, 816 10, 289 8, 077 search and news advertising 11, 591 9, 267 8, 524 enterprise services 7, 407 6, 943 6, 409 devices 6, 991 6, 791 6, 457 other 5, 291 4, 479 3, 768 total $ 198, 270 $ 168, 088 $ 143, 015 we have recast certain previously reported amounts in the table above to conform to the way we internally manage and monitor our business.

74 note 13 — unearned revenue unearned revenue by segment was as follows : ( in millions ) june 30, 2022 2021 productivity and business processes $ 24, 558 $ 22, 120 intelligent cloud 19, 371 17, 710 more personal computing 4, 479 4, 311 total $ 48, 408 $ 44, 141 changes in unearned revenue were as follows : ( in millions ) year ended june 30, 2022 balance, beginning of period $ 44, 141 deferral of revenue 110, 455 recognition of unearned revenue ( 106, 188 ) balance, end of period $ 48, 408 revenue allocated to remaining performance obligations, which includes unearned revenue and amounts that will be invoiced and recognized as revenue in future periods, was $ 193 billion as of june 30, 2022, of which $ 189 billion is related to the commercial portion of revenue. we expect to recognize approximately 45 % of this revenue over the next 12

that are not sold separately. • we tested the mathematical accuracy of management ’ s calculations of revenue and the associated timing of revenue recognized in the financial statements.

82 in addition, certain costs incurred at a corporate level that are identifiable and that benefit our segments are allocated to them. these allocated costs include legal, including settlements and fines, information technology, human resources, finance, excise taxes, field selling, shared facilities services, and customer service and support. each allocation is measured differently based on the specific facts and circumstances of the costs being allocated. segment revenue and operating income were as follows during the periods presented : ( in millions ) year ended june 30, 2022 2021 2020 revenue productivity and business processes $ 63, 364 $ 53, 915 $ 46, 398 intelligent cloud 75, 251 60, 080 48, 366 more personal computing 59, 655 54, 093 48, 251 total $ 198, 270 $ 168, 088 $ 143, 015 operating income

47 financial statements and supplementary data income statements ( in millions, except per share amounts ) year ended june 30, 2022 2021 2020 revenue : product $ 72, 732 $ 71, 074 $ 68, 041 service and other 125, 538 97, 014 74, 974 total revenue 198, 270 168, 088 143, 015 cost of revenue : product 19, 064 18, 219 16, 017 service and other 43, 586 34, 013 30, 061 total cost of revenue 62, 650 52, 232 46, 078 gross margin 135, 620 115, 856 96, 937 research and development 24, 512 20, 716 19, 269 sales and marketing 21, 825 20, 117 19, 598 general and administrative 5, 900 5, 107 5, 111 operating income 83, 383 69, 916 52, 959 other income, net 333 1, 186 77 income before income taxes 83, 716 71, 102 53, 036 provision for income taxes 10, 978 9, 831 8, 755
botchagalupeai commented 2 months ago

I think I figured this out. However, from a documentation perspective it can be confusing. From my notes:

Vectors are Already Normalized

This formula shows that the Euclidean distance is directly related to the cosine similarity when the vectors are normalized. If the vectors are very similar (i.e., the angle between them is small), both metrics will produce very similar results.

Small Magnitudes of Difference When the vectors are extremely close to each other, the differences in the metrics may be very small, leading to nearly identical results. This is particularly true if the vectors have small differences in their components, resulting in both L2 distance and cosine similarity producing similar outcomes.

Unless someone has a better answer I'm sticking with this answer.