chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0

[Bug]: Chroma v0.5.0 unnecessarily replaces newline characters with spaces before generating embeddings #2129

Open dasheffie opened 7 months ago

dasheffie commented 7 months ago

What happened?

Chroma v0.5.0 replaces newline characters with spaces before generating embeddings, even though this preprocessing is no longer necessary for post-V1 models. It negatively impacts similarity search results and makes outputs harder to predict (openai issue 418, langchain issue 3853).

In openai issue 418, BorisPower explains that the newline preprocessing should be removed because it is no longer needed for models like "text-embedding-ada-002". However, if you run the code below, you will see that Chroma is still replacing newline characters with spaces before generating embeddings, producing embeddings that differ from those generated by the openai package.
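
For context, the preprocessing in question appears to amount to something like the following (a minimal sketch of the suspected behaviour, not the actual implementation; the function name is hypothetical):

# Suspected preprocessing in Chroma v0.5.0's OpenAI embedding function:
# every newline in the input is replaced with a space before the text is sent to the API.
def suspected_preprocess(texts):
    return [text.replace("\n", " ") for text in texts]

# Under this behaviour "Chroma\nRocks!!!" and "Chroma Rocks!!!" are sent as identical strings,
# which is consistent with the identical embeddings shown in the output below.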

Also, could someone please confirm that the only preprocessing of text that happens in Chroma before embedding is the replacement of newline characters? We do not feel comfortable using a Chroma embedding function for our DB unless the preprocessing is transparent.

import chromadb.utils.embedding_functions as embedding_functions
from openai import AzureOpenAI
import numpy as np
import os
import chromadb
from typing import List

deployment_name_embeddings = "text-embedding-ada-002"

chroma_embedding_api_creds = dict(
    api_type = os.getenv('OPENAI_API_TYPE_EMB'),
    api_base = os.getenv('OPENAI_API_BASE_EMB'),
    api_version = "2024-02-01",
    api_key = os.getenv('OPENAI_API_KEY_EMB'),
)
chroma_embedding_function = embedding_functions.OpenAIEmbeddingFunction(model_name=deployment_name_embeddings, **chroma_embedding_api_creds)

openai_client = AzureOpenAI(
  api_key = os.getenv('OPENAI_API_KEY_EMB'),  
  api_version = "2024-02-01",
  azure_endpoint = os.getenv('OPENAI_API_BASE_EMB')
)

def get_embedding(text, model=deployment_name_embeddings):
    return openai_client.embeddings.create(input = [text], model=model).data[0].embedding

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

client = chromadb.PersistentClient(path='chroma_collection', )
collection = client.create_collection(name='chroma_collection', embedding_function=chroma_embedding_function)

chunks = [
    {
        "page_content": "Chroma Rocks!!!",
        "metadata": {
            "source": "chunk1",
            "token_count": 15
        },
    },
    {
        "page_content": "Chroma\nRocks!!!",
        "metadata": {
            "source": "chunk2",
            "token_count": 15
            },
    },
]

collection.add(
    documents = [chunk_dict['page_content'] for chunk_dict in chunks],
    metadatas = [chunk_dict['metadata'] for chunk_dict in chunks],
    ids = [chunk_dict['metadata']['source'] for chunk_dict in chunks])

collection_output = collection.get(include=['embeddings', ])
for chunk, chroma_collection_embedding in zip(chunks, collection_output['embeddings']):
    chunk['openai_embedding'] = get_embedding(chunk['page_content'])
    chunk['chroma_collection_embedding'] = chroma_collection_embedding
    # chunk['chroma_fn_embedding'] = chroma_embedding_function([chunk['page_content']])

# clean up the collection, then compare embeddings from chroma and openai
client.delete_collection('chroma_collection')
print('First chunk...')
print(f"text from first chunk: {chunks[0]['page_content'][:30]!r}")
print(f"cosine similarity: {round(cosine_similarity(chunks[0]['openai_embedding'], chunks[0]['chroma_collection_embedding']), 5)}")
print(f"First number of `openai` embedding: {chunks[0]['openai_embedding'][0]}")
print(f"First number of `chroma` collecton embedding: {chunks[0]['chroma_collection_embedding'][0]}")
print('Second chunk...')
print(f"text from second chunk: {chunks[1]['page_content'][:30]!r}")
print(f"cosine similarity: {round(cosine_similarity(chunks[1]['openai_embedding'], chunks[1]['chroma_collection_embedding']), 5)}")
print(f"First number of `openai` embedding: {chunks[1]['openai_embedding'][0]}")
print(f"First number of `chroma` collecton embedding: {chunks[1]['chroma_collection_embedding'][0]}")

# output
First chunk...
text from first chunk: 'Chroma Rocks!!!'
cosine similarity: 1.0
First number of `openai` embedding: 0.01531514897942543
First number of `chroma` collection embedding: 0.01531514897942543
Second chunk...
text from second chunk: 'Chroma\nRocks!!!'
cosine similarity: 0.97234
First number of `openai` embedding: 0.023534949868917465
First number of `chroma` collection embedding: 0.01531514897942543

Versions

Chroma v0.5.0, Python 3.11.7, Debian 12

Relevant log output

No response

tazarov commented 7 months ago

@dasheffie, linking the PR for this - #2125

gbarton commented 1 month ago

I ran into this today. Is there a workaround to avoid this behavior? I wanted to use the pageContent (JavaScript API via LangChain) as an actual content store and present the data as it was stored (think user notes on a topic), but the newlines are getting crushed on insert.

I could dump another copy into the metadata, but that seems wasteful?

tazarov commented 1 month ago

@gbarton, you can use Langchain's Embeddings for this. Chroma has an adapter for it:

# pip install chromadb==0.5.13 langchain langchain-openai langchain-chroma
import os
import chromadb
from chromadb.utils.embedding_functions import create_langchain_embedding
from langchain_openai import OpenAIEmbeddings

langchain_embeddings = OpenAIEmbeddings(
    model="text-embedding-3-large",
    api_key=os.environ["OPENAI_API_KEY"],
)
ef = create_langchain_embedding(langchain_embeddings)
client = chromadb.PersistentClient(path="/chroma-data")
collection = client.get_or_create_collection(name="my_collection", embedding_function=ef)

collection.add(ids=["1"],documents=["test document goes here"])
gbarton commented 1 month ago

Thank you for your reply! I do currently use my own embeddings; is it meant to bypass the newline ripping? I forgot to clarify that I'm using langchainjs in a webservice. It's pretty similar, something like:


// missing imports added for completeness; exact module paths may vary with the LangChain JS version
import { VectorStore } from "@langchain/core/vectorstores";
import { Chroma } from "@langchain/community/vectorstores/chroma";
import { OllamaEmbeddings } from "@langchain/community/embeddings/ollama";
import { HtmlToTextTransformer } from "@langchain/community/document_transformers/html_to_text";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { Document } from "@langchain/core/documents";
import { v4 as uuidv4 } from "uuid";

const embeddings = new OllamaEmbeddings({
  model: embeddingModel,
  baseUrl
});

const store = new Chroma(embeddings, {
  collectionName: "store",
  url: endpoint,
});

/**
 * splits the document into smaller pieces
 */
private async split(document: Document) {
  const transformer = new HtmlToTextTransformer();
  const sequence = transformer.pipe(new RecursiveCharacterTextSplitter({
    chunkSize: 1000,
    chunkOverlap: 0,
  }));

  return await sequence.invoke([document]);
}

/**
 * Splits the document into smaller, optimized pieces and writes them to the document store.
 * This method employs a smart splitting strategy to ensure efficient storage of documents.
 */
private async write(document: Document) {
  const docs = await this.split(document);
  const ids = docs.map((d) => uuidv4());

  await store.addDocuments(docs, { ids });
  return docs;
}

async create(doc: CRMDoc) {
  const startTime = new Date().getTime();
  let content = doc.text;
  if (content.length == 0) {
    return; // TODO: notify error
  }

  const { text, ...rest} = doc;

  const document: Document = {
    pageContent: content,
    metadata: rest,
  }

  const docs = await this.write(document);
}

gbarton commented 1 month ago

Actually, using my own embeddings does work. The answer was in my split function: HtmlToTextTransformer or the RecursiveCharacterTextSplitter is also stripping out newlines. Thanks for pointing me in the right direction :)
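
For anyone hitting the same thing, a quick way to confirm which stage is dropping the newlines is to run the splitter on its own with text that contains them and inspect the output. A small Python sketch of that check (the LangChain JS splitter can be exercised the same way):

# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "First note\nSecond note\nThird note"
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
print(splitter.split_text(text))
# If the newlines are already gone from these chunks, the splitter is removing them;
# otherwise look at the HTML-to-text transformation or the vector store insert.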

tazarov commented 1 month ago

@gbarton, I've now added a LangchainJS Embeddings integration that should help with your case: #2945

gbarton commented 1 month ago

Thank you! Much appreciated :)

dan-vine commented 1 month ago

I just want to add that changing newlines to spaces also affects the number of tokens, which makes it impossible to compute whether the original text still fits into the context length of the OpenAI model.

In my case this caused a bug when switching from a different embedding function to the OpenAI one.
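
For illustration, the raw text and the newline-stripped text that actually gets embedded can tokenize to different lengths, which is why a context-length check against the original string may not match what the API receives. A small sketch using tiktoken (exact counts depend on the text):

# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # encoding used by text-embedding-ada-002
raw = "Chroma\nRocks!!!"
sent = raw.replace("\n", " ")  # what Chroma v0.5.0 sends to the API instead
print(len(enc.encode(raw)), len(enc.encode(sent)))
# When the two counts differ, a token-limit check based on the raw text can be off.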