chroma-core / chroma

the AI-native open-source embedding database
https://www.trychroma.com/
Apache License 2.0
15.57k stars 1.3k forks

[Bug]: Sanitize OpenAI input #1503

Open tazarov opened 11 months ago

tazarov commented 11 months ago

What happened?

If texts passed to the OpenAI API break the JSON our wrapper sends, a 400 error is returned. Maybe there is a way we can sanitize the input, or we can leave it to the openai lib to fix that.

https://discord.com/channels/1073293645303795742/1183798062863372318

Versions

Any

Relevant log output

ERROR:root:Error: Error code: 400 - {'error': {'message': "'$.input' is invalid. Please check the API reference: https://platform.openai.com/docs/api-reference.", 'type': 'invalid_request_error', 'param': None, 'code': None}}
Traceback (most recent call last):
  File "/home/matischroder/sanatorio_allende/upload_web_chroma.py", line 94, in <module>
    main()
  File "/home/matischroder/sanatorio_allende/upload_web_chroma.py", line 87, in main
    collection_id = train_with_url(
  File "/home/matischroder/sanatorio_allende/upload_web_chroma.py", line 77, in train_with_url
    raise e
  File "/home/matischroder/sanatorio_allende/upload_web_chroma.py", line 66, in train_with_url
    chroma_collection.add(
  File "/home/matischroder/sanatorio_allende/venv/lib/python3.9/site-packages/chromadb/api/models/Collection.py", line 147, in add
    embeddings = self._embed(input=documents)

HammadB commented 11 months ago

For my understanding - what are the cases where the JSON serialization in OpenAI's lib (assuming this is coming from there) breaks?
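One rough way to probe this, assuming the client serializes the request body with something equivalent to Python's standard `json` module (not confirmed in this thread): check whether awkward inputs survive a round trip. The sample strings and the model name below are illustrative assumptions, not taken from the original report.

```python
import json

# Hypothetical problem inputs; none of these come from the original report.
candidates = [
    "plain text",
    "",                              # empty string
    "quote \" and backslash \\",     # characters that need escaping
    "control \x00 char",             # NUL is escaped as \u0000
    None,                            # serialized as JSON null
]

# The model name is only a placeholder to make the payload look realistic.
payload = json.dumps({"input": candidates, "model": "text-embedding-ada-002"})

# Parse the payload back to confirm that it is well-formed JSON.
roundtrip = json.loads(payload)["input"]
print(roundtrip == candidates)
```

If the round trip succeeds, the payload itself is valid JSON, which would suggest the 400 comes from the API rejecting the values in `input` rather than from serialization breaking.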

tazarov commented 11 months ago

According to elmatero in discord:

[image]

Going to test it out and see if reproducible.

beggers commented 11 months ago

I'm a little confused -- is this an issue with the OpenAI API itself, with OpenAI's Python client, or with our EmbeddingFunction, which wraps OpenAI's Python library?

HammadB commented 11 months ago

I think OpenAI will throw an error if given None. Is that correct, @tazarov?
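If None (or an empty string) does turn out to be the trigger, a pre-check before calling `collection.add` might look like the sketch below. `split_embeddable` is a hypothetical helper, not part of chromadb or the openai library, and rejecting whitespace-only strings is an assumption rather than something this thread has confirmed.

```python
from typing import List, Optional, Sequence, Tuple


def split_embeddable(
    documents: Sequence[Optional[str]],
) -> Tuple[List[str], List[int]]:
    """Hypothetical helper: separate usable documents from suspect ones.

    Returns the documents that are non-empty strings, plus the indices of
    entries that are None, not a str, or empty/whitespace-only.
    """
    clean: List[str] = []
    rejected: List[int] = []
    for index, doc in enumerate(documents):
        if isinstance(doc, str) and doc.strip():
            clean.append(doc)
        else:
            rejected.append(index)
    return clean, rejected


docs = ["hello world", None, "", "   ", "second doc"]
clean, rejected = split_embeddable(docs)
print(clean)     # documents that would be passed on for embedding
print(rejected)  # positions to log or report back to the caller
```

Returning the rejected indices instead of silently dropping entries lets the caller keep `ids` and `metadatas` aligned with the documents that remain.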