dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0
2.38k stars, 154 forks

Don't use Custom Embedding Functions #1771

Closed: Pipboyguy closed this 3 weeks ago

Pipboyguy commented 1 month ago

Description

The OpenAI embedding service doesn't accept empty strings as input. We used to work around this by overriding the whole OpenAIEmbedding function.

This caused more grief than it fixed, since the LanceDB registry doesn't keep track of the custom function well, and parsing and deserialising the Arrow metadata proved very finicky.

We simplify the fix by replacing empty strings with a placeholder that should be semantically dissimilar to 99.9% of queries. Ideally, the embedding vectors of null strings should be pinned at the origin, but this should be handled upstream by LanceDB.
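
A minimal sketch of the placeholder approach, for illustration only. The helper name `sanitize_texts` and the placeholder value are hypothetical and are not taken from this PR. Treating `None` the same as an empty string is also an assumption.

```python
from typing import List, Optional, Sequence

# Hypothetical placeholder value; the PR does not state the one actually used.
EMPTY_TEXT_PLACEHOLDER = "___EMPTY___"


def sanitize_texts(
    texts: Sequence[Optional[str]],
    placeholder: str = EMPTY_TEXT_PLACEHOLDER,
) -> List[str]:
    """Return a copy of `texts` with None and empty strings replaced.

    Non-empty strings are passed through unchanged, so the embedding
    service never receives an empty input.
    """
    return [text if text else placeholder for text in texts]
```

The sanitized list can then be handed to the stock embedding function, so no custom embedding function needs to be registered.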

The default vector column name is also changed to "vector", matching LanceDB's default vector column name, to make onboarding and setup easier.

Related Issues

Additional Context

See https://github.com/lancedb/lancedb/issues/1577

netlify[bot] commented 1 month ago

Deploy Preview for dlt-hub-docs ready!

Name Link
Latest commit 9a347e63d95f8fa451190a42ed9c0f4f33fca769
Latest deploy log https://app.netlify.com/sites/dlt-hub-docs/deploys/66d3421eb1e42d00088bb147
Deploy Preview https://deploy-preview-1771--dlt-hub-docs.netlify.app