SuperDuperDB / superduperdb

🔮 SuperDuperDB: Bring AI to your database! Build, deploy and manage any AI application directly with your existing data infrastructure, without moving your data. Including streaming inference, scalable model training and vector search.
https://superduperdb.com
Apache License 2.0

[BUG]: Vector search fails when overriding `_id` #750

Closed duarteocarmo closed 7 months ago

duarteocarmo commented 10 months ago

Contact Details [Optional]

me@duarteocarmo.com

System Information

{ "cfg": { "apis": { "providers": {}, "retry": { "stop_after_attempt": 2, "wait_max": 10.0, "wait_min": 4.0, "wait_multiplier": 1.0 } }, "cdc": false, "dask": { "password": "", "port": 8786, "username": "", "ip": "localhost", "deserializers": [], "serializers": [], "local": true }, "data_layers": { "artifact": { "cls": "mongodb", "connection": "pymongo", "kwargs": { "password": "", "port": 27017, "username": "", "host": "localhost" }, "name": "_filesystem:test_db" }, "data_backend": { "cls": "mongodb", "connection": "pymongo", "kwargs": { "password": "", "port": 27017, "username": "", "host": "localhost" }, "name": "test_db" }, "metadata": { "cls": "mongodb", "connection": "pymongo", "kwargs": { "password": "", "port": 27017, "username": "", "host": "localhost" }, "name": "test_db" } }, "distributed": false, "logging": { "level": "INFO", "type": "STDERR", "kwargs": {} }, "model_server": { "password": "", "port": 5001, "username": "", "host": "127.0.0.1" }, "notebook": { "ip": "0.0.0.0", "password": "", "port": 8888, "token": "" }, "server": { "host": "127.0.0.1", "port": 3223, "protocol": "http" }, "vector_search": { "host": "localhost", "password": "", "port": 19530, "type": { "backfill_batch_size": 100, "inmemory": true }, "backfill_batch_size": 100, "username": "" }, "verbose": false, "downloads": { "hybrid": false, "root": "data/downloads" } }, "cwd": "/Users/duarteocarmo/Repos/thechangelogbot-backend", "git": { "branch": "('branch', '--show-current') failed with [Errno 2] No such file or directory: 'branch'", "commit": "('show', '-s', '--format=\"%h: %s\"') failed with [Errno 2] No such file or directory: 'show'" }, "hostname": "duartes-macbook-pro.home", "os_uname": [ "Darwin", "duartes-macbook-pro.home", "22.4.0", "Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:58 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6020", "arm64" ], "package_versions": {}, "platform": { "platform": "macOS-13.3.1-arm64-arm-64bit", "python_version": "3.10.11" }, "startup_time": 
"2023-08-22 18:15:16.679710", "superduper_db_root": "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages", "sys": { "argv": [ "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/main.py", "info" ], "path": [ "/Users/duarteocarmo/Repos/thechangelogbot-backend", "/Users/duarteocarmo/.asdf/installs/python/3.10.11/lib/python310.zip", "/Users/duarteocarmo/.asdf/installs/python/3.10.11/lib/python3.10", "/Users/duarteocarmo/.asdf/installs/python/3.10.11/lib/python3.10/lib-dynload", "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages", "/Users/duarteocarmo/Repos/thechangelogbot-backend/src" ] } }

What happened?

I really hate being a pain in the *ss, guys, but here goes: when I override the _id field of the documents I insert through pymongo, I'm not able to run the vector search.

Why am I overriding it? Because I would like to only send new items to the DB, and avoid storing things that I have already stored. In this particular case the _id is based on a hash of the text to embed.
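
A minimal sketch of the kind of hash-based `_id` scheme described above. The issue does not show the actual hash function, so SHA-256 truncated to 24 hex characters is an assumption (chosen only because the ids in the example data are 24 hex characters long), and `text_to_id` and `only_new` are hypothetical helper names.

```python
import hashlib


def text_to_id(text: str) -> str:
    """Derive a deterministic 24-character hex id from the text (assumed scheme)."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return digest[:24]  # same length as the ids in the example data


def only_new(records: list[dict], known_ids: set[str]) -> list[dict]:
    """Attach an _id to each record and keep only those not seen before."""
    seen = set(known_ids)  # copy so the caller's set is not mutated
    fresh = []
    for record in records:
        _id = text_to_id(record["text"])
        if _id not in seen:
            seen.add(_id)
            fresh.append({**record, "_id": _id})
    return fresh
```

Because the id depends only on the text, re-running an ingestion job over the same transcripts yields the same ids, and records whose ids are already known can be skipped before they reach the database.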

See the problematic lines with comments.

I tried digging into what happened, and it seems that the _outputs key is simply not there.

import sentence_transformers
import superduperdb as s
from pymongo import MongoClient
from superduperdb.container.document import Document
from superduperdb.container.listener import Listener
from superduperdb.container.model import Model
from superduperdb.container.vector_index import VectorIndex
from superduperdb.db.mongodb.query import Collection
from superduperdb.ext.numpy.array import array

MODEL_NAME = "all-MiniLM-L6-v2"
VECTOR_SIZE = 384

IDENTIFIER_ID = "my-index"
COLLECTION_NAME = "docs"

client = MongoClient("localhost", 27017)
db = s.superduper(client.documents)
collection = Collection(name=COLLECTION_NAME)

data = [
    {
        "podcast_name": "news",
        "episode_number": 29,
        "text": "**Jerod Santo:** What up nerds, I'm Jerod and this is Changelog News for the week of Monday, January 30th 2023! Our Monday news brief experiment has been pretty successful,",
        "speaker": "Jerod Santo",
        "_id": "fcef4745e8acbac770020c2c",  # if commented out, works as expected
        "num_words": 29,
    },
    {
        "podcast_name": "news",
        "episode_number": 29,
        "text": "**Jerod Santo:** That means it'll get its own name and its own podcast feed, amongst other changes and improvements. Curious: would you be upset by this? Happy? Would you subscribe to this as a separate podcast or nah? Let me know in the comments, or on the socials (jerodsanto on twitter, jerod@changelog.social on Mastodon), or via email at jerod@changelog.com Oh, and of course if you're subscribed to Changelog++ or our master feed, you might not even notice this change, except for new show art and stuff like that. So that's cool. Ok, let's get into the news.",
        "speaker": "Jerod Santo",
        "_id": "bd1ffb9f1a66edede3802b51",  # if commented out, works as expected
        "num_words": 97,
    },
    {
        "podcast_name": "news",
        "episode_number": 29,
        "text": "**Jerod Santo:** Jeremia Kimelman, a data scientist and recovering web developer living in Sacramento, California, took stock of his \"data tool belt\", writing up twelve software projects and companies he uses all the time as a working data journalist. Side note: this style blog post is awesome. It's always interesting to learn what tools people are using and why. Also, they're pretty easy to write. You just look around at what you use on the regular, make a list, and write a bit about each tool. If you have blogger's block, well, now you don't! Ok, back to Jeremia. He broke his tools down into five categories: general-use, web scraping, geospatial, website, and tools that are also companies. Some of these you may be familiar with, like D3 and lodash, but others are more obscure: like cheerio, and turfjs. Check out the full list on his blog and if you end up taking stock of your own tool belt, let us know about it, will ya?",
        "speaker": "Jerod Santo",
        "_id": "15fddd23446658677a745c97",  # if commented out, works as expected
        "num_words": 166,
    },
]

model = Model(
    identifier=MODEL_NAME,
    object=sentence_transformers.SentenceTransformer(MODEL_NAME),
    encoder=array("float32", shape=(VECTOR_SIZE,)),
    predict_method="encode",
    batch_predict=True,
)

db.add(
    VectorIndex(
        identifier=IDENTIFIER_ID,
        indexing_listener=Listener(
            model=model,
            key="text",
            select=Collection(name=COLLECTION_NAME).find(),
        ),
    )
)

print(db.show("listener"))
print(db.show("model"))
print(db.show("vector_index"))

data = [Document(r) for r in data]
db.execute(collection.insert_many(data))

query = "fighting"
cur = db.execute(
    Collection(name=COLLECTION_NAME).like(
        {"text": query}, n=2, vector_index=IDENTIFIER_ID
    )
)

for r in cur:
    print(r["text"])
    print("======")

Steps to reproduce

Run the example.

Relevant log output

INFO:sentence_transformers.SentenceTransformer:Load pretrained SentenceTransformer: all-MiniLM-L6-v2
INFO:sentence_transformers.SentenceTransformer:Use pytorch device: cpu
INFO:root:Adding model all-MiniLM-L6-v2 to db
WARNING:root:model/all-MiniLM-L6-v2/0 already exists - doing nothing
INFO:root:Done.
0it [00:00, ?it/s]
Batches: 0it [00:00, ?it/s]
INFO:root:loading hashes: 'my-index'
Loading vectors into vector-table...: 0it [00:00, ?it/s]
['all-MiniLM-L6-v2/text']
['all-MiniLM-L6-v2']
['my-index']
INFO:root:found 0 uris
INFO:root:Adding model all-MiniLM-L6-v2 to db
WARNING:root:model/all-MiniLM-L6-v2/0 already exists - doing nothing
INFO:root:Done.
Batches: 0it [00:00, ?it/s]
INFO:root:loading hashes: 'my-index'
Loading vectors into vector-table...: 0it [00:00, ?it/s]
Traceback (most recent call last):
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/ola.py", line 80, in <module>
    cur = db.execute(
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/db/base/db.py", line 280, in execute
    return self.like(query)
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/db/base/db.py", line 338, in like
    return like(self)
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/db/mongodb/query.py", line 222, in __call__
    ids, scores = db._select_nearest(
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/db/base/db.py", line 1001, in _select_nearest
    vector_index = self.vector_indices[vector_index]
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/db/base/db.py", line 1020, in __missing__
    value = self[key] = self.database.load(self.field, key)
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/db/base/db.py", line 516, in load
    m.on_load(self)
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/container/vector_index.py", line 86, in on_load
    self._initialize_vector_database(db)
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/container/vector_index.py", line 195, in _initialize_vector_database
    h = record.outputs(key, self.indexing_listener.model.identifier)  # type: ignore[union-attr]
  File "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/container/document.py", line 43, in outputs
    document = self.unpack()[_OUTPUTS_KEY][key][model]

jieguangzhou commented 10 months ago

    @override
    def select_using_ids(self, ids: t.Sequence[str]) -> Find:
        args = [*self.args, {}, {}][:2]
        args[0] = {'_id': {'$in': [ObjectId(_id) for _id in ids]}, **args[0]}

        return Find(
            collection=self.collection,
            like_parent=self.like_parent,
            args=args,
            kwargs=self.kwargs,
        )

SuperDuperDB converts each id to an ObjectId before querying the data. If the stored _id values are strings, a query that uses ObjectId values matches nothing, so the documents cannot be found.
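
The mismatch can be illustrated without a running database. This is a sketch only: `FakeObjectId` is a hypothetical stand-in for `bson.ObjectId`, and the dictionary lookup merely imitates the fact that MongoDB treats a string `_id` and an ObjectId `_id` as different values even when the hex digits are identical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeObjectId:
    """Hypothetical stand-in for bson.ObjectId: a distinct type wrapping hex digits."""
    hex: str


# Documents stored under *string* ids, as in the report above.
stored = {"fcef4745e8acbac770020c2c": {"speaker": "Jerod Santo"}}


def find_by_ids(store: dict, ids: list) -> list:
    """Return documents whose key equals one of `ids`; equality is type-sensitive."""
    return [store[i] for i in ids if i in store]


ids = ["fcef4745e8acbac770020c2c"]
as_strings = find_by_ids(stored, ids)                               # one match
as_object_ids = find_by_ids(stored, [FakeObjectId(i) for i in ids])  # no match
```

The second lookup finds nothing even though the hex digits are the same, which is the situation `select_using_ids` ends up in when the collection holds string ids.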


I think we should not convert the _id format before querying the data: if the collection already exists, we probably cannot change its schema.
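
A minimal sketch of what "not converting" could look like, assuming the ids reach the query in the same type in which they were stored. `build_id_filter` is a hypothetical helper, not part of SuperDuperDB; it only mirrors the filter construction in `select_using_ids` above with the `ObjectId(...)` call dropped.

```python
from typing import Any, Optional, Sequence


def build_id_filter(ids: Sequence[Any], base_filter: Optional[dict] = None) -> dict:
    """Build a MongoDB-style filter that matches `ids` without changing their type."""
    # Same merge order as the original code: keys in base_filter are applied last.
    return {"_id": {"$in": list(ids)}, **(base_filter or {})}


example = build_id_filter(
    ["fcef4745e8acbac770020c2c", "bd1ffb9f1a66edede3802b51"],
    {"speaker": "Jerod Santo"},
)
```

The trade-off is that this only works if the vector index hands back ids in their original type. If it stringifies ObjectId values, collections that use the default ObjectId `_id` would stop matching instead.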

Alternatively, we could fix this example by checking whether each inserted document contains an _id and, if so, converting it to an ObjectId.

WDYT @thejumpman2323 @duarteocarmo