Closed duarteocarmo closed 7 months ago
@override
def select_using_ids(self, ids: t.Sequence[str]) -> Find:
    # Pad args to exactly two entries (filter, projection).
    args = [*self.args, {}, {}][:2]
    # Every id is cast to ObjectId here, whatever type the collection stores.
    args[0] = {'_id': {'$in': [ObjectId(_id) for _id in ids]}, **args[0]}
    return Find(
        collection=self.collection,
        like_parent=self.like_parent,
        args=args,
        kwargs=self.kwargs,
    )
SuperDuperDB converts each id to ObjectId before querying the data, so when a collection stores _id as a string, searching with an ObjectId-format _id finds nothing.
I think we should not convert the _id format before querying the data, because if there is an existing collection, we probably cannot change its schema.
Alternatively, we could fix this example by checking whether the inserted document contains an _id and, if so, converting it to ObjectId.
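To make the trade-off concrete, here is a rough sketch of a possible middle ground between the two options above. It is not what SuperDuperDB does today. The helper name `build_id_filter` and the `to_object_id` parameter are made up for illustration, and the sketch assumes that an id should only be cast when it looks like a 24-character hex ObjectId string.

```python
import re
import typing as t

# An ObjectId written as a string is exactly 24 hexadecimal characters.
_OBJECT_ID_HEX = re.compile(r"[0-9a-fA-F]{24}")


def build_id_filter(
    ids: t.Sequence[str],
    to_object_id: t.Callable[[str], t.Any],
) -> dict:
    """Build an `_id` filter that matches string ids and ObjectId ids.

    Hypothetical helper: `to_object_id` is whatever callable turns a hex
    string into an ObjectId (for example `bson.ObjectId`).
    """
    candidates: list = []
    for _id in ids:
        # Always keep the raw string so custom string ids still match.
        candidates.append(_id)
        # Only cast values that can actually be an ObjectId.
        if _OBJECT_ID_HEX.fullmatch(_id):
            candidates.append(to_object_id(_id))
    return {"_id": {"$in": candidates}}
```

With pymongo installed, this would be called as `build_id_filter(ids, bson.ObjectId)`. The cost is a longer `$in` list, but existing collections would not need a schema change.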
WDYT @thejumpman2323 @duarteocarmo
Contact Details [Optional]
me@duarteocarmo.com
System Information
{ "cfg": { "apis": { "providers": {}, "retry": { "stop_after_attempt": 2, "wait_max": 10.0, "wait_min": 4.0, "wait_multiplier": 1.0 } }, "cdc": false, "dask": { "password": "", "port": 8786, "username": "", "ip": "localhost", "deserializers": [], "serializers": [], "local": true }, "data_layers": { "artifact": { "cls": "mongodb", "connection": "pymongo", "kwargs": { "password": "", "port": 27017, "username": "", "host": "localhost" }, "name": "_filesystem:test_db" }, "data_backend": { "cls": "mongodb", "connection": "pymongo", "kwargs": { "password": "", "port": 27017, "username": "", "host": "localhost" }, "name": "test_db" }, "metadata": { "cls": "mongodb", "connection": "pymongo", "kwargs": { "password": "", "port": 27017, "username": "", "host": "localhost" }, "name": "test_db" } }, "distributed": false, "logging": { "level": "INFO", "type": "STDERR", "kwargs": {} }, "model_server": { "password": "", "port": 5001, "username": "", "host": "127.0.0.1" }, "notebook": { "ip": "0.0.0.0", "password": "", "port": 8888, "token": "" }, "server": { "host": "127.0.0.1", "port": 3223, "protocol": "http" }, "vector_search": { "host": "localhost", "password": "", "port": 19530, "type": { "backfill_batch_size": 100, "inmemory": true }, "backfill_batch_size": 100, "username": "" }, "verbose": false, "downloads": { "hybrid": false, "root": "data/downloads" } }, "cwd": "/Users/duarteocarmo/Repos/thechangelogbot-backend", "git": { "branch": "('branch', '--show-current') failed with [Errno 2] No such file or directory: 'branch'", "commit": "('show', '-s', '--format=\"%h: %s\"') failed with [Errno 2] No such file or directory: 'show'" }, "hostname": "duartes-macbook-pro.home", "os_uname": [ "Darwin", "duartes-macbook-pro.home", "22.4.0", "Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:58 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6020", "arm64" ], "package_versions": {}, "platform": { "platform": "macOS-13.3.1-arm64-arm-64bit", "python_version": "3.10.11" }, "startup_time": 
"2023-08-22 18:15:16.679710", "superduper_db_root": "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages", "sys": { "argv": [ "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages/superduperdb/main.py", "info" ], "path": [ "/Users/duarteocarmo/Repos/thechangelogbot-backend", "/Users/duarteocarmo/.asdf/installs/python/3.10.11/lib/python310.zip", "/Users/duarteocarmo/.asdf/installs/python/3.10.11/lib/python3.10", "/Users/duarteocarmo/.asdf/installs/python/3.10.11/lib/python3.10/lib-dynload", "/Users/duarteocarmo/Repos/thechangelogbot-backend/.env/lib/python3.10/site-packages", "/Users/duarteocarmo/Repos/thechangelogbot-backend/src" ] } }
What happened?
I really hate being a pain in the *ss, guys. But here goes. When overriding the _id field in pymongo, I'm not able to run the vector search.
Why am I overriding it? Because I would like to send only new items to the DB and avoid storing things that I have already stored. In this particular case the _id is based on a hash of the text to embed. See the problematic lines with comments.
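For context, the hash-based _id idea looks roughly like the sketch below. This is a simplified illustration and not the code from the example: the function names, the choice of SHA-256, and the `text` field are all assumptions.

```python
import hashlib
import typing as t


def text_to_id(text: str) -> str:
    """Derive a deterministic string _id from the text (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def new_documents(
    texts: t.Iterable[str],
    existing_ids: t.Iterable[str],
) -> t.List[dict]:
    """Return documents whose hash-based _id is not already stored."""
    seen = set(existing_ids)
    docs = []
    for text in texts:
        _id = text_to_id(text)
        if _id in seen:
            # Same text was stored (or queued) before, so skip it.
            continue
        seen.add(_id)
        docs.append({"_id": _id, "text": text})
    return docs
```

Only the documents returned by `new_documents` would be inserted, so the same text is never stored twice. Note that the resulting _id is a plain string, not an ObjectId.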
I tried digging into what happened, but it seems that the _outputs key is simply not there.
Steps to reproduce
Run the example.
Relevant log output