SuperDuperDB / superduperdb

🔮 SuperDuperDB: Bring AI to your database! Build, deploy and manage any AI application directly with your existing data infrastructure, without moving your data. Including streaming inference, scalable model training and vector search.
https://superduperdb.com
Apache License 2.0
4.55k stars 445 forks

[BUG-0.2.0]: Errors When Inserting Artifact Encodable Type Data #2125

Closed jieguangzhou closed 1 month ago

jieguangzhou commented 1 month ago

System Information

main

What happened?

Inserting data whose field uses an Artifact Encodable datatype raises errors. Two problems were found:

  1. The data is not converted on encode and decode.
  2. Additional information is dropped when the data is saved through ibis.
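
The first point can be illustrated without superduperdb at all. The sketch below is a hypothetical stand-in, not the library's code: `encode` and `decode` are illustrative names that simply wrap `pickle.dumps` and `pickle.loads`. It shows that decoding only works on values that actually went through the encode step. Whether this is the exact failure behind the traceback in this issue is an assumption, since the log is cut off before the exception message.

```python
import pickle


def encode(value):
    """Hypothetical encoder: serialize any picklable object to bytes."""
    return pickle.dumps(value)


def decode(blob):
    """Hypothetical decoder: the inverse of encode()."""
    return pickle.loads(blob)


record = {"a": [1, 3, 5, 7], "b": [2, 4, 6, 8]}

# Symmetric path: encode on insert, decode on select.
stored = encode(record)
round_tripped = decode(stored)

# Asymmetric path: the value was stored without being encoded,
# but the read side still tries to decode it.
try:
    decode(record)
    decode_error = None
except TypeError as exc:
    decode_error = exc
```

In the symmetric path `round_tripped` equals `record`; in the asymmetric path `pickle.loads` rejects the non-bytes input with a `TypeError`.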

Steps to reproduce

import pandas as pd

from superduperdb import Schema, superduper
from superduperdb.components.datatype import pickle_serializer
from superduperdb.components.table import Table

# Connect to an in-memory MongoDB mock.
db = superduper("mongomock://test_db")

df = pd.DataFrame(
    [{"a": 1, "b": 2}, {"a": 3, "b": 4}, {"a": 5, "b": 6}, {"a": 7, "b": 8}]
)

# Register a table whose field "x" is stored with the pickle serializer.
schema = Schema(identifier="schema", fields={"x": pickle_serializer})
table_or_collection = Table("documents", schema=schema)
db.apply(table_or_collection)

# Insert the DataFrame; this is the call that appears in the traceback.
collection = db["documents"]
collection.insert([{"x": df}]).execute()

# Read the row back and check that the DataFrame round-trips unchanged.
loaded_df = list(db.execute(collection.select()))[0].unpack()["x"]

assert df.equals(loaded_df)

Relevant log output

2024-May-30 21:22:38.06| INFO     | zhouhaha-2.local| superduperdb.base.build:69   | Data Client is ready. mongomock.MongoClient('localhost', 27017)
2024-May-30 21:22:38.07| INFO     | zhouhaha-2.local| superduperdb.base.build:42   | Connecting to Metadata Client with engine:  mongomock.MongoClient('localhost', 27017)
2024-May-30 21:22:38.07| INFO     | zhouhaha-2.local| superduperdb.base.build:155  | Connecting to compute client: None
2024-May-30 21:22:38.07| INFO     | zhouhaha-2.local| superduperdb.base.datalayer:85   | Building Data Layer
2024-May-30 21:22:38.08| INFO     | zhouhaha-2.local| superduperdb.base.build:220  | Configuration:
 +---------------+---------------------+
| Configuration |        Value        |
+---------------+---------------------+
|  Data Backend | mongomock://test_db |
+---------------+---------------------+
2024-May-30 21:22:38.08| WARNING  | zhouhaha-2.local| superduperdb.misc.annotations:117  | add is deprecated and will be removed in a future release.
2024-May-30 21:22:38.09| WARNING  | zhouhaha-2.local| superduperdb.misc.annotations:117  | add is deprecated and will be removed in a future release.
2024-May-30 21:22:38.09| INFO     | zhouhaha-2.local| superduperdb.backends.local.compute:37   | Submitting job. function:<function callable_job at 0x1046a7740>
Traceback (most recent call last):
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/ipy.py", line 19, in <module>
    collection.insert([{"x": df}]).execute()
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/superduperdb/backends/base/query.py", line 380, in execute
    return self.db.execute(self, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/superduperdb/base/datalayer.py", line 326, in execute
    return self._insert(query, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/superduperdb/base/datalayer.py", line 391, in _insert
    return inserted_ids, self.refresh_after_update_or_insert(
                         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/superduperdb/base/datalayer.py", line 449, in refresh_after_update_or_insert
    task_workflow.run_jobs()
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/superduperdb/jobs/task_workflow.py", line 67, in run_jobs
    job(
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/superduperdb/jobs/job.py", line 154, in __call__
    self.submit(dependencies=dependencies)
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/superduperdb/jobs/job.py", line 127, in submit
    self.future = self.db.compute.submit(
                  ^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/superduperdb/backends/local/compute.py", line 38, in submit
    future = function(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/superduperdb/jobs/tasks.py", line 108, in callable_job
    raise e
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/superduperdb/jobs/tasks.py", line 103, in callable_job
    output = function_to_call(*args, db=db, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/superduperdb/misc/download.py", line 404, in download_content
    documents = list(db.execute(select))
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/superduperdb/base/cursor.py", line 72, in __next__
    return Document.decode(r, db=self.db, schema=self.schema)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/superduperdb/base/document.py", line 139, in decode
    r = schema.decode_data(r)
        ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/superduperdb/components/schema.py", line 124, in decode_data
    decoded[k] = field.decode_data(data[k])
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/superduperdb/components/component.py", line 389, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/superduperdb/components/datatype.py", line 265, in decode_data
    return self.decoder(item, info=info) if self.decoder else item
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zhouhaha/workspace/SuperDuperDB/superduperdb/superduperdb/components/datatype.py", line 75, in pickle_decode
    return pickle.loads(b)
           ^^^^^^^^^^^^^^^

jieguangzhou commented 1 month ago

Currently, Artifacts are saved and read differently depending on whether they are used as a component's property or as data. This inconsistency leads to a series of hidden issues.
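
One way to picture the consistency being asked for: both paths go through a single pair of save and load helpers, so an artifact written either way reads back identically and keeps its extra information. Everything in this sketch (`ArtifactStore`, `save`, `load`, the keys, and the `info` dictionary) is hypothetical and is not superduperdb's API or the fix that was eventually applied.

```python
import pickle


class ArtifactStore:
    """Hypothetical in-memory store; stands in for a real backend."""

    def __init__(self):
        self._entries = {}

    def save(self, key, value, info=None):
        # Keep the encoded payload and its extra info together
        # so neither is lost on the way to storage.
        self._entries[key] = {"bytes": pickle.dumps(value), "info": info or {}}

    def load(self, key):
        entry = self._entries[key]
        return pickle.loads(entry["bytes"]), entry["info"]


store = ArtifactStore()

# Path 1: artifact used as a component's property.
store.save("component/model/weights", [0.1, 0.2], info={"kind": "property"})

# Path 2: artifact used as data in a row.
store.save("documents/0/x", {"a": 1, "b": 2}, info={"kind": "data"})

weights, weights_info = store.load("component/model/weights")
row_value, row_info = store.load("documents/0/x")
```

Because both paths share the same helpers, there is only one encode step and one decode step to keep in sync.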