SuperDuperDB / superduperdb

🔮 SuperDuperDB: Bring AI to your database! Build, deploy and manage any AI application directly with your existing data infrastructure, without moving your data. Including streaming inference, scalable model training and vector search.
https://superduperdb.com
Apache License 2.0
4.54k stars, 444 forks

Optimize the encoding efficiency of `encode()` #2165

Open jieguangzhou opened 2 weeks ago

jieguangzhou commented 2 weeks ago

For the built-in Leaf objects in superduperdb, we can reduce the amount of encoded information by replacing their full definitions with special references.

For example

Currently, this code:

from superduperdb.components.datatype import pickle_serializer
from superduperdb import Document
Document({'id': 123, 'x': pickle_serializer('This is a test')}).encode()

produces:

{'id': 123,
 'x': '?866cf8526595d3620d6045172fb16d1efefac4b1',
 '_builds': {'pickle': {'_path': 'superduperdb/components/datatype/get_serializer',
   'method': 'pickle',
   'encodable': 'artifact',
   'type_id': 'datatype',
   'version': None,
   'uuid': '6b928f3c-ccfa-43eb-96ee-ae38bd8430e3'},
  '866cf8526595d3620d6045172fb16d1efefac4b1': {'_path': 'superduperdb/components/datatype/Artifact',
   'uuid': 'b28469b8-cb63-4df1-972c-b17d11eb5abd',
   'datatype': '?pickle',
   'uri': None,
   'blob': '&:blob:866cf8526595d3620d6045172fb16d1efefac4b1'}},
 '_files': {},
 '_blobs': {'866cf8526595d3620d6045172fb16d1efefac4b1': b'\x80\x04\x95\x12\x00\x00\x00\x00\x00\x00\x00\x8c\x0eThis is a test\x94.'}}

The proposal is to reduce this to:


{'id': 123,
 'x': '?866cf8526595d3620d6045172fb16d1efefac4b1',
 '_builds': {'866cf8526595d3620d6045172fb16d1efefac4b1': {'_path': 'superduperdb/components/datatype/Artifact',
   'uuid': 'b28469b8-cb63-4df1-972c-b17d11eb5abd',
   'datatype': '&:superduperdb:datatype:pickle',
   'uri': None,
   'blob': '&:blob:866cf8526595d3620d6045172fb16d1efefac4b1'}},
 '_files': {},
 '_blobs': {'866cf8526595d3620d6045172fb16d1efefac4b1': b'\x80\x04\x95\x12\x00\x00\x00\x00\x00\x00\x00\x8c\x0eThis is a test\x94.'}}

Furthermore, we could even remove _builds:866cf8526595d3620d6045172fb16d1efefac4b1, because everything it describes is built-in. With a better protocol, the encoding could eventually become something like:

{'id': 123,
 'x': '&:protocol:{Artifact(datatype=&datatype/pickle, blob=&:blob:866cf8526595d3620d6045172fb16d1efefac4b1)}',
 '_blobs': {'866cf8526595d3620d6045172fb16d1efefac4b1': b'\x80\x04\x95\x12\x00\x00\x00\x00\x00\x00\x00\x8c\x0eThis is a test\x94.'}}
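
The first reduction step above (dropping the built-in `pickle` build and pointing at it by reference) can be sketched as a plain post-processing pass over the encoded dict. This is only an illustration, not how superduperdb implements `encode()`: the names `compact_builtin_datatypes`, `BUILTIN_DATATYPES` and `BUILTIN_PREFIX` are hypothetical, and the set of built-in datatypes is an assumption.

```python
import copy

# Assumption: datatypes that can be rebuilt from the superduperdb codebase alone.
# The real list is not given in this issue.
BUILTIN_DATATYPES = {"pickle"}
# Reference prefix taken from the proposed encoding above.
BUILTIN_PREFIX = "&:superduperdb:datatype:"


def compact_builtin_datatypes(encoded, builtin=BUILTIN_DATATYPES):
    """Return a copy of `encoded` with built-in datatype builds replaced by references."""
    out = copy.deepcopy(encoded)
    builds = out.get("_builds", {})
    # Collect and drop the build entries that describe built-in datatypes.
    removed = {
        key for key, build in builds.items()
        if key in builtin and build.get("type_id") == "datatype"
    }
    for key in removed:
        del builds[key]
    # Rewrite '?<name>' pointers in the remaining builds to built-in references.
    for build in builds.values():
        for field, value in list(build.items()):
            if isinstance(value, str) and value.startswith("?") and value[1:] in removed:
                build[field] = BUILTIN_PREFIX + value[1:]
    return out


blob_id = "866cf8526595d3620d6045172fb16d1efefac4b1"
encoded = {
    "id": 123,
    "x": "?" + blob_id,
    "_builds": {
        "pickle": {
            "_path": "superduperdb/components/datatype/get_serializer",
            "method": "pickle",
            "encodable": "artifact",
            "type_id": "datatype",
            "version": None,
            "uuid": "6b928f3c-ccfa-43eb-96ee-ae38bd8430e3",
        },
        blob_id: {
            "_path": "superduperdb/components/datatype/Artifact",
            "uuid": "b28469b8-cb63-4df1-972c-b17d11eb5abd",
            "datatype": "?pickle",
            "uri": None,
            "blob": "&:blob:" + blob_id,
        },
    },
    "_files": {},
    "_blobs": {blob_id: b"\x80\x04\x95\x12\x00\x00\x00\x00\x00\x00\x00\x8c\x0eThis is a test\x94."},
}

compact = compact_builtin_datatypes(encoded)
print(compact["_builds"])
```

The pass leaves the input untouched and only rewrites pointers to builds it actually removed, so user-defined datatypes would keep their full `_builds` entries.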

Ultimately, this protocol should have the following characteristics:

  1. Improve the compression of the encoded information by referencing what is already available from:

    1. db.metadata, such as &:component:
    2. db.artifact, such as &:blob: / &:file:
    3. superduperdb’s codebase, such as &:new_type:
    4. ...
  2. The encoded information should be readable and meaningful.
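
To make the reference idea more concrete, here is a minimal sketch of how `&:<kind>:<identifier>` strings could be parsed and dispatched to a resolver per kind. The functions `parse_reference` and `resolve` and the resolver table are hypothetical and are not part of superduperdb; the only thing taken from this issue is the `&:` prefix and the kinds shown above.

```python
def parse_reference(value):
    """Split '&:<kind>:<identifier>' into (kind, identifier); return None otherwise."""
    if not isinstance(value, str) or not value.startswith("&:"):
        return None
    # Only the first ':' after the prefix separates kind from identifier,
    # so identifiers may themselves contain ':' (e.g. 'datatype:pickle').
    kind, sep, identifier = value[2:].partition(":")
    if not sep or not kind or not identifier:
        return None
    return kind, identifier


def resolve(value, resolvers):
    """Replace a reference string with the object its resolver returns."""
    parsed = parse_reference(value)
    if parsed is None:
        return value  # not a reference, keep the value as-is
    kind, identifier = parsed
    if kind not in resolvers:
        raise KeyError(f"no resolver registered for reference kind {kind!r}")
    return resolvers[kind](identifier)


# Hypothetical resolver table: blobs come from the document's own '_blobs' mapping.
blobs = {"866cf8526595d3620d6045172fb16d1efefac4b1": b"example bytes"}
resolvers = {"blob": blobs.__getitem__}

print(parse_reference("&:superduperdb:datatype:pickle"))
print(resolve("&:blob:866cf8526595d3620d6045172fb16d1efefac4b1", resolvers))
```

Keeping the kind as a short readable token (`blob`, `file`, `component`, ...) is what lets the encoded form stay both compact and meaningful, in line with point 2.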