decentralized-identity / keripy

Python Implementation of the KERI Core Libraries
Apache License 2.0

Keyspace Object Mapper (KOM) #147

Open SmithSamuelM opened 3 years ago

SmithSamuelM commented 3 years ago

LMDB Keyspace Python Object Mapper (KOM)

The current keripy use of LMDB has been mostly limited to KELs for the KERI core. However, in developing applications that sit on top of the KERI core, it is useful to have a more generic CRUD-like database interface. LMDB is a lexicographically ordered key-value store (arguably the most performant of this class). The Python wrapper over the LMDB C language implementation is very low level and does not provide any syntactic sugar to make it easy to map Python object instances to values in the database. This is a proposed design for a Python factory class that maps Python dataclass instances, as serialized values, to entries in the key space of a given LMDB database. Hence, for short, in this proposal we are calling it a Keyspace Object Mapper (KOM) (or Key-value-store Object Mapper).

Python dataclasses as combined schema and instances for serializable data

Support for Python dataclasses was added in Python 3.7. Essentially a dataclass provides a convenient way of creating instances according to a data schema. The data schema uses the Python type hints in the class definition. This provides a very convenient and compact initialization method that does not require any external schema to declare the attribute types. The class definition serves as both schema definition and attribute declaration.

For example, keripy currently uses dataclasses in the keri.base.keeping module to define the schema for serializing information about (public, private) key pairs and for managing those keys in an LMDB key store. Here is the PubLot class definition:

from dataclasses import asdict, dataclass, field
from typing import Union

@dataclass()
class PubLot:
    """
    Public key list with indexes and datetime created
    """
    pubs: list = field(default_factory=list)  # list of fully qualified Base64 public keys. defaults to empty .
    ridx: int = 0  # index of rotation (est event) that uses public key set
    kidx: int = 0  # index of key in sequence of public keys
    st:   Union[str, int, list] = '0' # signing threshold as either:
                    # int or str hex of int such as '2' or list of weights
    dt:   str = ""  # datetime ISO8601 when key set created

    def __iter__(self):
        return iter(asdict(self))

The @dataclass decorator converts the class definition into a compliant class with an __init__ method etc. It provides syntactic sugar that translates the field declarations in the class definition into an __init__ method that creates instance attributes for each declared field. The dataclass definition syntax is essentially a schema definition for members of that class.

To instantiate a PubLot just call the class with arguments corresponding to the defined fields. For example:

nxt=PubLot(pubs=[signer.verfer.qb64 for signer in nsigners], ridx=ridx+1, kidx=kidx+len(icodes), st=nst, dt=dt)
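For a self-contained illustration, here is the same call with literal placeholder values (the public key string and datetime below are made up, not real values):

nxt = PubLot(pubs=["DKxy2sgzfplyr-tgwIxS19f2OchFHtLwPWD3v4oYimBx"],
             ridx=1, kidx=3, st='1', dt="2021-01-01T00:00:00.000000+00:00")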

To serialize, use the dataclasses.asdict() function to convert the instance to a dictionary that may be serialized with JSON, MsgPack, or CBOR.

val=json.dumps(asdict(nxt)).encode("utf-8")
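For a flat dataclass like PubLot, the round trip back is just dict unpacking of the deserialized value:

nxt2 = PubLot(**json.loads(val.decode("utf-8")))
assert nxt2 == nxt  # dataclasses generate __eq__ by default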

Python dataclasses may have nested schema definitions. For example, the keri.base.keeping.PreSit dataclass has fields that are PubLot dataclass instances:


@dataclass()
class PreSit:
    """
    Prefix's public key situation (sets of public keys)
    """
    old: PubLot = field(default_factory=PubLot)  # previous publot
    new: PubLot = field(default_factory=PubLot)  # newly current publot
    nxt: PubLot = field(default_factory=PubLot)  # next public publot

    def __iter__(self):
        return iter(asdict(self))

A PreSit instance may be initialized and serialized the same way.

ps = PreSit(new=PubLot(pubs=[verfer.qb64 for verfer in verfers],
                       ridx=ridx, kidx=kidx, st=cst, dt=dt),
            nxt=PubLot(pubs=[signer.verfer.qb64 for signer in nsigners],
                       ridx=ridx+1, kidx=kidx+len(icodes), st=nst, dt=dt))

val=json.dumps(asdict(ps)).encode("utf-8")
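Note that the dict unpacking shortcut shown above for flat dataclasses does not recurse. Unpacking the deserialized dict into PreSit leaves the nested fields as plain dicts rather than PubLot instances, which is the problem a recursive deserialization helper has to solve:

ps2 = PreSit(**json.loads(val.decode("utf-8")))
assert isinstance(ps2.new, dict)  # a plain dict, not a PubLot instance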

KOM

A notional KOM class is a database factory that creates instances of CRUD database mappers. Each instance has a defined database schema for the entries in that database. The schema is expressed as a Python dataclass. The database mapper handles the serialization and deserialization of database entries and hides those details behind its CRUD-like protocol of methods, namely .get, .put, and .del.

Each KOM instance with a unique schema dataclass has its own LMDB sub database or subdb. This is a feature of LMDB environments that allows partitioning of the key space. Each LMDB environment defines a master database that includes its whole key space. Sub databases are essentially prefixes to keys in the key space that partition the master database key space. This enables each sub database to be treated like a table of a unique type. A minimal sketch of how named sub databases work in the python lmdb binding follows (the path and sub database name are illustrative):
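import lmdb

env = lmdb.open("/tmp/keri_db", max_dbs=8)  # named sub dbs must be reserved up front
records = env.open_db(b"records.")  # keys in this sub db are isolated from the master key space
with env.begin(db=records, write=True) as txn:
    txn.put(b"key0001", b'{"first": "Susan"}')
    assert txn.get(b"key0001") == b'{"first": "Susan"}'

With that partitioning in hand, a notional Komer might look like this: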

import json
from dataclasses import dataclass, asdict
from typing import Type

from keri.db.dbing import LMDBer  # keripy's LMDB wrapper base class

class Komer(LMDBer):
    """
    Keyspace Object Mapper factory class
    """
    def __init__(self, schema: Type[dataclass], subdb: str, kind: str = 'JSON', **kwa):
        """
        Parameters:
            schema (dataclass):  reference to Class definition for dataclass sub class
            subdb (str):  LMDB sub database key
            kind (str):  serialization kind such as 'JSON'
        """
        super(Komer, self).__init__(**kwa)
        self.schema = schema
        self.subdb = self.env.open_db(key=subdb.encode("utf-8"))

    def put(self, keys: tuple, data: dataclass):
        """
        Parameters:
            keys (tuple): of key strs to be combined in order to form key
            data (dataclass): instance of dataclass of type self.schema as value
        """
        if not isinstance(data, self.schema):
            raise ValueError("Invalid schema type={} of data={}, expected {}."
                             "".format(type(data), data, self.schema))
        self.putVal(db=self.subdb,
                    key=":".join(keys).encode("utf-8"),
                    val=json.dumps(asdict(data)).encode("utf-8"))

@dataclass
class Record:
    first: str  # first name
    last: str   # last name
    street: str  # street address
    city: str   # city name
    state: str  # state code
    zip: int    # zip code

    def __iter__(self):
        return iter(asdict(self))

mydb = Komer(schema=Record, subdb='records')

sue = Record(first="Susan", 
             last="Black", 
             street="100 Main Street",
             city="Riverton",
             state="UT",
             zip=84058)

mydb.put(keys=("skskjgoshkdh", "0001"), data=sue)

SmithSamuelM commented 3 years ago

The kind parameter allows different serialization/deserialization types. This can be implemented by creating a ._dumps method that is used instead of explicitly calling json.dumps:

self._dumps = json.dumps  # or the like
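For example, a minimal sketch of how __init__ might bind the serializer based on kind (the ._loads counterpart and the msgpack and cbor2 packages are assumptions here, not existing keripy code):

if kind == 'JSON':
    self._dumps = lambda d: json.dumps(d, separators=(",", ":"),
                                       ensure_ascii=False).encode("utf-8")
    self._loads = lambda raw: json.loads(raw.decode("utf-8"))
elif kind == 'MGPK':
    import msgpack  # assumes the msgpack package is installed
    self._dumps = msgpack.dumps
    self._loads = msgpack.loads
elif kind == 'CBOR':
    import cbor2  # assumes the cbor2 package is installed
    self._dumps = cbor2.dumps
    self._loads = cbor2.loads
else:
    raise ValueError("Unsupported serialization kind = {}".format(kind))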

SmithSamuelM commented 3 years ago

The datify function in keri.help.helping converts a dict back into a dataclass instance:

import dataclasses

def datify(cls, d):
    """
    Returns instance of dataclass cls converted from dict d
    Parameters:
    cls is dataclass class
    d is dict
    """
    try:
        fieldtypes = {f.name: f.type for f in dataclasses.fields(cls)}
        return cls(**{f: datify(fieldtypes[f], d[f]) for f in d})  # recursive
    except:
        return d  # Not a dataclass field

ps = helping.datify(PreSit, json.loads(bytes(rawsit).decode("utf-8")))

This provides the deserialization side that would be used by the Komer.get method.
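A minimal sketch of what Komer.get could look like, assuming LMDBer provides a getVal counterpart to the putVal used above (getVal is an assumption here):

def get(self, keys: tuple):
    """
    Returns dataclass instance of type self.schema at keys,
    or None if no entry exists.
    """
    raw = self.getVal(db=self.subdb, key=":".join(keys).encode("utf-8"))
    if raw is None:
        return None
    return helping.datify(self.schema, json.loads(bytes(raw).decode("utf-8")))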

SmithSamuelM commented 3 years ago

The work item is to flesh out the .get and .del methods and to support JSON and MGPK, and maybe CBOR and Pickle.

m00sey commented 3 years ago

I've started this; del is reserved, so it's rem right now.
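A minimal sketch of rem under the same assumption as the get sketch above, i.e. that LMDBer provides a delVal counterpart to putVal (delVal is an assumption here):

def rem(self, keys: tuple):
    """
    Deletes entry at keys. Returns True if an entry was deleted,
    False otherwise. Named rem because del is a reserved word in Python.
    """
    return self.delVal(db=self.subdb, key=":".join(keys).encode("utf-8"))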

SmithSamuelM commented 3 years ago

I just pushed the skeleton code above to keripy. There were some subtleties in the Python code that I fixed that might have misled you.

https://github.com/decentralized-identity/keripy/pull/148

Anyway, the incomplete Komer object is in keri.base.basing and a basic test skeleton is in tests.base.test_basing.

But what is there passes the tests, so it shouldn't be misleading any more. What I wrote above was just a sketch.

SmithSamuelM commented 3 years ago

So Sophy does a couple of other things, like supporting slicing, which could be added to the Komer object. It also supports backup of the database, which should be added to the LMDBer object, not the Komer object. The way Komer is set up, it may be used with any LMDBer database, so it could be added on to an existing database or a newly created one. Each Komer instance acts as a table in the db to which it is attached, but in its own key space.

Using Python dataclasses is a more powerful and expressive approach to schema than the approach Sophy uses, which requires defining custom classes whose schema is not apparent anywhere, as it is with a dataclass definition.

Dataclasses give us a lot of power down the road. For example, with dataclasses.make_dataclass() we could create a dataclass definition from a string that is parsed into tuples and then passed into make_dataclass(). We may never need to do that for our internal apps, but it would allow some declarative coding for customization down the road. A minimal sketch (the field tuples here are illustrative):
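import dataclasses

# build a dataclass at runtime from parsed (name, type) tuples
Point = dataclasses.make_dataclass("Point", [("x", int), ("y", int)])
p = Point(x=1, y=2)
assert dataclasses.asdict(p) == {"x": 1, "y": 2}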

SmithSamuelM commented 3 years ago

The JSON serializer could be made more compact by getting rid of whitespace and using UTF-8 instead of escaping non-ASCII characters:

>>> d = dict(a=1, b=2, c=3)
>>> json.dumps(d)
'{"a": 1, "b": 2, "c": 3}'
>>> json.dumps(d, separators=(",", ":"), ensure_ascii=False)
'{"a":1,"b":2,"c":3}'