commontoolsinc / synopsys

datastore service for datums
0 stars 1 forks source link

User defined relations #16

Open Gozala opened 4 days ago

Gozala commented 4 days ago

Me and @seefeldb have explored interesting direction in which recipe is a effectively user defined relation https://github.com/Gozala/datalogia/issues/38

Which is really cool because it makes it possible to use them in queries, however there is a serious drawback, that is synopsys will only be able to execute such queries if it can run recipes that in turn imposes runtime that can load and execute such operators.

Perhaps operators as WASM blobs is going to be reasonably easy to do, but that sure does increase implementation bar.

⚠️ It is also unclear how that would affect DBSP. On the other hand if we materialize the dream of datomic like architecture where queries run in the client running arbitrary functions will be fairly trivial.

Gozala commented 14 hours ago

I have sketched out a rough interface of what formula's in WASM could look like. Intention is to allow formulas to model arbitrary relations e.g. one could say [?item rga.list ?item-list] and rga/list here would a program that provides Formula interface.

use std::collections::HashMap;

// Define the Scalar type as an enum
#[derive(Clone, Debug)]
enum Scalar {
    Null,
    Boolean(bool),
    String(String),
    Float(f64),
    Int(i64),
    Bytes(Vec<u8>),
    Entity(Entity),
}

// Define the Entity struct
#[derive(Clone, Debug)]
struct Entity(Vec<u8>);

// Define the Datum type as a tuple struct
#[derive(Clone, Debug)]
struct Datum {
    entity: Entity,
    attribute: String,
    value: Scalar,
    cause: Entity,
}

// DB range query type
#[derive(Clone, Debug)]
enum RangeQuery {
    ByEntity { entity: Entity, attribute: Option<String>, value: Option<Scalar> },
    ByAttribute { attribute: String, entity: Option<Entity>, value: Option<Scalar> },
    ByValue { value: Scalar, entity: Option<Entity>, attribute: Option<String> },
}

// Define the Formula trait
trait Formula {
    // Input is always a set of named scalars. Could be a record but I'm not sure how to
    // enforce that every field must be a scalar.
    type Input: Vec<(String, Scalar)>;
    // Output is also a set of named scalars, here as well it could be a record if we can
    // enforce constraint that all fields be scalars.
    type Output: Vec<(String, Scalar)>;
    // Local formula state used through the the transformation process.
    type State;

    // Init is called by the query engine when corresponig clause
    // is encountered. Here implementation can produce initial state
    // range query to request datums it will process.
    fn init(input: Self::Input) -> (Self::State, RangeQuery);

    // Step is called with set of datums that matched the range query. It may be called
    // iteratively passing batches of datums. At each call implementation can produce
    // derived set of facts and new state to be passed in subsequent step.
    fn step(state: &Self::State, datums: Vec<Datum>) -> (Self::State, Vec<Self::Output>);

    // Once all datums had been stepped through method is called allowing implementation
    // to flush out all the remaining datums if any.
    fn end(state: &Self::State) -> Vec<Self::Output>;
}
Gozala commented 14 hours ago

Moving a question from hackmd where this sketch lived prior

it's this because under the hood we model an ephemeral entity? as in: we could also have an identifier for an ephemeral input and output entity and then this array becomes triples on those.

one possible thing this could do is not require names for formulas that take only one scalar as input or output. e.g. an "rot13" formula that just reads a string and writes a string.

i thought this was the difference between

[ "?in", rot13, "?out" ] and [ { str: "?in" }, rot13, { str: "?out" } ]

Gozala commented 14 hours ago

Moving a question from hackmd where this sketch lived prior

it's this because under the hood we model an ephemeral entity? as in: we could also have an identifier for an ephemeral input and output entity and then this array becomes triples on those. one possible thing this could do is not require names for formulas that take only one scalar as input or output. e.g. an "rot13" formula that just reads a string and writes a string. i thought this was the difference between [ "?in", rot13, "?out" ] and [ { str: "?in" }, rot13, { str: "?out" } ]

Mostly this was to keep interface simple and not have to deal with the fact that we could operate on a unit of scalars and single scalar. I though we could pick a default name when it is a single scalar to keep it the interface simple

seefeldb commented 14 hours ago

Ah, in practice I've used operating on scalars pretty often and it was nice to not have to use a default name. Could be moved to sugar, but also feels like it could be here directly.

Gozala commented 14 hours ago

Ah, in practice I've used operating on scalars pretty often and it was nice to not have to use a default name. Could be moved to sugar, but also feels like it could be here directly.

I mean that query engine will do the boxing under the hood to keep formula implementations simple. This would be transparent for the query

Gozala commented 14 hours ago

I want to call out that proposed sketch for formula interface is just an educated guess of what might work, based on how actual attributes are implemented in the query engine. In other words with this interface I could potentially implement [?e attr("foo"), ?v] in user space as a formula.

That is to say it may turn out that interface has flaws and may not work for all use cases we may want to support. Furthermore @seefeldb already called out that we may want to have different interfaces for stateful and stateless formulas because:

  1. From IFC perspective there is a significant difference in capabilities.
  2. Stateless formulas can have a lot more optimized implementation

I think we need to do some research before deciding on the interface & here are few things I would suggest

jsantell commented 14 hours ago

I have sketched out a rough interface of what formula's in WASM could look like. Intention is to allow formulas to model arbitrary relations e.g. one could say [?item rga.list ?item-list] and rga/list here would a program that provides Formula interface.

use std::collections::HashMap;

// Define the Scalar type as an enum
#[derive(Clone, Debug)]
enum Scalar {
    Null,
    Boolean(bool),
    String(String),
    Float(f64),
    Int(i64),
    Bytes(Vec<u8>),
    Entity(Entity),
}

// Define the Entity struct
#[derive(Clone, Debug)]
struct Entity(Vec<u8>);

// Define the Datum type as a tuple struct
#[derive(Clone, Debug)]
struct Datum {
    entity: Entity,
    attribute: String,
    value: Scalar,
    cause: Entity,
}

// DB range query type
#[derive(Clone, Debug)]
enum RangeQuery {
    ByEntity { entity: Entity, attribute: Option<String>, value: Option<Scalar> },
    ByAttribute { attribute: String, entity: Option<Entity>, value: Option<Scalar> },
    ByValue { value: Scalar, entity: Option<Entity>, attribute: Option<String> },
}

// Define the Formula trait
trait Formula {
    // Input is always a set of named scalars. Could be a record but I'm not sure how to
    // enforce that every field must be a scalar.
    type Input: Vec<(String, Scalar)>;
    // Output is also a set of named scalars, here as well it could be a record if we can
    // enforce constraint that all fields be scalars.
    type Output: Vec<(String, Scalar)>;
    // Local formula state used through the the transformation process.
    type State;

    // Init is called by the query engine when corresponig clause
    // is encountered. Here implementation can produce initial state
    // range query to request datums it will process.
    fn init(input: Self::Input) -> (Self::State, RangeQuery);

    // Step is called with set of datums that matched the range query. It may be called
    // iteratively passing batches of datums. At each call implementation can produce
    // derived set of facts and new state to be passed in subsequent step.
    fn step(state: &Self::State, datums: Vec<Datum>) -> (Self::State, Vec<Self::Output>);

    // Once all datums had been stepped through method is called allowing implementation
    // to flush out all the remaining datums if any.
    fn end(state: &Self::State) -> Vec<Self::Output>;
}

Sketch of this in wit form:

package common:formula@0.0.1;

interface types {
  type %string = string;
  type boolean = bool;
  type float = f64;
  type int = s64;
  type buffer = list<u8>;

  resource entity {}
  resource state {}

  variant scalar {
    null,
    %string(%string),
    boolean(boolean),
    buffer(buffer),
    float(float),
    int(int),
    entity(entity),
  }

  record datum {
    attribute: string,
    value: scalar,
    cause: entity,
    entity: entity,
  }

  record entity-range-query {
    entity: entity,
    attribute: option<string>,
    value: option<scalar>,
  }

  record attribute-range-query {
    entity: option<entity>,
    attribute: string,
    value: option<scalar>,
  }

  record value-range-query {
    entity: option<entity>,
    attribute: option<string>,
    value: scalar,
  }

  variant range-query {
    entity(entity-range-query),
    attribute(attribute-range-query),
    value(value-range-query),
  }

  variant instruction {
    assert(scalar),
    retract(scalar),
    %import(scalar),
  }
}

world virtual-module {
  use types.{scalar, datum, instruction, state, range-query};
  import types;

  export set-source: func(source: string) -> result<_, string>;

  export init: func(input: list<tuple<string, scalar>>) -> result<tuple<state, range-query>, string>; 
  export step: func(state: state, datums: list<datum>) -> result<tuple<state, list<tuple<string, instruction>>>, string>;
  export end: func(state: state) -> result<list<tuple<string, instruction>>, string>;
}
Gozala commented 14 hours ago

I have being thinking that along with PUT request to install subscription queries we could also allow installing user defined relations a.k.a formulas into the system. I'm thinking something along the lines of PUT with a following payload would do the trick

{
    formula: {
       http: {
          url: "http://localhosh:8080/"
          headers: {
              content-type: 'application/vnd.ipld.dag-json'
              accept: 'application/vnd.ipld.dag-json'
          }
       }
    }
}

Where provided URL would be expected to provide following routes

POST init/
PATCH step/
PATCH end/

I would expect that things to get encoded in dag-json because:

  1. Currently entities are represented as IPLD links
  2. We'd need support for bytearrays that dag-json specifies

Alternatively we could support dag-cbor, although in practice I find it's harder to debug / inspect such systems and I think we could always adopt it in the future.

Gozala commented 14 hours ago

Considering over HTTP formulas I think it would make sense to change interface to keep it simpler there. I think solana made a right design choices by defining ABI in terms of binary transactions in that spirit I would propose modified interface.

interface Session<State> {
  state: State
  query: RangeQuery
}

interface Update<State> {
   state: State
   effect: Instruction[]
}

interface Patch<State> {
   state: State
   datums: Datum[]
}

interface Formula<State> {
   init(input: Record<string, Scalar>): Session<State>
   step(input: Patch<State>): Update<State>
   end(input: Patch<State>):  Update<State>
}
seefeldb commented 13 hours ago

Can we add a small layer of indirection here by introducing content handlers, including one for HTTP and one for WebAssembly? But also for JS or Python or so, where the isolation method is defined separately? This will become more important as we described how a formula is defined, where it comes from, etc.

The above could work, though it would help me to spell out how exactly the common case of a simple transformation would look like, and maybe how a reducer would look like, for something written in JS (assuming some amount of TBD sugar).