Gozala / datalogia

Library for querying in-memory facts using datalog
5 stars 1 forks source link

Abstraction for describing relations #38

Open Gozala opened 2 months ago

Gozala commented 2 months ago

Most queries had been limited to matches over [entity, attribute, value] triples and I have often found myself using hypothetical "synthetic attributes" like [?x == ?y] or [?title .startsWith "# "] in various examples.

More recently idea of relational operators had been implemented as a proof of concept https://github.com/Gozala/datalogia/pull/37

This issue is to capture notes while exploring this direction

Gozala commented 2 months ago

While working on https://github.com/Gozala/datalogia/pull/37 I have recognized that there are several interesting cases that are distinct enough that should either have different notation or some clear way to consolidate into one.

Unidirectional relations

There is a whole class of unary operators that derive output from input that can not be reversed e.g. string/case/upper derives another string, but you can not reverse it as it is a lossy.

Implication here is that expressions like [?input string/case/upper ?output] can only evaluated when ?input is bound and can not be evaluated when we have ?output an ?output and not an ?input.

Bidirectional relations

Whole class of operations unary operations can be reversed, basically all of the uniry math and binary operators fall into this. Also any kind of lossless encoding like string/from/utf8, string/to/utf8 or ==.

Unlike unidirectional relations when we have [?x == ?y] it does not matter which out of the two variables is bound we could go in either direction here so order of clause execution is irrelevant.

Predicates / Constraints

There is also whole set of binary operators (as in it has two parameters) that are affectively predicates where we do not care about the output beyond it having one. For example text.contains(segment) would be a lot more natural to express as [text, 'string/contains', segment] as opposed to [{ text, segment }, 'string/contains', true].

Although similar to unidirectional relations these are lossy and therefor can only execute if left operand is bound.

Gozala commented 2 months ago

It is worth recognizing that there is something here resembling propagators, that do offer ways to propagate constraints in either direction and lets cells handle the unification of constraints. That is to say perhaps we could model variables more like cells in the propagators that way we could evaluate both bidirectional relations and predicates even when we do not have data on the left side.

On the other hand it is not clear what the result of the query should be in the face of partiality, that is when we only know substrings of ?text, but not the text itself.

In other words we may have queries that do not produce bindings for certain variables, although it is worth calling out that you could already do something like that if you have some variables that only get used in some disjuncts and not the other.

Gozala commented 2 months ago

Relatedly I have also being realizing that more classical Datalog systems can express this relations as rules that are derived and persisted. Datomic on the other hand uses rules only to derive ephemeral relations that is they are not stored and are only used for queries. All writes instead occur through transactions.

It feels like we want some hybrid when used with UI applications. So that in queries all the derived relations are ephemeral to query, yet running application runtime could transact those to be persisted.

Gozala commented 2 months ago

Another thought that crossed my mind was that perhaps attributes should be simply modeled as bidirectional relations in the above topology rather than being a special case.

Gozala commented 2 months ago

I have sketched out a rough interface of what formula's in WASM could look like. Intention is to allow formulas to model arbitrary relations e.g. one could say [?item rga.list ?item-list] and rga/list here would a program that provides Formula interface.

use std::collections::HashMap;

// Define the Scalar type as an enum
#[derive(Clone, Debug)]
enum Scalar {
    Null,
    Boolean(bool),
    String(String),
    Float(f64),
    Int(i64),
    Bytes(Vec<u8>),
    Entity(Entity),
}

// Define the Entity struct
#[derive(Clone, Debug)]
struct Entity(Vec<u8>);

// Define the Datum type as a tuple struct
#[derive(Clone, Debug)]
struct Datum {
    entity: Entity,
    attribute: String,
    value: Scalar,
    cause: Entity,
}

// DB range query type
#[derive(Clone, Debug)]
enum RangeQuery {
    ByEntity { entity: Entity, attribute: Option<String>, value: Option<Scalar> },
    ByAttribute { attribute: String, entity: Option<Entity>, value: Option<Scalar> },
    ByValue { value: Scalar, entity: Option<Entity>, attribute: Option<String> },
}

// Define the Formula trait
trait Formula {
    // Input is always a set of named scalars. Could be a record but I'm not sure how to
    // enforce that every field must be a scalar.
    type Input: Vec<(String, Scalar)>;
    // Output is also a set of named scalars, here as well it could be a record if we can
    // enforce constraint that all fields be scalars.
    type Output: Vec<(String, Scalar)>;
    // Local formula state used through the the transformation process.
    type State;

    // Init is called by the query engine when corresponig clause
    // is encountered. Here implementation can produce initial state
    // range query to request datums it will process.
    fn init(input: Self::Input) -> (Self::State, RangeQuery);

    // Step is called with set of datums that matched the range query. It may be called
    // iteratively passing batches of datums. At each call implementation can produce
    // derived set of facts and new state to be passed in subsequent step.
    fn step(state: &Self::State, datums: Vec<Datum>) -> (Self::State, Vec<Self::Output>);

    // Once all datums had been stepped through method is called allowing implementation
    // to flush out all the remaining datums if any.
    fn end(state: &Self::State) -> Vec<Self::Output>;
}
Gozala commented 2 months ago

One idea that came to my mind after talking about this with @jsantell about the fact that integrating wasm modules into node would be a tricky business, is that same interface could be provided by an HTTP REST interface and perhaps that is a better way to model it from the get go. If things can run in a same memory space we simply can skip encode/decode steps.

Gozala commented 2 months ago

I should also point out that above Formula definition is probably limited enough that implementing actual RGA would not be viable, because RGA requires recursive traversal which probably can not be captured in a single range query. So the sketch needs to be revisited if we actually want to implement rga as a formula.