Gozala / datalogia

Library for querying in-memory facts using datalog
5 stars 1 forks source link

Support for aggregators #28

Open Gozala opened 1 month ago

Gozala commented 1 month ago

Aggregators are not only convenient but necessary to do certain things e.g. count matching relations or collect them into a collection. SICP does not have this (unless I've missed it) which is why it was not implemented here.

Thinking more about it I think I have an implementation idea that fits with SICP design.

  1. It would be an evaluateAggregate path inside the evaluator.
  2. It will consume all frames and either emit 1 with aggregate members or none if there was a miss-match.
  3. It will extend a Clause with {Aggregate: Aggregate} where Aggregate is like

    type Aggregate<T, U> = {
        out: Var<U>
        input: Var<T>
    
        state: U
        merge: (state: U, input: T) => Result<U>
     }
  4. Evaluator will feed input from every frame and compute output.
  5. Query planner will need to sort clauses such that aggregates are handled after corresponding inputs otherwise aggregation will fail.

Unknowns

It is not clear to me what to do with variables that got aggregated. If they are in the selector that would imply that we need to produce frame per each variable, but if they are not in the selector it would make sense to produce only one frame instead.

Should look into datomic / datascript to see what they do here

Gozala commented 1 month ago

Just realized that datomic does not have aggregators in :where clause and instead has them in:find. This is interesting because it eliminates the problem in queries like

const x = DB.variable()
const xs = DB.varibale()

export query = {
   select: { x, xs },
   where: [
       DB.match([x, 'type', 'list/item']),
       DB.aggregate.count(x, xs)
   ]
}

Because they become more like this

const x = DB.variable()
const xs = DB.varibale()

export const query = {
   select: { x, xs: DB.aggregate.count(x) },
   where: [
       DB.match([x, 'type', 'list/item'])
   ]
}

Howover in datomic when aggregation is used other bindings are used for grouping

Gozala commented 1 month ago

I encountered interesting flow in my design here is an example dataset

const groceries = DB.Link.of({ name: 'Groceries' })
const milk = DB.Link.of({ title: 'Buy Milk' })
const eggs = DB.Link.of({ title: 'Buy Eggs' })
const bread = DB.Link.of({ title: 'Buy Bread' })

const chores = DB.Link.of({ name: 'Chores' })
const laundry = DB.Link.of({ title: 'Do Laundry' })
const dishes = DB.Link.of({ title: 'Do Dishes' })

const db = DB.Memory.create([
  [groceries, 'name', 'Groceries'],
  [milk, 'title', 'Buy Milk'],
  [eggs, 'title', 'Buy Eggs'],
  [bread, 'title', 'Buy Bread'],
  [groceries, 'todo', milk],
  [groceries, 'todo', eggs],
  [groceries, 'todo', bread],
  [chores, 'name', 'Chores'],
  [laundry, 'title', 'Do Laundry'],
  [dishes, 'title', 'Do Dishes'],
  [chores, 'todo', laundry],
  [chores, 'todo', dishes],
])

If we run query like this on it

const list = DB.link()
const item = DB.link()
const title = DB.string()
const name = DB.string()
const todo = DB.variable()

DB.query(db, {
  select: {
    list,
    name,
    todo,
  },
  where: [
    DB.match([list, 'name', name]),
    DB.match([list, 'todo', item]),
    DB.match([item, 'title', title]),
    includes(todo, item),
  ],
})

We are not going to get

[
  { list: groceries, name: 'Groceries', todo: [milk, egg, bread] },
  { list: chores, name: 'Chores', todo: [laundry, dishes]
]

Instead we get

{
  { list: chores, name: 'Chores', todo: [milk, egg, bread, laundry, dishes] },
}

Although in datascript with this query

[:find ?list ?name (distinct ?child)
   :where
   [?list "name" ?name]
   [?list "todo" ?item]
   [?item "title", ?title]
   [(tuple ?item ?title) ?child]]

We get following as proper grouping takes place

[
  [1 "Groceries" [[2 "Buy Milk"] [3 "Buy Eggs"] [4 "Buy Bread"]]]
  [5 "Chores" [[6 "Do Laundry"] [7 "Do Dishes"]]]
]
Gozala commented 1 month ago

I think datomic attempts to unify frames except for the member that is aggregated, this also explains why :find ?list ?name (distinct ?child) ?title produces distinct results as frames do not unify

[
  [1 "Groceries" [[2 "Buy Milk"]] "Buy Milk"]
  [1 "Groceries" [[3 "Buy Eggs"]] "Buy Eggs"]
  [1 "Groceries" [[4 "Buy Bread"]] "Buy Bread"]
  [5 "Chores" [[6 "Do Laundry"]] "Do Laundry"]
  [5 "Chores" [[7 "Do Dishes"]] "Do Dishes"]
]
Gozala commented 1 month ago

We can copy datomic here and push aggregation into select or we can make our aggregation where clause do frame unification, although I'm not sure if that is how it would work in regular datalog.

Gozala commented 1 month ago

I end up implementing simple solution for now that looks like this

DB.query(db, {
  select: {
    id: list,
    name,
    todo: [{id: item, title}]
  },
  where: [
    DB.match([list, 'name', name]),
    DB.match([list, 'todo', item]),
    DB.match([item, 'title', title]),
  ],
})

ℹ️ An array wrapper [{id: item, title}] in the selector implies that matches must be collected

Gozala commented 1 month ago

@seefeldb brought up an interesting question about modeling empty collections. Which in would typically manifest as either [] or undefined. Here however as illustrated by #35 above query would produce no results because second where clause would eliminate prior matches.

This is really annoying and tricky because:

  1. We did aggregation in selector (like datomic) as opposed to where clause) and there is no obvious way to make it return [] because selection happens after matching, meaning frames will be eliminated before selection runs.
    • We could in theory add more where clause to deal with that, but it becomes here even in this example because all of the clause containing item would need to be moved into optional group.
    • In contrast if we were to proceed with aggregation in where clause as I was pursuing originally we would had to be explicit about optionality.
Gozala commented 1 month ago

Perhaps way to go about it would be to develop support for datomic pull like query interface. That would potentially give us more flexibility on how to interpret things e.g. same example in pull inspired notation could look something like

[
   "id": {},
   "name": {},
   "todo": {
      "title": {
          "default": "untitled"
       },
       "done": {
          "default": false,
       }
   }
]

Reverse lookups are kind of tricky in such syntax and other more complicated cases, but it would make empty aggregates and alike far more simple to model