User-Defined Aggregate modules

bloom-lang / bud

Prototype Bud runtime (Bloom Under Development)

http://bloom-lang.net

User-Defined Aggregate modules #275

Open jhellerstein opened 12 years ago

jhellerstein commented 12 years ago

Assume I have a module Agger that takes a set of tuples as input and produces a single aggregate tuple as an output via some encapsulated logic.

Now, I have a set S that I'd like to partition on its first column, and evaluate Agger once per partition -- essentially use Agger as the aggregate in a group by. There's no way to do that right now.

Here's an example. It currently produces the tuple with the highest key. I'd like to modify it to group by val and produce the highest-keyed tuple per value (one for :thing and one for :thang). I want to do this without breaking the encapsulation on Agger, i.e. no fair changing the argmax clause.

require 'rubygems'
require 'bud'

module Agger
  state do
    interface :input, :stuff
    interface :output, :best
  end
  bloom do
    best <= stuff.argmax([], :key)
  end
end

class AggTest
  include Bud
  include Agger

  bootstrap do
    stuff <+ [[1,:thing], [2,:thang], [3,:thang], [4,:thang]]
  end

  bloom do
    stdio <~ best.inspected
  end
end

a = AggTest.new
a.tick

neilconway commented 12 years ago

This could be done via a more powerful module/import system: you'd need to be able to introduce new imports at runtime, in a data-dependent manner.

For now, would it suffice to just make the argmax over an extra field in the tuple? I realize that breaks the encapsulation, but it essentially permits user-defined grouping (the user can either pass a fixed value for the field to get a single output group, or encode their partitioning scheme as distinct field values).
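
To make the workaround concrete, here is a minimal plain-Ruby sketch of a grouped argmax. It is not Bud code and not Bud's implementation: `grouped_argmax` and `STUFF` are hypothetical names, and columns are addressed by index rather than by name. An empty grouping list yields a single output group, while grouping on the val column yields one winner per value.

```ruby
# Plain-Ruby stand-in for a grouped argmax (hypothetical helper, not Bud's API).
# rows:       array of tuples (arrays)
# group_cols: column indexes to partition on ([] means one global group)
# col:        column index to maximize within each group
def grouped_argmax(rows, group_cols, col)
  rows.group_by { |row| row.values_at(*group_cols) }
      .map { |_group, members| members.max_by { |row| row[col] } }
end

# Same tuples as the bootstrap block in the original example: [key, val].
STUFF = [[1, :thing], [2, :thang], [3, :thang], [4, :thang]]

p grouped_argmax(STUFF, [], 0)   # single group, like the current argmax([], :key)
p grouped_argmax(STUFF, [1], 0)  # one highest-keyed tuple per val
```

The catch raised in this thread still applies: the grouping column has to be threaded through whatever logic sits inside the module.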

jhellerstein commented 12 years ago

The workaround isn't much help in practice, as my module is dozens of lines long and has to pass the external grouping attribute through a whole lot of logic.

It probably shouldn't be part of the import system in a naive way, as that would essentially generate an import instance per group, with the group names being in the import namespace rather than in the data where they belong (for subsequent joining, shipping across the net, etc.).

We want some kind of lambda here where the outer grouping is parameterized by an inner aggregation function.
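
As a rough illustration of that idea in plain Ruby (again not Bud, and not a proposed syntax), the sketch below treats the encapsulated aggregate as a lambda and hands it to an outer grouping operator. `AGGER`, `group_apply`, `ROWS`, and `BY_VAL` are all hypothetical names introduced here.

```ruby
# Hypothetical encapsulated aggregate: a set of tuples in, one tuple out.
# Its body knows nothing about any outer grouping.
AGGER = ->(tuples) { tuples.max_by { |key, _val| key } }

# Hypothetical outer operator: partitions rows with key_fn, then applies the
# supplied aggregate once per partition. The group key ends up in the result
# data (as the hash key), not in any namespace.
def group_apply(rows, key_fn, agg)
  rows.group_by(&key_fn).transform_values { |members| agg.call(members) }
end

ROWS = [[1, :thing], [2, :thang], [3, :thang], [4, :thang]]
BY_VAL = ->(row) { row[1] }  # partition on the val column

p group_apply(ROWS, BY_VAL, AGGER)
```

The aggregate's body is untouched by the grouping, which is the encapsulation property asked for above.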

neilconway commented 12 years ago

I think what you describe at the end is equivalent to what I was suggesting: you basically want many independent instances of an operator (argmax in this case), where the number of instances/partitioning scheme depends on the data. import does precisely the same thing, except that the number of instances/partitioning scheme is fixed in the program text. As far as fetching the partition name, you can imagine adding a builtin function to return the name/ID of the current module.
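
A plain-Ruby sketch of the "one instance per partition" reading, assuming instances are created lazily as new partition values show up in the data. `AggerInstance`, `run_partitioned`, and `INPUT` are hypothetical names, and `partition_name` stands in for the imagined builtin that returns the current module's name/ID; none of this is existing Bud functionality.

```ruby
# Hypothetical stand-in for one module instance; not part of Bud.
class AggerInstance
  attr_reader :partition_name, :best

  def initialize(partition_name)
    @partition_name = partition_name  # what a "current module name" builtin might return
    @best = nil
  end

  # Same logic as the ungrouped aggregate: keep the tuple with the highest key.
  def insert(tuple)
    @best = tuple if @best.nil? || tuple[0] > @best[0]
  end
end

# Instances are created on first use, so their number depends on the data
# rather than being fixed in the program text.
def run_partitioned(rows)
  instances = Hash.new { |h, name| h[name] = AggerInstance.new(name) }
  rows.each { |row| instances[row[1]].insert(row) }
  instances.values.map { |inst| [inst.partition_name, inst.best] }
end

INPUT = [[1, :thing], [2, :thang], [3, :thang], [4, :thang]]
p run_partitioned(INPUT)
```

Emitting the partition name alongside each result puts the group identity back into the data, where it can be joined on or shipped elsewhere.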

jhellerstein commented 12 years ago

This sounds like fodder for a wholesale rethink of the module system, in which the namespace is reified in the data.