bloom-lang / bud

Prototype Bud runtime (Bloom Under Development)
http://bloom-lang.net

schema'd tables should play nice with k/v tables #99

Open palvaro opened 13 years ago

palvaro commented 13 years ago

I'm coming around to the schema-optional feature. in addition to not requiring schemas when they're not wanted, they make it easy for interposed dataflows (eg delivery) to be agnostic to the schemas of tuples transiting through them. but for the feature to be useful, the behavior of passing schema'd tuples through k/v tables should be intuitive.

in the program below, I would expect to be able to 'recover' the 4-tuple schema of :one in :three, and for the key columns of :one ([:a, :b]) to correspond to the key of :two ([:key]). but this is not the case: the program throws this error:

in `raise_pk_error': Key conflict inserting ["foo", "bar", ["baz", "qux"], nil] into "two": existing tuple ["foo", "bar", ["baz", "qux"]], key_cols = "foo"

{{{
module SchemaStuff
  include BudModule

  state do
    interface input, :one, [:a, :b] => [:c, :d]
    scratch :two
    interface output, :three, [:a, :b, :c, :d]
  end

  declare
  def logic
    two <= one
    three <= two
    stdio <~ three.inspected
    #stdio <~ two.inspected
  end
end

class ScSt
  include Bud
  include SchemaStuff
end

s = ScSt.new
s.one <+ [['foo', 'bar', 'baz', 'qux']]
s.tick
s.three.each do |t|
  puts "T: #{t.inspect}"
end
}}}

neilconway commented 13 years ago

Well, this isn't really a bug: as they currently stand, we don't have "schema-less tables" -- we have a default schema that is supplied if you omit one from the program syntax. Hence, "schema-less tables" should behave identically to a table with an explicit [:key] => [:val] schema.

ISTM that supporting the program above would basically require allowing a collection to contain a heterogeneous collection of tuples, which seems like a bad thing.

palvaro commented 13 years ago

I am ok with not calling it a bug, but I think the behavior I describe is desirable.

allowing a collection to contain a heterogeneous collection of tuples

I don't agree. I am proposing that we treat 'schema not supplied' tables not as 2-arity tables but as key-value pairs whose contents may be composite. in my example, we don't prevent a user from projecting a 4-arity schema'd tuple onto a k/v table, but we do something counterintuitive: naively assume the 1st column is the key, the 2nd is the value, and drop the rest on the floor. this is pretty useless behavior, whereas mapping the key columns of :one to :two.key and the value columns to :two.value makes sense and allows us to easily transit schema'd tuples through dataflows written in the kvs style (without having to pack and unpack them).
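to make the proposed coercion concrete, here is a sketch in plain Ruby (`coerce_to_kv` is an invented name for illustration, not Bud's actual implementation): regroup a schema'd tuple by its key arity instead of truncating it positionally.

```ruby
# Hypothetical sketch of the proposed coercion: when a tuple from a relation
# with a known schema flows into a default ([:key] => [:val]) table, bundle
# its key columns into :key and its value columns into :val.
def coerce_to_kv(tuple, key_arity)
  key = tuple[0...key_arity]   # e.g. the [:a, :b] columns of :one
  val = tuple[key_arity..-1]   # e.g. the [:c, :d] columns of :one
  [key, val]
end

# a tuple of :one, whose schema is [:a, :b] => [:c, :d]
coerce_to_kv(['foo', 'bar', 'baz', 'qux'], 2)
# => [["foo", "bar"], ["baz", "qux"]]
```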

jhellerstein commented 13 years ago

Hrmm. A K/V store is a K/V store. In many cases you need to set up the K. Is the following so bad?

two <= [[one.a, one.b], [one.c, one.d]]

The alternative you propose would only work some of the time, when the RHS is a BudCollection with a schema. This probably isn't the common case -- we use lots of maps on the rhs.

That said, I'm perfectly open to revisiting the schema coercion logic as long as it is uniform and easy to describe/remember.

palvaro commented 13 years ago

The alternative you propose would only work some of the time, when the RHS is a BudCollection with a schema.

true. map expressions should probably be validated, though: they should provide no more columns than the LHS can accept. going back to my original example, let R be a 2-arity table, S be a 4-arity table, and T be a 4-arity map expression. the current behavior is to interpret R <= S as R <= [S[0], S[1]]. I am proposing treating it as R <= [[S[0], S[1]], [S[2], S[3]]] -- that is, use S's schema info to resolve what appears to be a schema mismatch between lhs and rhs. something like R <= T should raise an error: this schema mismatch has no resolution. if we don't support this, we should probably reject R <= S too.

neilconway commented 13 years ago

Does this generalize? It seems a bit kludgy to me.

For example, why do we assign S[0] and S[1] to the first field of R and S[2] and S[3] to the second field? That seems completely arbitrary to me. What should we do if S has 5 or more fields? Does the semantics of the statement depend on the key columns of the LHS? That seems weird.

palvaro commented 13 years ago

this is what I am proposing:

1. if the arity of rhs and lhs match, do the obvious thing.
2. if not, and the RHS is an expression, raise an error (the mismatch has no resolution).
3. if not, and the RHS is a relation (with a schema), use the RHS's schema to map its key columns onto the LHS key and its value columns onto the LHS value.

For example, why do we assign S[0] and S[1] to the first field of R and S[2] and S[3] to the second field?

because we are following rule 3 above and this is the most reasonable thing to do (other than to raise an error, which would be better than what we do now)
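purely as illustration, the rule dispatch I have in mind might look like this in plain Ruby (`SchemaMismatch` and `coerce` are invented names, and this assumes the LHS is a 2-arity k/v table; this is not Bud's API):

```ruby
# Illustrative dispatch for the three coercion rules; not Bud internals.
class SchemaMismatch < StandardError; end

# rhs_key_arity is non-nil only when the RHS is a relation with a known schema.
def coerce(lhs_arity, rhs_tuple, rhs_key_arity = nil)
  return rhs_tuple if rhs_tuple.size == lhs_arity    # rule 1: arities match
  raise SchemaMismatch if rhs_key_arity.nil?         # rule 2: bare expression, no resolution
  # rule 3: regroup by the RHS relation's schema (LHS assumed to be k/v, arity 2)
  [rhs_tuple[0...rhs_key_arity], rhs_tuple[rhs_key_arity..-1]]
end

coerce(2, ['foo', 'bar', 'baz', 'qux'], 2)
# => [["foo", "bar"], ["baz", "qux"]]
```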

neilconway commented 13 years ago

Distinguishing between expressions and relations on the RHS seems very kludgy to me: an expression should be relation-valued (or equivalently, a relation name is just the identity expression). Having "foo" on the RHS behave differently from the identity map of "foo" on the RHS is fugly, IMHO.

Anyway, even if we go down the road of treating expressions and relation identifiers differently, I'm not convinced that this generalizes in a sensible way. What do we do if the RHS has 3 key columns and 4 non-key columns, and the LHS has 5 key columns and 5 non-key columns?

palvaro commented 13 years ago

What do we do if the RHS has 3 key columns and 4 non-key columns, and the LHS has 5 key columns and 5 non-key columns?

well, according to the procedure I described above, we'd just nil-pad the result (as we already do).

to your broader point, though, perhaps the special case behavior I am asking for is only appropriate when the lhs has an unspecified (and hence 2-arity) schema. then the mapping is always obvious.

neilconway commented 13 years ago

Whoops, I mixed up RHS and LHS above -- what should we do if the RHS has 5 key columns and 5 non-key columns, and the LHS has 3 and 4?

Anyway, I think special-casing schema-unspecified tables is a kludge (why should they be special? Schema-unspecified should just macro-expand to [:key] => [:val].) You could make a reasonable case for special-casing any arity two table with a single key column, but it still seems weird to me. IMHO adding special auto-nesting-magic for certain classes of collections is probably not going to improve the readability of programs.

jhellerstein commented 13 years ago

I refuse to throw schema exceptions. We don't want to get pigeonholed as a "structured database approach". Thanks to MapReduce and KVSs, it's quite common for people to do lots of array pack/unpack logic -- we should embrace that. The risk is weird downstream logic errors resulting from simple catchable upstream type errors. But I think that's in the Ruby spirit at least. Down the road we could have schema "enforcement" be a flag at some scope -- e.g. in state declaration, or rule declaration, or even as a mode in a program.

Anyhow, short term I think I see 3 useful flexible paths here:

1) what I did already -- coerce schemas by position, and catch "extra" fields on the right as an extra "data bag" column on the left so you don't lose info.

2) if you have simple K/V on the left, you coerce the key from the right to left, and bundle the fields from the right into an array on the left.

3) alternatively, do (1) twice -- once for keys, once for values. I.e. if the rhs has more key cols than the lhs, you bundle up the extras into an extra databag column in the lhs key, and the same for values. This degenerates to (2) in the case of a single-column key on the left.

I'm willing to allow (2) as a special case. I'm also willing to explain (3) as a general principle. Main concern is explaining it simply, and the fact that in (3) especially, positional offsets will become messy.
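a rough sketch of what (3) would mean, in plain Ruby (`bag_coerce` and `coerce_option3` are invented names for illustration, not anything in Bud): coerce keys and values separately, bagging surplus RHS columns and nil-padding when the RHS side is narrower.

```ruby
# Hedged sketch of option (3): fit one side (keys or values) of an RHS tuple
# onto the corresponding LHS side, bundling extras into a single databag column.
def bag_coerce(cols, lhs_arity)
  if cols.size <= lhs_arity
    cols + [nil] * (lhs_arity - cols.size)                   # pad the short side
  else
    cols[0...(lhs_arity - 1)] + [cols[(lhs_arity - 1)..-1]]  # bag the extras
  end
end

def coerce_option3(rhs_keys, rhs_vals, lhs_key_arity, lhs_val_arity)
  bag_coerce(rhs_keys, lhs_key_arity) + bag_coerce(rhs_vals, lhs_val_arity)
end

# rhs has 3 key cols and 2 value cols; lhs has 2 and 1:
coerce_option3(%w[a b c], %w[x y], 2, 1)
# => ["a", ["b", "c"], ["x", "y"]]
```

with a single-column key on the left, `bag_coerce` bundles the entire RHS key into one array, which is exactly option (2); the messy part, as noted, is that positional offsets on the LHS no longer line up with the RHS schema.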

palvaro commented 13 years ago

not going to improve the readability of programs.

I disagree (see below; I find option 2 much easier to read and write, and unambiguous), but I'll let it go. at least grant me that arity(rhs) > arity(lhs) should raise an error rather than drop attributes on the ground.

{{{
module SomeIndirection
  state do
    interface input, :iin
    interface output, :iout
  end
end

module IndirectionUser
  state do
    [...]
    scratch :r, [:w, :x, :y] => [:z]
    scratch :s, [:w, :x, :y] => [:z]
  end

  bloom do
    # transiting option 1:
    iin <= r.map {|t| [[t.w, t.x, t.y], [t.z]]}
    s <= iout.map {|t| [t.key[0], t.key[1], t.key[2], t.val[0]]}
    # transiting option 2:
    iin <= r
    s <= iout
  end
end
}}}

neilconway commented 13 years ago

Pete: Couldn't you use BudCollection#keys and #values to simplify this?

iin <= r.map {|t| [t.keys, t.values]}
s <= iout.map {|t| (t.keys + t.values).flatten}
jhellerstein commented 13 years ago

shall we resolve for FaF?

neilconway commented 13 years ago

Personally I think we're fine here for FaF...

On Tuesday, March 29, 2011, jhellerstein reply@reply.github.com wrote:

shall we resolve for FaF?


palvaro commented 13 years ago

yep, untagged.

neilconway commented 13 years ago

Re: tagging this for 0.0.4, we should probably talk about what is to be fixed here. IMHO the current behavior is probably okay.