bloom-lang / bud

Prototype Bud runtime (Bloom Under Development)
http://bloom-lang.net

Bug w/ reduce + projection + dups #267

Open neilconway opened 12 years ago

neilconway commented 12 years ago

Or rather, any duplicate-sensitive operator that is passed the output of a user-defined code block must do duplicate elimination. Otherwise, redirecting the output of the code block through a scratch collection would change the semantics of the program. For example, t2 and t3 have different contents in the program below, but they shouldn't.

require "bud"

class ReduceTest
  include Bud

  state do
    scratch :t1, [:v1, :v2]
    scratch :t2, [:v]
    scratch :t3, [:v]
    scratch :dummy_scratch, [:v]
  end

  bootstrap do
    t1 <= [[1, 3], [2, 2]]
  end

  bloom do
    # Inline pipeline: the projection yields [4] twice (1+3 and 2+2), and
    # reduce folds over both duplicates, so t2 gets [[8]].
    t2 <= t1 {|t| [t.v1 + t.v2]}.reduce([[0]]) {|memo, v| [[memo[0][0] + v[0]]]}
    # Same pipeline routed through a scratch: set semantics drop the
    # duplicate [4], so t3 gets [[4]] instead.
    dummy_scratch <= t1 {|t| [t.v1 + t.v2]}
    t3 <= dummy_scratch.reduce([[0]]) {|memo, v| [[memo[0][0] + v[0]]]}
  end
end

r = ReduceTest.new
r.tick
puts r.t2.to_a.inspect  # => [[8]]
puts r.t3.to_a.inspect  # => [[4]], which should match t2 but doesn't

sriram-srinivasan commented 12 years ago

Reduce is the only operator I can think of that can produce duplicates. Are there other operators that have this behavior?

Reduce can be forced to always produce unique tuples if we always supply Hash.new as the initial value, so the output is a hash. This can be forced syntactically on the programmer by not having reduce accept an initial value at all. That is, coll.reduce {|hash, t| … } would be the only allowable syntax.
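
A plain-Ruby sketch of that proposal (hypothetical; not the actual Bud API): folding into a fresh Hash makes the result insensitive to duplicate input tuples.

# Hypothetical sketch of the proposed restriction: the accumulator is
# always a fresh Hash, so the result is a uniq'd set of k-v pairs.
fold = lambda do |tuples|
  tuples.reduce(Hash.new) {|hash, v| hash[v[0]] = true; hash}
end
p fold.call([[4], [4]])  # => {4=>true}
p fold.call([[4]])       # => {4=>true}, same output despite the dup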

neilconway commented 12 years ago

Well, projection can produce duplicates; reduce is sensitive to duplicates in its input. Reduce doesn't necessarily produce duplicates itself.

I don't think that limiting reduce to start with an empty hash will change this behavior: I can still pass a code block to the reduce that produces different values depending on whether there are duplicates.
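
For instance, a counting block (plain-Ruby illustration) still distinguishes duplicated input even though the accumulator is a hash:

# The hash *values* depend on input duplicates, even with a Hash memo.
count = lambda do |tuples|
  tuples.reduce(Hash.new(0)) {|memo, v| memo[v[0]] += 1; memo}
end
p count.call([[4], [4]])  # => {4=>2}
p count.call([[4]])       # => {4=>1}, a different result without the dup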

Offhand, reduce is the only duplicate-sensitive operator I can think of (well, count is as well, but we already do dup elim before computing count -- i.e., we compute COUNT(DISTINCT ...) in SQL terms). So we could fix this by just doing dup elim on the input to reduce (unless it is already a known-distinct value like BudCollection).
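
A plain-Ruby analogue of that fix applied to the example above (a sketch, not the actual runtime change): dedup the projected tuples before folding, and the inline pipeline agrees with the scratch-routed one.

# Dup elim on the reduce input makes both pipelines yield [[4]].
projected = [[1, 3], [2, 2]].map {|v1, v2| [v1 + v2]}  # => [[4], [4]]
p projected.uniq.reduce([[0]]) {|memo, v| [[memo[0][0] + v[0]]]}  # => [[4]]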

sriram-srinivasan commented 12 years ago

On Mar 29, 2012, at 8:50 AM, Neil Conway wrote:

> Well, projection can produce duplicates; reduce is sensitive to duplicates in its input. Reduce doesn't necessarily produce duplicates itself.

True. Ugh, this is an issue with all code blocks, isn't it? It seems to me that all code blocks whose output is not fed into a collection (i.e., a non-element) will have to be uniq'd.

> I don't think that limiting reduce to start with an empty hash will change this behavior: I can still pass a code block to the reduce that produces different values depending on whether there are duplicates.

Since the result of reduce is the hash object itself, which is a uniq'd set of k-v pairs, the code block can't force dups in the output.

Without code blocks, the operations that always produce uniq'd output are group, scan, and reduce. Meanwhile, join, sort, and each_with_index are sensitive to dups (dup input => dup output). So if there's a way to analyze code blocks for uniqueness, we may be able to identify the edges in the dataflow where dup elimination may have to be introduced.

--s
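
A quick plain-Ruby illustration of the dup-sensitivity of each_with_index noted above:

p [[4], [4]].each_with_index.to_a  # => [[[4], 0], [[4], 1]]
p [[4]].each_with_index.to_a       # => [[[4], 0]] -- dup input, different output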

neilconway commented 12 years ago

To fix reduce, shouldn't it be sufficient to just do duplicate elimination on its input (unless we can prove the input arrives from a known-to-be-dup-free source, like a scanner)?

In the case of join, I'd argue it isn't really duplicate-sensitive: dup input => dup output, but you should get the same result regardless of where duplicate elimination is done (i.e., you can eliminate dups from either the join input or its output).
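
A plain-Ruby illustration of that claim (hypothetical join helper, not Bud code): for an equijoin, eliminating dups on the inputs or on the output yields the same relation.

# Equijoin on the first column; dedup before and dedup after agree.
join = lambda do |l, r|
  l.product(r).select {|lt, rt| lt[0] == rt[0]}.map {|lt, rt| lt + rt}
end
left  = [[1, "a"], [1, "a"], [2, "b"]]
right = [[1, "x"], [2, "y"]]
p join.call(left, right).uniq == join.call(left.uniq, right.uniq)  # => true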