dkubb / axiom

Simplifies querying of structured data using relational algebra
https://github.com/dkubb/axiom
MIT License
459 stars 22 forks source link

Misinterpretation of the algebra for JOIN operations? #5

Closed d11wtq closed 12 years ago

d11wtq commented 12 years ago

Problem One (possibly the only issue here is unnecessary validation?).

Joining relations with no common attributes doesn't work due to an assertion in Veritas:

# say, users
r = Veritas::Relation.new(
  [ ["UNO", Integer], ["USERNAME", String] ],
  [
    [42, "Bob"]
  ]
)

# say, pets
s = Veritas::Relation.new(
  [ ["PNO", Integer], ["PNAME", String] ],
  [
    [1, "Mongo"]
  ]
)

r.join(s) #=> Veritas::InvalidHeaderError: the headers must have common attributes

From page 88 of Database in Depth:

r JOIN s

is a relation with (a) a heading that is the (set-theoretic) union of the headings of r and s and (b) a body consisting of the set of all tuples t such that t is the (set-theoretic) union of a tuple appearing in r and a tuple appearing in s.

(snip)

Third (also important), cartesian product is a special case of join, too: if n = 0, meaning there are no Y's and thus r and s have no common attributes at all, r JOIN s degenerates to r TIMES s .... snip ... Why? Because (a) the set of attributes common to r and s in this case is the empty set of attributes; (b) every possible tuple has the same value for the empty set of attributes; thus every tuple in r joins to every tuple in s, and so we get the cartesian product as stated.

Problem two (possibly solved simply by fixing the first issue):

TABLE_DEE joined with any relation, r should always yield the relation r.

Veritas::Relation.new([], [ [] ]).join(r) # same error is previously, but should simply return r

From page 89 (pre-amble about a prefix notation for Tutorial D's JOIN):

For example, the join of parts and suppliers could alternatively be expressed as follows:

JOIN { P, S }

(snip)

The join of a single relation, JOIN { r }, is just r itself; this case is perhaps not of much practical importance. Perhaps surprisingly, however, the join of no relations at all, JOIN {}, is very important indeed!and the result is TABLE_DEE.

(snip)

  • In like manner, 1 is the identity with respect to "*"; that is, for all numbers x, the expressions x * 1 and 1 * x are both identically equal to x. As a consequence, the product of no numbers is 1.
  • In the relational algebra, _TABLEDEE is the identity with respect to JOIN; that is, the join of any relation r with TABLE_DEE is simply r. As a consequence, the join of no relations is TABLE_DEE.

Let me know if I've got the wrong end of the stick or if this is indeed a bug :) I'm off to bed!

dkubb commented 12 years ago

I think you're right. Right now I only allow join between two relations where some of the attributes overlap. It raise an exception when none of the attributes overlap (product) and when all of the attributes overlap (intersection).

Granted, using the proper operation will be more efficient, I should probably just allow join to be used in all 3 cases and then let the optimizer select the most appropriate one given the headers from both relations.

To fix this I'm going to see about removing the validations, and then see if the algorithm I used for join is general purpose enough to handle all 3 cases. I can't recall if I optimized it for the one case it handles or not. Either way it should be relatively simple to change if need be.

dkubb commented 12 years ago

Also, I should note, one of the advantages to using join for all 3 cases is that as the schemas change, assuming same-named attributes refer to the same thing, the operation should "just work". Compare this with other approaches, like those used in SQL, if you add columns to both tables every query needs to be changed.

dkubb commented 12 years ago

BTW Veritas::TABLE_DEE and Veritas::TABLE_DUM are defined as constants.

d11wtq commented 12 years ago

Absolutely, it would be simpler and more correct. The requirement for the unique key ensures the "expected" thing is done for most cases. I think only once in my career have I ever needed a true natural (cross) join without any key-joining and I can't really recall what it was (it was for some one-off data crunching we were doing). Either way, making the engine do the academically defined thing is definitely better than not :P

Thanks for the tip on the constants. I assumed there would be constants somewhere, but was just doing it the long-way around for learning purposes :)

dkubb commented 12 years ago

@d11wtq btw, this has been resolved in https://github.com/dkubb/veritas/commit/e8c77ba130e4c92d02707fd1bb6fe6488ae4579e