enspirit / bmg

Bmg, Alf's successor, A Ruby Relational Algebra
MIT License
234 stars 8 forks source link

InMemory::Relation and SQL::Relation return different query results #21

Closed amw-zero closed 3 years ago

amw-zero commented 4 years ago

I came across a surprising behavior, and I wanted to verify that it is in fact a bug before attempting to fix (I'd be more than happy to submit a patch if this is agreed to be a bug). The bug is that the same query run on an InMemory::Relation and a Sql::Relation returns different results.

Consider the following relations:

transactions = Bmg::Relation.new([
  { id: 11, amount: 100, name: 'First Paycheck' },
  { id: 12, amount: 200, name: 'Second Paycheck' },
  { id: 13, amount: -100, name: 'Internet Bill' },
])

tags = Bmg::Relation.new([
  { id: 1, transaction_id: 11, name: 'income' },
  { id: 2, transaction_id: 12, name: 'income' },
  { id: 3, transaction_id: 13, name: 'bill' },
])

transactions is modeling some financial transactions, and tags is modeling the ability to tag a transaction with some semantic name, like "income." I'm looking to query all transactions for a given tag. I expressed this with the following query:

transactions.join(tags.restrict(name: 'income'), { id: :transaction_id }).

Running this on these in-memory relations returns the right result:

[
  {:id=>11, :name=>"First Paycheck", :amount=>100}
  {:id=>12, :name=>"Second Paycheck", :amount=>200}
]

I get back the two transactions who are tagged as "income."

If you create the same relation in postgres and run the query on relations produced with Bmg.sequel(:transactions, ...), the query returns no results. This is because it produces the following SQL:

SELECT "t1"."id", "t1"."amount", "t1"."name" 
FROM "transactions" AS "t1" 
INNER JOIN "tags" AS "t2" ON ("t1"."id" = "t2"."id")
WHERE ("t2"."name" = 'income')

If you look at the join clause, the ON condition is: (t1.id = t2.id), but by specifying { id: :transaction_id } in the BMG join operator, I expected the produced ON condition to be: (t1.id = t2.transaction_id).

I can get the produced SQL to be what I expect by reversing the join, i.e. by doing:

tags.restrict(name: 'income').join(transactions, { transaction_id: :id }).

But because joins in relational algebra are commutative, I would expect both of these to produce the same result. I may be misunderstanding the Hash parameter that gets passed into the join operator though.

I created a repo that sets up and exposes the bug if someone would like to play around with it in more depth: https://github.com/amw-zero/bmg_query_bug.

Is this a bug or am I using the join operator incorrectly?

amw-zero commented 4 years ago

@blambeau any thoughts here?

blambeau commented 4 years ago

@amw-zero first of all, sorry for the delay. I haven’t received a notification or email when you opened the PR.

It’s (most probably) a bug, probably related to the syntactic sugar allowing to pass a join hash. The thing is that the key => value order in such hash is both very important ... and very counterintuitive

I don’t even remember myself what is the correct order, but I bet the second is right, not the first one. Since it’s syntactic sugar for:

transactions.join(tags.restrict(name: 'income').allbut([:id]).rename(:transaction_id => :id)).

Can you tell me if that works well?

Still, that does NOT explain why the results are different (except possibly if the query is actually wrong, since behavior is undefined in such case). A few tools that can help you get to the bottom of it:

r.debug   # can be used anywhere to inspect the query tree
r.with_typecheck # not sure exactly (from memory), but the feature exists and is supposed to detect wrong queries and complain
amw-zero commented 4 years ago

No problem about the delay, I figured that was the case.

The query that you suggested didn't exactly work, but works with a tiny modification (explicitly specifying the column to join on, i.e. [:id]). And more importantly it made me realize what the problem with my initial query was: both relations have an id column. You used .allbut([:id]) on the tags relation so that transaction_id could be renamed to id without a collision, and that name collision seems to be what's causing the problem with renaming.

This is the query that's working just fine for me now:

transactions.join(
  tags.restrict(name: 'income').allbut([:id]), 
  { id: :transaction_id }
)

I do agree though, ideally a query returns the same result no matter what the backend is (memory, sql, etc.). In my use case, I'm using memory relations during testing to test the logic of my queries. Doing this has been very convenient so far, since I don't need to create arbitrary repository objects. But this was the first problem I ran into where the memory relation and sql relations were inconsistent.

blambeau commented 4 years ago

Hi @amw-zero, I wrote https://github.com/enspirit/bmg/pull/22 to make sure that our guesses and findings are right. They are.

  1. The original query is actually wrong (rename is not expected to override an existing attribute). Indeed, in such case, the SQL and Memory results are different. Possibly that's a problem.

  2. Various corrected approaches work fine. I suggest you to execute the rspec and look at the resulting SQL. You might be surprised to see the generated DISTINCT with a join strategy, and could be interested in the matching approach instead.

  3. I checked that the type checker complains about the original query: it does, confirming 1.

I'm not completely sure it's worth fixing the bug. Yet, possibly the join/matching syntactic sugar could go a bit further when attributes are known, and generate the necessary allbut for you. WDYT?

amw-zero commented 4 years ago

That PR is extremely helpful. Many thanks. I was unaware of the typechecking capability, and since that catches the error I think that is totally fine. I can arrange my tests so that they type check the memory relations and hopefully that will catch an error like this earlier on.

Ultimately I was just writing an ambiguous query! So I do agree that fixing the "bug" may not be worth it. This case is quite common (joining 2 tables with an id column), but implicitly adding the allbut might be too overbearing.

Thanks so much for the help! I'm really enjoying using the library so far, and will definitely report any other issues that I find.