dkubb / axiom

Simplifies querying of structured data using relational algebra
https://github.com/dkubb/axiom
MIT License
459 stars 22 forks source link

Sort for nil values #26

Closed paneq closed 10 years ago

paneq commented 11 years ago
relation = Veritas::Relation.new(
  [ [ :id, String ], [ :name, String ], [ :color, String ], [ :weight, Float ], [ :city, String ] ],
  [
    [ 'P1', 'Nut',   'Red',   12.0, 'London' ],
    [ 'P2', 'Bolt',  'Green', 17.0, 'Paris'  ],
    [ 'P3', 'Screw', 'Blue',  17.0, 'Oslo'   ],
    [ 'P4', 'Screw', 'Red',   14.0, 'London' ],
    [ 'P5', 'Cam',   'Blue',  12.0, 'Paris'  ],
    [ 'P6', 'Cog',   'Red',   19.0, nil ],
  ]
)

ordered = relation.summarize( relation.project([:city]) ){ |r| r.add(:sum, r.weight.sum) }.sort_by do |r| 
  [r.city.asc, r.sum.asc]
end.to_a

This leads to an exception:

NoMethodError: undefined method `nonzero?' for nil:NilClass
    from veritas-0.0.7/lib/veritas/relation/operation/order/direction_set.rb:95:in `block in cmp_tuples'
    from veritas-0.0.7/lib/veritas/relation/header.rb:124:in `block in each'
    from veritas-0.0.7/lib/veritas/relation/header.rb:124:in `each'
    from veritas-0.0.7/lib/veritas/relation/header.rb:124:in `each'
    from veritas-0.0.7/lib/veritas/relation/operation/order/direction_set.rb:94:in `reduce'
    from veritas-0.0.7/lib/veritas/relation/operation/order/direction_set.rb:94:in `cmp_tuples'
    from veritas-0.0.7/lib/veritas/relation/operation/order/direction_set.rb:73:in `block in sort_tuples'
    from veritas-0.0.7/lib/veritas/relation/operation/order/direction_set.rb:73:in `sort'
    from veritas-0.0.7/lib/veritas/relation/operation/order/direction_set.rb:73:in `sort_tuples'
    from veritas-0.0.7/lib/veritas/relation/operation/order.rb:92:in `each'

I think nil values should always be sorted as "greater" than any non-nil value. What do you think ?

paneq commented 11 years ago

Here is my current workaround:

Veritas::Relation::Operation::Order::Ascending.class_eval do
  def self.call(left, right)
    left <=> right || (-1 if right.nil?) || (1 if left.nil?)
  end
end
Veritas::Relation::Operation::Order::Descending.class_eval do
  def self.call(left, right)
    left, right = right, left
    left <=> right || (-1 if right.nil?) || (1 if left.nil?)
  end
end
paneq commented 11 years ago

Oh, And I checked that the problem occurs also in master branch

dkubb commented 11 years ago

Sorry for not responding to this earlier. I'm still trying to sort out precisely how to handle this.

Officially relational algebra doesn't really provide many ways to handle NULL (nil) values. In fact it's pretty much regarded that NULL values are the result of a mistake at the data modelling phase. I think in reality there are times where it does happen, and we have to make sure we handle it sanely.

The problem is that some operations have undefined behaviour when a NULL value occurs. You can fall back to ruby-like semantics, like that nil won't equal anything other than another nil. But there are other cases where it's not so clear-cut. The worst possible outcome is that you have to rely on 3 valued logic, as opposed to boolean logic, which makes everything significantly more complex.

What I need to do is look at every operation and see if there's a nice way to handle it, or if I have to raise assertions when it occurs.

paneq commented 11 years ago

I totally understand. IMHO we should handle things the same way SQL-databases does. That is what I expected at least. Also, this ticket is not about trying to fix it everywhere. Step by step is better approach. I only ask you to consider fixing it for sort_by. Thank for writing this gem and for deeply considering this issue.

ahamid commented 11 years ago

FWIW I don't think SQL (and SQL database products) specifies a standard semantic for ordering nulls either http://en.wikipedia.org/wiki/Order_by

dkubb commented 11 years ago

@ahamid yeah, some databases sort nulls first, some sort them last. Some will default to one, but provide a non-standard way to specify another.

Just handling nil on sorting isn't a proper solution because it pushes the problem elsewhere.

I'm still trying to decide if I'll provide a way to specify the preference, or if I'll just provide a way to specify default values for nil columns. Maybe it could be something like:

relation_with_nil_city.defaults(:city => '').sort_by { |r| [r.city.asc, r.sum.asc] }

Or maybe I should just say to restrict the tuples with a nil city, and that if you have nil values the behaviour is undefined at the moment.

It doesn't sound like a big deal, but handling of NULL values is probably the largest unsolved issue in relational algebra. The SQL approach is completely out, since makes things impossibly more complex (to the point where I would say no database gets it perfectly right, despite 30+ years of research and billions of dollars in investment). I don't know what the solution is, but I'll need to do more research to see if anyone has a good solution for this.

paneq commented 11 years ago

Everything is better than current exception NoMethodError: undefined methodnonzero?' for nil:NilClass`.

That would be also cool:

relation.sort_by { |r| [r.city.asc.default(""), r.sum.asc] }
dkubb commented 10 years ago

Sorry for leaving this open for so long.

I think I'm going to use the default behaviour of PostgreSQL as a guide since it is our reference database on the RDBMS side of things. This means defaulting to sorting nil last.

dkubb commented 10 years ago

@paneq thanks for submitting this. Your original use case passes:

relation = Axiom::Relation.new(
  [ [ :id, String ], [ :name, String ], [ :color, String ], [ :weight, Float ], [ :city, String ] ],
  [
    [ 'P1', 'Nut',   'Red',   12.0, 'London' ],
    [ 'P2', 'Bolt',  'Green', 17.0, 'Paris'  ],
    [ 'P3', 'Screw', 'Blue',  17.0, 'Oslo'   ],
    [ 'P4', 'Screw', 'Red',   14.0, 'London' ],
    [ 'P5', 'Cam',   'Blue',  12.0, 'Paris'  ],
    [ 'P6', 'Cog',   'Red',   19.0, nil ],
  ]
)

ordered = relation.summarize( relation.project([:city]) ){ |r| r.add(:sum, r.weight.sum) }.sort_by do |r| 
  [r.city.asc, r.sum.asc]
end.to_a

=> [{:city=>"London", :sum=>26.0},
 {:city=>"Oslo", :sum=>17.0},
 {:city=>"Paris", :sum=>29.0},
 {:city=>nil, :sum=>19.0}]
paneq commented 10 years ago

@dkubb Nice to hear that. Thank you for your work on that.