Optimized hash-like data structure for collections

neilconway commented 12 years ago

We currently use a Ruby Hash to map keys to tuples in Bud collections. This works fine, but there is room to optimize Bud collections by using a custom hash-like type:

We currently store [key_cols] => tuple mappings. This is redundant, since the key column values can be extracted from the tuple when needed -- you'd just want custom hash and equality functions that would consult the collection schema to find the columns to hash on.
Extracting the key columns on every insertion could then also be avoided.
The current code extracts the key columns, checks for a matching entry in the hash table, and only then does an insertion. If the underlying storage provided an "upsert"-like method, we could avoid the need to probe the storage twice.

In the grand scheme of things, this isn't a priority but might be fun.

sriram-srinivasan commented 12 years ago

Custom hash and equality functions on the tuple don't work because the tuple doesn't know its keycols. It cannot know its keycols because we pass a tuple around by ref.

Custom hash and equality functions in the collection don't work well; I experimented with a set data structure written in Ruby with schema-specific hash and equals support, but it doesn't match the built-in Ruby hash for performance.

The fastest method I have seen so far (staying in pure Ruby) is to wrap the tuple with an object with schema-specific hash and equals, but which doesn't extract the fields. This is faster than our current method of creating an array: on jruby it is twice is fast, but marginal (10%) on ruby 1.8.7 and 1.9.3

Given, table :foo, [:k1, :k2] => [:v]

generate a key class as follows

class Foo_Key 
      def initialize(tup) 
          @tup = tup
      end
      def hash
           @hash = @tup.k1.hash * 31 + @tup.k2.hash // good to cache the hashcode.
      end
      def ==(other)
           other.class <= Foo_Key and @tup.k1 = other.tup.k1 and @tup.k2 == other.tup.k2
      end
end

neilconway commented 12 years ago

By "custom hash and equality functions", I was assuming we'd be able to control the hash/equality function APIs, so it would be easy to arrange to pass a reference to the hash table/collection when invoking the function. That's what I did in C4.

Right -- to beat the performance of the builtin Hash, we'd probably want a C extension.

bloom-lang / bud

Optimized hash-like data structure for collections #269