calrissian / accumulo-recipes

Recipes & cookbooks for Accumulo.
http://www.calrissian.org
Apache License 2.0

Feature Store #62

Open cjnolet opened 10 years ago

cjnolet commented 10 years ago

I've abstracted the current metrics store up a layer so that it can serve as a generic store for features that have been extracted from entities. These features can include histograms of various data points used to build learning models, statistical summaries like those in the current metrics store, and other things.

I wanted one set of Accumulo storage services to hold the various types of features (ideally on the same tablet) so that a server-side iterator can perform specialized analytics and correlations across different features of an entity before the data is brought back to a client.

To do this, I've added a new index. The current metrics store index (with group\0date in the row id) is nice when you don't expect too many unique types/names associated with the group. In practice, I found that with a large fabric of entities from which I'm extracting features (into the millions of unique type/name pairs), even a batch scan over the metrics store for several metrics was very slow.

The new index swaps the group and the type so that the type is in the row id. A batch scan through, say, 72 hours of metrics for several group/type/name combinations now comes back in less than a second, even with millions of unique group/type/name combinations.

I've also kept the index with the group in the row id, because it's much more efficient for bulk (map/reduce) scans where I can slurp up an entire group of metrics with a simple range scan.
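To make the two layouts concrete, here is a minimal sketch of how the row ids might be built and range-scanned. It is not the store's actual code: the type-first layout `type\0date`, the `yyyyMMddHH` date strings, and all class and method names are assumptions for illustration, and a `TreeMap` stands in for Accumulo's sorted key space rather than using the Accumulo client API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

class FeatureRowIds {
    // Null byte delimiter, as in the group\0date row id described above.
    static final char NULL = '\u0000';

    // Group-first layout of the existing metrics index: group\0date
    static String groupFirstRow(String group, String date) {
        return group + NULL + date;
    }

    // Type-first layout (assumed shape of the new index): type\0date
    static String typeFirstRow(String type, String date) {
        return type + NULL + date;
    }

    // Emulates a simple range scan over sorted keys: returns the values of
    // every row that starts with prefix\0, in key order.
    static List<String> scanPrefix(SortedMap<String, String> table, String prefix) {
        String start = prefix + NULL;              // inclusive lower bound
        String end = prefix + (char) (NULL + 1);   // exclusive upper bound
        return new ArrayList<>(table.subMap(start, end).values());
    }

    public static void main(String[] args) {
        SortedMap<String, String> table = new TreeMap<>();
        table.put(groupFirstRow("web", "2014010100"), "web hour 0");
        table.put(groupFirstRow("web", "2014010101"), "web hour 1");
        table.put(groupFirstRow("webapp", "2014010100"), "webapp hour 0");

        // Slurps up the whole "web" group without touching "webapp".
        System.out.println(scanPrefix(table, "web"));
    }
}
```

The delimiter is what keeps a scan for one group (or type) from spilling into another whose name merely shares the same prefix.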

cjnolet commented 10 years ago

This is partially done. The only part still missing is registering a custom combine function and tying it to a "feature name" that gets placed in the column family (concatenated with the timeunit). The nice thing about this format is that it would co-locate all the features for a specific "type" on the same tablet, which further allows server-side iterators to combine several features in different ways.
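As a rough illustration of the missing piece, the sketch below keeps a registry of combine functions keyed by feature name and picks one by parsing the feature name out of the column family. Everything here is hypothetical: the `CombineRegistry` class, the null-byte delimiter between feature name and timeunit, and the use of `Long` values are assumptions, and in a real deployment this logic would presumably live in a server-side iterator rather than in plain client code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

class CombineRegistry {
    // Assumed delimiter between the feature name and the timeunit.
    static final String DELIM = "\u0000";

    private final Map<String, BinaryOperator<Long>> functions = new HashMap<>();

    // Associates a combine function with a feature name.
    void register(String featureName, BinaryOperator<Long> fn) {
        functions.put(featureName, fn);
    }

    // Column family = feature name concatenated with the timeunit.
    static String columnFamily(String featureName, String timeUnit) {
        return featureName + DELIM + timeUnit;
    }

    // Looks up the function for the feature named in the column family
    // and folds the values with it.
    long combine(String columnFamily, List<Long> values) {
        int idx = columnFamily.indexOf(DELIM);
        if (idx < 0) {
            throw new IllegalArgumentException("Column family has no delimiter");
        }
        String featureName = columnFamily.substring(0, idx);
        BinaryOperator<Long> fn = functions.get(featureName);
        if (fn == null) {
            throw new IllegalArgumentException("No combine function for " + featureName);
        }
        return values.stream().reduce(fn)
                .orElseThrow(() -> new IllegalArgumentException("No values to combine"));
    }

    public static void main(String[] args) {
        CombineRegistry registry = new CombineRegistry();
        registry.register("count", Long::sum);
        registry.register("max", Long::max);

        String cf = columnFamily("count", "HOURS");
        System.out.println(registry.combine(cf, Arrays.asList(1L, 2L, 3L)));
    }
}
```

Keying the function on the feature name means different features stored under the same row can each be reduced with their own semantics.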