calrissian / accumulo-recipes

Recipes & cookbooks for Accumulo.
http://www.calrissian.org
Apache License 2.0

[QFDStores] Provide a mechanism for dumping out full nested schemas #280

Open cjnolet opened 9 years ago

cjnolet commented 9 years ago

The JSONTupleStore class gave us the ability to ingest documents from raw, arbitrarily nested JSON. The KeyValueIndex provides both cardinality information and key information based on the types of objects. What it does not provide is information about the nesting of objects that may have been flattened.

The direction I can see this going is an optional key in the index table (or maybe even a different table called storeType_schema) that would allow the schemas to be aggregated together as events are being persisted. A good example of how this is possible is the StructType class in Spark SQL. That class basically holds a sequence of StructField objects, and a StructField's data type can itself be another StructType with its own StructFields. This object marshals easily to and from JSON and would be a great starting point for a separate schema table.
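A minimal sketch of the Spark SQL schema model mentioned above (the field names are just illustrative): a StructField can carry another StructType as its data type, so arbitrary nesting is representable, and the whole tree round-trips through JSON.

```scala
import org.apache.spark.sql.types._

// Nested "object" modeled as a StructType inside a StructField.
val location = StructType(Seq(
  StructField("city", StringType),
  StructField("coords", ArrayType(DoubleType))
))

val eventSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("location", location)   // nesting point
))

// Marshals to and from JSON, which is what makes it a natural
// candidate for a value in a separate schema table.
val json = eventSchema.json
val roundTripped = DataType.fromJson(json).asInstanceOf[StructType]
```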

The nice thing about the schemas as well is that they can be merged together with a simple set union.
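A hedged sketch of that "set union" merge, assuming fields present in either schema are kept and nested StructTypes are merged recursively (conflict handling and field ordering are ignored here):

```scala
import org.apache.spark.sql.types.{StructField, StructType}

def mergeSchemas(a: StructType, b: StructType): StructType = {
  val byName = a.fields.map(f => f.name -> f).toMap
  val merged = b.fields.foldLeft(byName) { case (acc, bf) =>
    acc.get(bf.name) match {
      case Some(af) => (af.dataType, bf.dataType) match {
        case (an: StructType, bn: StructType) =>
          // Both sides are objects: union their fields recursively.
          acc.updated(bf.name, af.copy(dataType = mergeSchemas(an, bn)))
        case _ => acc                         // keep existing field; conflicts handled elsewhere
      }
      case None => acc.updated(bf.name, bf)   // field only present in the other schema
    }
  }
  StructType(merged.values.toSeq)
}
```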

cjnolet commented 9 years ago

I think this may be good for something like a "schemaStore". Once a tupleStore has been flattened, it wouldn't take much to introspect the flattened object and figure out the nesting points.
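A rough sketch of that introspection, assuming a dot-delimited flattening convention (the delimiter accumulo-recipes actually uses may differ); leaf types are collapsed to StringType here just to show the shape of the recovered tree:

```scala
import org.apache.spark.sql.types._

def schemaFromFlattenedKeys(keys: Seq[String]): StructType = {
  // Split each key into its first segment and the remainder, then group by segment.
  val grouped = keys.map(_.split("\\.", 2)).groupBy(_.head)

  StructType(grouped.toSeq.map { case (name, parts) =>
    val children = parts.collect { case Array(_, rest) => rest }
    if (children.isEmpty) StructField(name, StringType)           // leaf key
    else StructField(name, schemaFromFlattenedKeys(children))     // nesting point
  })
}

// schemaFromFlattenedKeys(Seq("id", "location.city", "location.coords"))
// yields (field order aside) struct<id:string, location:struct<city:string, coords:string>>
```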

cjnolet commented 9 years ago

I have this coded up in Scala, actually. It could probably stay that way, considering the format of the schemas depends on the Spark API. It's got an Accumulo combiner that will merge two schemas together. There's also a utility class that will find conflicts, that is, cases where something went from a primitive value like a String to a container type like an Array or Object, or vice versa.
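A sketch of what such a conflict-finding utility could look like (names and structure are illustrative, not the actual code): it walks two schemas in parallel and reports the paths where a field changed kind between primitive and container.

```scala
import org.apache.spark.sql.types._

object SchemaConflicts {

  private def kind(dt: DataType): String = dt match {
    case _: StructType => "object"
    case _: ArrayType  => "array"
    case _             => "primitive"
  }

  /** Returns dotted paths where the two schemas disagree on primitive vs container. */
  def find(left: StructType, right: StructType, prefix: String = ""): Seq[String] =
    left.fields.toSeq.flatMap { lf =>
      right.fields.find(_.name == lf.name).toSeq.flatMap { rf =>
        val path = if (prefix.isEmpty) lf.name else s"$prefix.${lf.name}"
        (lf.dataType, rf.dataType) match {
          case (l: StructType, r: StructType) => find(l, r, path)   // recurse into nested objects
          case (l, r) if kind(l) != kind(r)   => Seq(path)          // e.g. String -> Array, Object -> String
          case _                              => Seq.empty
        }
      }
    }
}
```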

cjnolet commented 9 years ago

The more I've thought about this, I'm realizing that the only difference between Spark's StructType (schema) object and looking up the keys for an entity/event type in our index table is that the StructType shows which portions of the keys are arrays/collections. We may be able to do something very similar on our own without being tied to Spark, but at the same time, it may not matter from the SQL layer (depending on the queries we need to do).