OryxProject / oryx

Oryx 2: Lambda architecture on Apache Spark, Apache Kafka for real-time large scale machine learning
http://oryx.io
Apache License 2.0

Support for Parquet format instead of just SequentialFile for HDFS internal data #193

Closed: dsdinter closed this issue 9 years ago

dsdinter commented 9 years ago

Hi, just wondering if this might benefit performance in the Speed and Batch layers for some specific use cases (e.g. feature vectors for ALS), since Parquet leverages a columnar storage format over sparse matrices.

Happy to help as I would love to implement this framework with my current client.

Thanks.

srowen commented 9 years ago

Hm, I don't think there would be a speed improvement, since the algorithms don't consume sparse data directly from storage. For example, ALS wants (user, item, rating) tuples in memory. This usage is purely internal too, so it would not, for example, be reading a subset of a much larger file with many columns. I don't know if that's good or bad news -- probably wouldn't help, but that means it's probably not something to worry about implementing either.
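
To illustrate why columnar storage doesn't buy much here, a minimal sketch (Spark MLlib's Scala API with a made-up HDFS path, not Oryx code): ALS consumes every field of every (user, item, rating) record, so there is no column subset for a format like Parquet to prune.

```scala
// Minimal sketch, not Oryx code: ALS input is whole (user, item, rating)
// tuples, so every column of every record is read anyway.
import org.apache.spark.SparkContext
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object AlsTupleSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "AlsTupleSketch")
    // Hypothetical input path; Oryx's internal data layout is not shown here.
    val ratings = sc.textFile("hdfs:///tmp/ratings.csv").map { line =>
      val Array(user, item, rating) = line.split(',')
      Rating(user.toInt, item.toInt, rating.toDouble)
    }
    // All three fields are needed in memory, so column pruning has nothing to skip.
    val model = ALS.train(ratings, 10, 10, 0.01)
    println(model.userFeatures.count())
    sc.stop()
  }
}
```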

dsdinter commented 9 years ago

You are right. Also, going back to the ALS use case, I don't see much benefit from compression given the type of values you would expect within feature vectors. It might benefit though if you wanted to implement a new app around recommendation engines, e.g. content-based filtering before using ALS, where you may want to filter based on some reference dataset (geography, industry or type of user). You would cache the RDD as a DataFrame in memory to query from the serving layer, which I assume enables the columnar in-memory format described for SchemaRDD in Spark SQL, but you would still need to keep the DataFrame in some format in HDFS, e.g. Parquet.
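
As a rough sketch of that scenario (assuming a Spark 1.x-style SQLContext API and hypothetical reference data, nothing from Oryx itself): a DataFrame can be cached in its columnar in-memory form for repeated filtering, and separately persisted as Parquet in HDFS.

```scala
// Rough sketch with hypothetical data: cache a reference DataFrame for
// serving-style queries and persist the same data as Parquet in HDFS.
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object ParquetCacheSketch {
  case class UserProfile(user: Int, geography: String, industry: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "ParquetCacheSketch")
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Hypothetical reference dataset used to filter users before ALS.
    val profiles = Seq(
      UserProfile(1, "EMEA", "Retail"),
      UserProfile(2, "APAC", "Finance")
    ).toDF()

    profiles.cache() // columnar in-memory representation for repeated queries
    profiles.filter($"geography" === "EMEA").show()

    // Keep the same data in a columnar format on HDFS.
    profiles.write.parquet("hdfs:///tmp/user-profiles.parquet")
    sc.stop()
  }
}
```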

Anyway, I am probably mixing use cases here; I just wanted to see how you would approach combining other apps for these use cases.

Thanks!

srowen commented 9 years ago

Yeah, the idea here is that it's a framework in which you could implement a new batch, speed or serving component to mix and match with others. If you had a much more complex type of data coming across the Kafka queue, you might want to use something like Spark SQL in the batch layer to process it, yes, and then turn around and save it in Parquet format "internally" in HDFS.

That might require some framework changes, small ones I think. I suppose I am waiting for someone with that use case to help me figure out what it has to do. For example, I think the bit where it saves with saveAsNewAPIHadoopFile has to generalize.
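
For reference, a rough sketch of the contrast being discussed (hypothetical data and paths, not the actual Oryx save path): a key/value save through saveAsNewAPIHadoopFile into a SequenceFile, next to what a Parquet write via Spark SQL might look like if that part were generalized.

```scala
// Rough sketch, not the actual Oryx code: the SequenceFile-style save that
// would have to generalize, alongside a hypothetical Parquet equivalent.
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
import org.apache.spark.SparkContext
import org.apache.spark.sql.SQLContext

object SaveFormatSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[*]", "SaveFormatSketch")
    val pairs = sc.parallelize(Seq(("user1", "item1,5.0"), ("user2", "item2,3.0")))

    // Current-style save: key/value pairs written as a SequenceFile.
    pairs.map { case (k, v) => (new Text(k), new Text(v)) }
      .saveAsNewAPIHadoopFile[SequenceFileOutputFormat[Text, Text]]("hdfs:///tmp/oryx-seq")

    // Hypothetical generalization: the same pairs written as Parquet via Spark SQL.
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._
    pairs.toDF("key", "value").write.parquet("hdfs:///tmp/oryx-parquet")

    sc.stop()
  }
}
```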