Comparisons with KV stores (Redis and Memcached)

debasish83 commented 9 years ago

Hi Ankur,

We are hitting runtime issues in using SparkSQL as a distributed kv store through jdbc API. From our schema we can define a composite key and would like to run multiget on key which can be exposed through akka based REST API.

IndexedRDD was written for such use-cases and I would like to use it because our models are ML based.

It will be great if some comparisons with IndexedRDD with popular options like Redis/Memcached can be shown. I saw in your summit talk you showed cassandra comparisons but in our use-case the persistance is guaranteed by batch spark jobs in hive metastore and we need a KV store with fast read speed and modest write speed.

Something like Redis is a good comparison to see since it supports fast read and modest writes. I would like to know your feedbacks before I start setting up the Spark SQL as KV store, IndexedRDD and Redis comparisons.

Thanks. Deb

ankurdave commented 9 years ago

Hi Deb,

As you said, that comparison focused on IndexedRDD's read-modify-write performance against a write-optimized store (Cassandra). I haven't directly measured lookup performance across systems, but I would expect it to be comparable to Redis since there is no penalty for immutability when reading. The difference would instead come down to the choice of data structure (radix tree vs. hash table) and as I mentioned in the talk, they are similar in lookup performance depending on the workload.

debasish83 commented 9 years ago

Right now in my experiments I am finding it slower compared to Spark SQL. None of the dataset is cached. Is this expected ?

ankurdave commented 9 years ago

Do you mean you're creating an IndexedRDD instance for each get/multiget -- something like IndexedRDD(rdd).multiget(...)? In this case the overhead of building the index would probably dominate.

If you can cache the IndexedRDD, it would amortize the index building cost over the lookups. Spark SQL may also get a lot faster since it will store the data in compressed columnar format, but for enough lookups and a large enough dataset size, IndexedRDD should be faster.

debasish83 commented 9 years ago

Example query: Spark SQL select * from table where col1="x" and col2="y" limit 1;

IndexedRDD val indexedkv = IndexedRDD(df.map{row=>(row.col1+row.col2,value)}) indexedkv.lookup("x" + "y")

I will be surprised if redis is not runinng faster than SparkSQL/IndexedRDD for this use-case.

debasish83 commented 9 years ago

I will run the same experiments with caching...I am not yet sure if Redis caches all the data in memory or not...Most likely it will for read optimization

ankurdave commented 9 years ago

Also, you have to use get/multiget for IndexedRDD. I haven't actually overridden PairRDDFunctions.lookup, so it still does a full scan.

debasish83 commented 9 years ago

Also what if I only want to cache the index and not the whole data into memory...is that supported ?

ankurdave commented 9 years ago

That's a neat idea, but unfortunately it's not supported.

debasish83 commented 9 years ago

How do I cache the indexedRdd ? val indexedkv = IndexedRDD(df.map{row=>(row.col1+row.col2,value)}) indexedkv.cache() indexedkv.get("x" + "y") does not cache the whole index/dataset

debasish83 commented 9 years ago

Do I have to run indexedkv.get over every key once to load the index in memory ?

ankurdave commented 9 years ago

You can use indexedkv.cache().foreachPartition(x => {})

merlintang commented 9 years ago

From the ART paper, the range query based on the ARTindex can win against with B-tree. is it true for the RDDIndex?

debasish83 commented 9 years ago

Seems I should be able to compare with memcached for our internal datasets. Let me know if I should open a PR for benchmark. Also I would like to look into caching the index not the data for example. I can also do it as part of a Spark JIRA if plan is to merged IndexedRDD in 1.5.

debasish83 commented 9 years ago

In general does Redis/Memcached uses ARTIndex / B-Tree ? I thought they use HashTable

ankurdave commented 9 years ago

@merlintang Yes, all the read-only performance characteristics should be the same as for ART.

@debasish83 Sure, it would be nice to have a benchmark suite included. About caching the index only, one way to do it might be to split the index and data into two different RDDs, where the data RDD stores append-only arrays and the index RDD stores PART instances with offsets into the data arrays. I don't think IndexedRDD is planned to be merged into Spark as-is, so it might be best to do it using GitHub instead of JIRA.

debasish83 commented 9 years ago

Also can I move from ARTIndex to HashTable with a config in IndexedRDD ? Redis uses HashTable and if I can cache the data to memory, IndexedRDD with HashTable should further improve runtime..

merlintang commented 9 years ago

@ankurdave that is AWESOME !!

I think the storage overhead of ARTindex is not very big. if storing the index and data into two RDD, how can we collocate the indexRDD and dataRDD?

debasish83 commented 9 years ago

Dataset size: 40 GB column compressed, Raw 130 GB, Same query Spark SQL (cached) 12 s Impala (cached) 17s (composite-key, value) based IndexedRDD (cached) 320 ms (composite-key, value) lookup based on RDD (cached) 1.2s

I will finish comparisons with Memcached (Couchbase) before I close the issue. HBase/Cassandra are write optimized and I don't think they fit for this use-case.

I think IndexedRDD.lookup needs to be modified to use ARTree index. Also lookup is a bit more powerful than get since it can give SQL rows.

It might be a bit confusing to KV store users but we might be able to add lookup and multi-lookup in IndexedRDD where people need not groupBy before using IndexededRDD get (if the keys has skew, this issue is very common) and put a top to extract relevant items.

But then semantics of put and multi-put becomes confusing.

I will open up PR on index and data split idea that we discussed. Right now the cache size (data + index) is the issue and based on index cache if we can extract the data iterator from disk and can still maintain the ms runtime that will be cool.

debasish83 commented 9 years ago

But the memory requirement is very high. More than 2X. Is it expected ? Spark-SQL: 85 100% 48.3 GB IndexedRDD: 144 100% 134.2 GB

ankurdave commented 9 years ago

@debasish83 The overhead should be more like 20-30% compared to the raw data, but Spark SQL uses columnar compression for cached tables to reduce their size below the raw data size.

@merlintang I was thinking of using Spark to colocate the RDDs by ensuring that they are derived from the same base RDD (the input data). That should work regardless of the two RDDs' sizes.

debasish83 commented 9 years ago

@ankurdave are there plans of adding the index creation in Spark SQL in coming releases ? I think this functionality not only helps GraphX but helps us build a REST layer for Data and Model serving out of RDD as well...

ankurdave commented 9 years ago

@debasish83 I agree. I don't think there are plans, but it's something I'm very interested in as well.

debasish83 commented 9 years ago

@ankurdave I am sure you already thought on it but say I have a matrix factorization model in hive metastore and I have IndexedRDD constructed from user metadata and movie meta data also in hive metastore based on historical analysis (say hourly analysis, at time T I look at RDDs from previous hour). Using these 3 RDDs, I build a RecommendationService and Spray/Akka API manages N of these RecommendationService but again for each RecommendationService, I am not sure how many concurrent runJob can we start ? This issue should be very similar with Spark SQL hive thriftserver as well. I am assuming 1000 threads can be run concurrently in a FIFO mode from a SparkContext. I will test it out. Does something like this makes sense to you ? Or going towards Spark Job Server makes more sense ?

rbraley commented 8 years ago

@debasish83 any updates on your benchmarks?

amplab / spark-indexedrdd

Comparisons with KV stores (Redis and Memcached) #5