RedisLabs / spark-redis

A connector for Spark that allows reading and writing to/from Redis cluster
BSD 3-Clause "New" or "Revised" License

Use mget instead of pipeline get to read kv from redis #264

Open mayankasthana opened 4 years ago

mayankasthana commented 4 years ago

It is recommended to use mget to do multiple gets instead of pipelining lots of gets together. A few benchmarks also show performance improvements.

I don't see any drawback to doing it, and it might be a simple enough change. Is there anything I am missing that makes using mget nonviable?

Maybe we can do it for hmget to read hashes as well.
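For illustration, here is a minimal sketch of the two read strategies using Jedis (the client spark-redis is built on). The helper names are made up and a plain, non-cluster connection is assumed:

```scala
import redis.clients.jedis.Jedis
import scala.collection.JavaConverters._

// Current approach: one GET per key, queued on a pipeline.
def readWithPipelinedGets(jedis: Jedis, keys: Seq[String]): Seq[String] = {
  val pipeline = jedis.pipelined()
  val responses = keys.map(k => pipeline.get(k)) // queue a GET per key
  pipeline.sync()                                // flush and wait for all replies
  responses.map(_.get)
}

// Proposed approach: a single MGET round trip for the whole batch.
def readWithMget(jedis: Jedis, keys: Seq[String]): Seq[String] =
  jedis.mget(keys: _*).asScala.toSeq
```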

fe2s commented 4 years ago

Hi @gkorland @itamarhaber, could you please share your opinion on this?

gkorland commented 4 years ago

@mayankasthana notice that MGET is only relevant to String values, not Hashes. HMGET, on the other hand, retrieves multiple values of a single key; it is similar to HGETALL but with a way to limit the fields retrieved from the Hash.
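To make the distinction concrete, a small illustration with Jedis (key and field names are made up):

```scala
import redis.clients.jedis.Jedis

val jedis = new Jedis("localhost", 6379)

// MGET: several independent String keys, one value per key.
val names = jedis.mget("user:1:name", "user:2:name", "user:3:name")

// HMGET: a single Hash key, returning only the requested fields
// (HGETALL would return every field of the hash).
val fields = jedis.hmget("user:1", "name", "email")
```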

gkorland commented 4 years ago

@mayankasthana sorry, you're right. For RDD reads, MGET will probably give better latency in some scenarios; a mix of MGET and pipelining should probably work best (to avoid too-large MGETs).
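A minimal sketch of that mix, assuming a non-cluster connection and an arbitrary batch size: split the keys into bounded chunks and queue one MGET per chunk on a single pipeline, so no individual MGET grows too large.

```scala
import redis.clients.jedis.Jedis
import scala.collection.JavaConverters._

// Chunk the key set and queue one MGET per chunk on one pipeline.
def readWithBatchedMget(jedis: Jedis, keys: Seq[String], batchSize: Int): Seq[String] = {
  val pipeline = jedis.pipelined()
  val responses = keys.grouped(batchSize).map(chunk => pipeline.mget(chunk: _*)).toList
  pipeline.sync()
  responses.flatMap(_.get.asScala)
}
```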

mayankasthana commented 4 years ago

Maybe we can reuse the spark.redis.max.pipeline.size property for the MGET batch size, since in this context they are equivalent.
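For example, the batch size could be read from the same property (the fallback value of 100 here is an assumption, not necessarily the connector's documented default):

```scala
import org.apache.spark.SparkConf

// Reuse the existing pipeline-size setting as the MGET batch size.
val conf = new SparkConf()
val mgetBatchSize = conf.getInt("spark.redis.max.pipeline.size", 100)
```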

mayankasthana commented 4 years ago

So I just found out that in a cluster setup, all keys requested in one MGET must belong to the same hash slot; it is not enough that they live on the same node. I prototyped some code here that first sorts all keys by slot id, then takes spark.redis.max.pipeline.size keys at a time and creates multiple MGET calls across multiple pipelines. It works, but it looks unwieldy and doesn't seem very efficient, at least on the Spark side. I can't say for the Redis side. If there isn't much benefit on the Redis side, this might not be a good idea at all.
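Roughly, the prototype does something like the following sketch (simplified: the names and batch size are assumptions, and it calls JedisCluster.mget directly rather than the per-node pipelines in the actual prototype):

```scala
import redis.clients.jedis.JedisCluster
import redis.clients.jedis.util.JedisClusterCRC16 // lives in redis.clients.util in older Jedis versions
import scala.collection.JavaConverters._

// Group keys by hash slot (in cluster mode an MGET may only touch one slot),
// then split each slot's keys into chunks and issue one MGET per chunk.
def readBySlot(cluster: JedisCluster, keys: Seq[String], batchSize: Int): Map[String, String] = {
  val keysBySlot = keys.groupBy(k => JedisClusterCRC16.getSlot(k))
  keysBySlot.valuesIterator.flatMap { slotKeys =>
    slotKeys.grouped(batchSize).flatMap { chunk =>
      val values = cluster.mget(chunk: _*).asScala // all keys in `chunk` share a slot
      chunk.zip(values)
    }
  }.toMap
}
```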