RedisLabs / spark-redis

A connector for Spark that allows reading and writing to/from Redis cluster
BSD 3-Clause "New" or "Revised" License

New API Request: Write certain column of DataFrame as Strings in Redis #262

Open VarunWachaspati opened 4 years ago

VarunWachaspati commented 4 years ago

Currently there are two ways to write each row of a DataFrame to Redis:

  1. Hashes (Standard)
  2. Strings via Binary Persistence Model, which makes it difficult for an external consumer to read.

In most of my Spark workloads, there are usually only two columns of interest: a unique_row_identifier and a computed metric/label. Storing a Redis string with unique_row_identifier as the key and the stringified metric/label as the value is beneficial because it gives consumers a simple query pattern and lowers memory consumption on Redis (strings are lighter than hashes).

So it would help if there were an API similar to the following:

df.write
  .format("org.apache.spark.sql.redis")
  .option("model", "string")
  .option("key.column", "unique_row_identifier")
  .option("value.column", "metric/label")
  .save()

The serialization to string of the key/value is the responsibility of the consumer of the library. We can throw an appropriate exception in case of non-string types being passed as keys/values for this model.

In any case, for non-string types we already have the Binary Persistence Model.
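
The type check described above could look roughly like the following. This is only a sketch under stated assumptions: `requireStringColumns` is a hypothetical helper name, not an existing spark-redis function, and the choice of `IllegalArgumentException` is an illustration of "an appropriate exception", not a decision made in this thread.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types.StringType

// Hypothetical helper (not part of spark-redis): rejects key/value columns
// that are not already strings, as the proposed "string" model would require.
def requireStringColumns(df: DataFrame, keyColumn: String, valueColumn: String): Unit = {
  Seq(keyColumn, valueColumn).foreach { name =>
    // StructType.apply(name) itself throws IllegalArgumentException
    // if the column does not exist in the schema.
    val dataType = df.schema(name).dataType
    if (dataType != StringType) {
      throw new IllegalArgumentException(
        s"Column '$name' has type ${dataType.simpleString}; the string model expects a string column")
    }
  }
}
```

With a check like this, the caller stays responsible for serializing values to strings before writing, which matches the division of responsibility proposed above.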

Let me know whether you think this is a valid and feasible new API request.

fe2s commented 4 years ago

Hi @VarunWachaspati, did you consider converting your DataFrame to a key/value pair RDD and saving that to Redis instead? https://github.com/RedisLabs/spark-redis/blob/master/doc/rdd.md#strings-1 It will store the RDD as Redis strings.
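
For illustration, that workaround might look like the sketch below. It assumes the Redis connection is already configured on the Spark session (for example through `spark.redis.host`), and `writeAsRedisStrings` is a hypothetical wrapper name introduced here. The `toRedisKV` call is the RDD API from the linked documentation. Casting both columns to string and dropping rows with nulls are choices made for this sketch, not behavior discussed in the thread.

```scala
import com.redislabs.provider.redis._
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// Hypothetical wrapper: writes two DataFrame columns as plain Redis strings
// by going through the key/value RDD API.
def writeAsRedisStrings(df: DataFrame, keyColumn: String, valueColumn: String): Unit = {
  val kvRdd = df
    // Assumption: stringify both columns so non-string metrics still work.
    .select(col(keyColumn).cast("string"), col(valueColumn).cast("string"))
    // Assumption: skip rows where the key or the value is null.
    .na.drop()
    .rdd
    .map(row => (row.getString(0), row.getString(1)))

  // Each pair is stored as a Redis string: key -> value.
  df.sparkSession.sparkContext.toRedisKV(kvRdd)
}
```

A call such as `writeAsRedisStrings(df, "unique_row_identifier", "metric")` would then leave values that an external consumer can read with a plain `GET`.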

VarunWachaspati commented 4 years ago

Yes, I have been converting my DataFrames to RDDs and then writing them, since the RDD-based APIs are very flexible. I was wondering whether a DataFrame-based API for the same thing would be helpful, as it would be very straightforward and intuitive to use.

fe2s commented 4 years ago

Yep, we might want to introduce it, but for now it's a low priority since one can use the alternative API to achieve the same.