RedisLabs / spark-redis

A connector for Spark that allows reading and writing to/from a Redis cluster
BSD 3-Clause "New" or "Revised" License

Feature Request: improve table deletion speed #332

Open · jeisinge opened this issue 2 years ago

jeisinge commented 2 years ago

Background

We write tens of millions of entries from Spark to Redis tables that are versioned by date, e.g. table_20211203. This works well.

Due to memory constraints, we need to delete these tables after a new deployment. However, deleting a table takes over 30 minutes. Our deletion code looks like this:

import org.apache.spark.sql.SaveMode
import spark.implicits._

// Overwrite the table with a single placeholder row; the 1-second TTL
// makes the placeholder row itself disappear as well.
val emptyRedisTable = Seq(("foo", "bar"))
  .toDF("name", "value")
  .cache

emptyRedisTable
  .write
  .format("org.apache.spark.sql.redis")
  .option("host", RedisHost)
  .option("port", RedisPort)
  .option("table", redisTableName)
  .option("key.column", "name")
  .option("ttl", 1) // expire this placeholder record after 1 second
  .mode(SaveMode.Overwrite)
  .save()

Value

Improve the deletion speed of full tables.

Solution

Document and/or create a better way to delete an entire table written from Spark.

Details

I believe the issue is that the table's rows are stored as regular key-value pairs, one Redis key per row. As a result, a full scan of the keyspace has to complete before the connector knows which keys to delete. A direct scan-and-unlink pass over the table's key prefix, sketched below, may already be faster than the overwrite workaround.
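For reference, here is a minimal sketch of that direct approach, assuming the connector's default layout where each row lives under a key prefixed with the table name (table:key). It talks to Redis through Jedis; the host, port, and batch size are placeholders, and on an actual Redis Cluster the scan would have to be run against every master node.

import redis.clients.jedis.Jedis
import redis.clients.jedis.params.ScanParams

// Hypothetical helper: SCAN for the table's key prefix and UNLINK the
// matches in batches. UNLINK reclaims memory asynchronously, unlike the
// blocking DEL.
def dropRedisTable(host: String, port: Int, table: String): Unit = {
  val jedis = new Jedis(host, port)
  try {
    val params = new ScanParams().`match`(s"$table:*").count(1000)
    var cursor = ScanParams.SCAN_POINTER_START
    do {
      val result = jedis.scan(cursor, params)
      val keys = result.getResult
      if (!keys.isEmpty)
        jedis.unlink(keys.toArray(new Array[String](0)): _*)
      cursor = result.getCursor
    } while (cursor != ScanParams.SCAN_POINTER_START)
  } finally jedis.close()
}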

Instead, what if the table were stored as a single hash? That way, I believe, a simple DEL table_name would suffice. Obviously, the fields would then have to encode both the row and the column, something like id_column_name, so some downstream operations could become more complicated.
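To make the idea concrete, here is a sketch of the proposed layout using Jedis directly. This is not how spark-redis stores data today, and the key and field names are made up: the whole table lives in one hash, with each field encoding a row id and a column name.

import redis.clients.jedis.Jedis

// Proposed layout sketch: one hash per table, fields named "<id>_<column>".
val jedis = new Jedis("localhost", 6379)

// Write two columns of row 42 into the table's hash
jedis.hset("table_20211203", "42_name", "foo")
jedis.hset("table_20211203", "42_value", "bar")

// Reading a single column back now needs the composite field name
val name = jedis.hget("table_20211203", "42_name")

// Dropping the whole table is now a single command; UNLINK frees memory
// asynchronously, so it returns quickly even for a large hash
jedis.unlink("table_20211203")
jedis.close()

The trade-offs would be that a single hash key cannot be spread across cluster nodes, the connector's per-row ttl option would no longer work (a TTL could only be set on the whole hash), and reads have to know the composite field name, which is why downstream operations could get more complicated.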