Background
We write out tens of millions of entries from Spark to a Redis table that is versioned by date - something like table_20211203. This works well.
Due to memory constraints, we need to delete these tables after each new deployment. However, the delete takes over 30 minutes. Our deletion code looks like:
import org.apache.spark.sql.SaveMode
import spark.implicits._ // spark is the active SparkSession

// Overwrite the table with a single dummy row: SaveMode.Overwrite first
// deletes every existing key in the table, and the 1-second TTL makes the
// dummy row itself expire shortly afterwards.
val emptyRedisTable = Seq(("foo", "bar"))
  .toDF("name", "value")
  .cache()

emptyRedisTable
  .write
  .format("org.apache.spark.sql.redis")
  .option("host", RedisHost)
  .option("port", RedisPort)
  .option("table", redisTableName)
  .option("key.column", "name") // column used to build each Redis key
  .option("ttl", 1) // TTL of 1 second, so the dummy record disappears too
  .mode(SaveMode.Overwrite)
  .save()
Value
Improve the speed of deleting full tables.
Solution
Document and/or create a better way to delete an entire table written from Spark.
Details
I believe the issue is that the table's rows are stored as regular per-row key-value entries. As a result, a full keyspace scan has to be completed just to work out which keys belong to the table before they can be deleted.
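For reference, a workaround today has to do that scan from the client side. Below is a minimal sketch using Jedis (not part of this project's API), assuming the default spark-redis layout where each row is written under a key prefixed with the table name; the host, port, and batch size of 10000 are placeholder values:

import redis.clients.jedis.Jedis
import redis.clients.jedis.params.ScanParams

val jedis = new Jedis("localhost", 6379)
// Match every key belonging to the versioned table, e.g. table_20211203:42
val params = new ScanParams().`match`("table_20211203:*").count(10000)
var cursor = ScanParams.SCAN_POINTER_START
do {
  val result = jedis.scan(cursor, params)
  val keys = result.getResult
  if (!keys.isEmpty) {
    // UNLINK frees the memory on a background thread, unlike blocking DEL
    jedis.unlink(keys.toArray(Array.empty[String]): _*)
  }
  cursor = result.getCursor
} while (cursor != ScanParams.SCAN_POINTER_START)
jedis.close()

Even batched like this, SCAN still has to walk the entire keyspace regardless of the MATCH pattern, which is why the delete scales with the total number of keys in the instance rather than with the size of the table.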
Instead, what if the whole table were stored as a single hash? That way, I believe, a simple DEL table_name would suffice. Obviously, the hash fields would have to encode the row id and the column name, something like id_column_name, so some downstream operations could become more complicated.
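To make the idea concrete, here is a hypothetical sketch of that layout - this is not how spark-redis currently stores tables, and the 1_name / 1_value field-naming scheme is only illustrative:

import redis.clients.jedis.Jedis

val jedis = new Jedis("localhost", 6379)

// Writing: each (row id, column) pair becomes one field of a single hash
jedis.hset("table_20211203", "1_name", "foo")
jedis.hset("table_20211203", "1_value", "bar")

// Deleting: one command against one key, no keyspace scan needed.
// UNLINK drops the whole hash and frees the memory asynchronously.
jedis.unlink("table_20211203")
jedis.close()

The trade-off, as noted above, is that per-row reads and filters would have to address fields inside one large hash (e.g. HMGET or HSCAN) instead of independent keys.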