RedisLabs / spark-redis

A connector for Spark that allows reading and writing to/from Redis cluster
BSD 3-Clause "New" or "Revised" License

How to ignore exception during save #197

Open iyer-r opened 5 years ago

iyer-r commented 5 years ago

Hello!

I am using Spark Redis to populate a Redis cluster as follows --

df.sample(sampleRate)
        .write
        .mode(SaveMode.Append)
        .format("org.apache.spark.sql.redis")
        .option("table", table)
        .option("key.column", column)
        .option("ttl", ttl)
        .save()

The Redis cluster itself is hosted on the Kubernetes Engine behind a load balancer service, and the individual nodes can go down and come back during the load process.

If one of the nodes goes down, I see an error like: `Caused by: redis.clients.jedis.exceptions.JedisConnectionException: Failed connecting to host 10.4.1.162:6379`

When this happens, the whole job terminates. Instead, I would like to ignore this exception and try to continue loading.

Is there a way to achieve that?

Regards

fe2s commented 5 years ago

Hi @iyer-r,

I didn't fully understand the idea of continuing the load while a node is down. Could you please clarify how this should work under the hood, in your opinion? Should the connector wait, with a timeout, until the node is available again? There may be a Jedis configuration option for that; I will need to check.

In general, Spark retries a failed task several times before failing the whole job. Do you see these retries in your case? As a workaround, you could try increasing the number of retries.
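For reference, the task-retry count is controlled by Spark's `spark.task.maxFailures` setting (default 4). A minimal sketch of raising it at submit time; the application class and jar names below are placeholders:

```
# Raise task retries from the default of 4 so transient node failures
# are retried more times before the job aborts.
# (the --class and jar values are placeholders)
spark-submit \
  --conf spark.task.maxFailures=8 \
  --class com.example.RedisLoader \
  loader.jar
```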

iyer-r commented 5 years ago

Hi!

The Kubernetes deployment of the Redis cluster uses persistent volumes and stateful sets. As a result, when one of the nodes goes down (for any reason), a new one comes back up with the same data.

Access to the Redis cluster is via a load balancer IP. Consequently, individual requests can go to any of the internal nodes.

As you mention, there are retry attempts; however, the retries all go to the internal node that went down rather than to the load balancer IP. If the retries went to the load balancer IP instead, the save could continue on the other nodes while the failed node comes back up.

Hope that makes sense.
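One driver-side workaround, sketched below (this is a hypothetical helper, not spark-redis functionality): catch the failure at the job level and re-run the entire save, so each attempt opens fresh connections, which go through the load balancer again. `retryWrite` and its parameters are assumptions for illustration:

```scala
import scala.util.control.NonFatal

object RetryWrite {
  // Hypothetical helper: re-runs the whole write action so that each
  // attempt connects from scratch (i.e. via the load-balancer IP)
  // instead of Spark's task retries hitting a dead internal node.
  def retryWrite(maxAttempts: Int, backoffMs: Long)(write: () => Unit): Unit = {
    var attempt = 0
    var done = false
    while (!done) {
      attempt += 1
      try {
        write()
        done = true
      } catch {
        // e.g. a JedisConnectionException while the failed pod restarts
        case NonFatal(e) if attempt < maxAttempts =>
          Thread.sleep(backoffMs)
      }
    }
  }
}
```

Usage would be something like `RetryWrite.retryWrite(5, 2000)(() => df.sample(sampleRate).write.mode(SaveMode.Append).format("org.apache.spark.sql.redis").option("table", table).option("key.column", column).option("ttl", ttl).save())`. Since the write keys rows by `key.column`, a re-run should overwrite the same keys rather than duplicate data, though that is worth verifying for your workload.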