RedisLabs / spark-redis

A connector for Spark that allows reading and writing to/from a Redis cluster
BSD 3-Clause "New" or "Revised" License

Not all the data from the DataFrame got written to Redis #347

Open zhysn1201 opened 2 years ago

zhysn1201 commented 2 years ago

I have built a DataFrame that has roughly 2K records; each record has a timezone field and an id field, where the id is a string of about 240,000 characters.

When I call

    df.write
      .format("org.apache.spark.sql.redis")
      .option("table", s"push:candidates:${date}")
      .option("key.column", "time_zone_indexed")
      .mode(SaveMode.Append)
      .save()

I observed that not all of the data was saved into Redis: only about 500 keys were written. Do you know why?
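For reference, a minimal sketch of how the written keys can be counted with Jedis directly (spark-redis stores each row under a `<table>:<key>` key); the host, port, and exact key pattern below are illustrative assumptions:

    import redis.clients.jedis.Jedis
    import redis.clients.jedis.ScanParams // moves to redis.clients.jedis.params in Jedis 4.x

    // Assumed host/port; adjust to your deployment.
    val jedis = new Jedis("localhost", 6379)
    // SCAN over the keys matching the table prefix and count them.
    var cursor = ScanParams.SCAN_POINTER_START
    var count = 0L
    do {
      val res = jedis.scan(cursor, new ScanParams().`match`("push:candidates:*").count(1000))
      count += res.getResult.size()
      cursor = res.getCursor
    } while (cursor != "0")
    println(s"keys under prefix: $count")
    jedis.close()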

Things I have tried:

  1. Extended the Redis timeout to 10 minutes
  2. Raised spark.redis.max.pipeline.size to 200 (both applied as sketched below)
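A minimal sketch of how these two settings can be applied on the SparkSession (host, port, and app name are illustrative; per the spark-redis docs, spark.redis.timeout is in milliseconds and spark.redis.max.pipeline.size defaults to 100):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("redis-write")                          // illustrative app name
      .config("spark.redis.host", "localhost")         // assumed host
      .config("spark.redis.port", "6379")              // assumed port
      .config("spark.redis.timeout", "600000")         // 10 minutes, in ms
      .config("spark.redis.max.pipeline.size", "200")  // default is 100
      .getOrCreate()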
fe2s commented 2 years ago

Did it fail? Any exceptions?

zhysn1201 commented 2 years ago

Nope. The Spark job seems to have succeeded, and I do not see any exceptions in the internal Spark portal. However, I am not sure if there are other places where I can check the logs. The Spark UI?

fe2s commented 2 years ago

I would check the driver and executor logs for any exceptions. You may also want to check the number of unique keys in your DataFrame.
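The second suggestion matters because spark-redis persists each row under the key `<table>:<key.column value>`, so rows that share the same time_zone_indexed value overwrite one another, which would explain 2K records collapsing to roughly 500 keys. A minimal sketch of that check:

    import org.apache.spark.sql.functions.countDistinct

    val total = df.count()
    val distinctKeys = df.agg(countDistinct("time_zone_indexed")).first().getLong(0)
    // If distinctKeys < total, several rows map to the same Redis key
    // and the last write wins, silently dropping the earlier rows.
    println(s"rows: $total, distinct key values: $distinctKeys")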