RedisLabs / spark-redis

A connector for Spark that allows reading and writing to/from a Redis cluster
BSD 3-Clause "New" or "Revised" License

Not all the data from the DataFrame got written to Redis #347

Open zhysn1201 opened 2 years ago

zhysn1201 commented 2 years ago

I have built a DataFrame that has roughly 2K records; each record has a timezone field and an id field, where the id is a string of about 240,000 characters.

When I call

    df.write
      .format("org.apache.spark.sql.redis")
      .option("table", s"push:candidates:${date}")
      .option("key.column", "time_zone_indexed")
      .mode(SaveMode.Append)
      .save()

I observed that not all of the data was saved into Redis: only about 500 keys were written. Do you know why?
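For reference, a minimal sketch of how the written keys can be counted with Jedis directly (spark-redis stores each row under a `<table>:<key>` key); the host, port, and exact key pattern below are illustrative assumptions:

    import redis.clients.jedis.Jedis
    import redis.clients.jedis.ScanParams // moves to redis.clients.jedis.params in Jedis 4.x

    // Assumed host/port; adjust to your deployment.
    val jedis = new Jedis("localhost", 6379)
    // SCAN over the keys matching the table prefix and count them.
    var cursor = ScanParams.SCAN_POINTER_START
    var count = 0L
    do {
      val res = jedis.scan(cursor, new ScanParams().`match`("push:candidates:*").count(1000))
      count += res.getResult.size()
      cursor = res.getCursor
    } while (cursor != "0")
    println(s"keys under prefix: $count")
    jedis.close()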

Things I have tried:

  1. Extended the Redis timeout to 10 minutes
  2. Raised spark.redis.max.pipeline.size to 200 (both applied as sketched below)
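A minimal sketch of how these two settings can be applied on the SparkSession (host, port, and app name are illustrative; per the spark-redis docs, spark.redis.timeout is in milliseconds and spark.redis.max.pipeline.size defaults to 100):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("redis-write")                          // illustrative app name
      .config("spark.redis.host", "localhost")         // assumed host
      .config("spark.redis.port", "6379")              // assumed port
      .config("spark.redis.timeout", "600000")         // 10 minutes, in ms
      .config("spark.redis.max.pipeline.size", "200")  // default is 100
      .getOrCreate()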
fe2s commented 2 years ago

Did it fail? Any exceptions?

zhysn1201 commented 2 years ago

Nope. The Spark job seems to have succeeded, and I do not see any exceptions in the internal Spark portal. However, I am not sure if there are other places where I can check the logs. The Spark UI?

fe2s commented 2 years ago

I would check the driver and executor logs for any exceptions. You may also want to check the number of unique keys in your DataFrame.
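The second suggestion matters because spark-redis persists each row under the key `<table>:<key.column value>`, so rows that share the same time_zone_indexed value overwrite one another, which would explain 2K records collapsing to roughly 500 keys. A minimal sketch of that check:

    import org.apache.spark.sql.functions.countDistinct

    val total = df.count()
    val distinctKeys = df.agg(countDistinct("time_zone_indexed")).first().getLong(0)
    // If distinctKeys < total, several rows map to the same Redis key
    // and the last write wins, silently dropping the earlier rows.
    println(s"rows: $total, distinct key values: $distinctKeys")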