RedisLabs / spark-redis

A connector for Spark that allows reading and writing to/from Redis cluster
BSD 3-Clause "New" or "Revised" License

Cannot switch redis db to anything other than default db in same Spark session. #257

Closed. DecisionSystems closed this issue 4 years ago.

DecisionSystems commented 4 years ago

For some reason, I can't seem to get a df.write to pick up a change to the Redis db. I can confirm that the pyspark.sql.conf.RuntimeConfig has been changed; however, only the default is ever used.

The following code runs, but the db context does not change to 3, so all tables are written to db 0, the default. If I change the option("keys.pattern", ...) in the df.write to match the df.read, it says the tables already exist. Here's the code. Using Python with Redis 3.4.1 / spark-redis-2.3.1-M2 / Spark 2.3.1 / Python 3.6.5.

    from pyspark.sql import SparkSession

    # The connector jar is attached via spark.jars; the master is set on the builder.
    spark = SparkSession.builder \
        .appName("myApp") \
        .master("spark://spark-master:7077") \
        .config("spark.jars", "app/spark-apps/spark-redis-2.3.1-M2-jar-with-dependencies.jar") \
        .config("spark.redis.host", "redis-cache") \
        .config("spark.redis.port", "6379") \
        .config("spark.redis.dbNum", "0") \
        .getOrCreate()

    sc = spark.sparkContext
    sc.addPyFile("spark-apps/appx_wordcount_redis.zip")
    sc.setLogLevel("ERROR")

    df = spark.read.format("org.apache.spark.sql.redis") \
        .option("keys.pattern", "rec:*") \
        .option("infer.schema", True) \
        .load()
    df.show(10)
    df.printSchema()

    spark.conf.set("spark.redis.db", 3)
    df.write.format("org.apache.spark.sql.redis")\
            .option("table", "recx:*")\
            .option("key.column", "_id").save()

    spark.stop()
fe2s commented 4 years ago

Hi @DecisionSystems, you should use the spark.redis.db config option, not spark.redis.dbNum. Documentation: https://github.com/RedisLabs/spark-redis/blob/master/doc/configuration.md
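
For reference, a minimal sketch of setting the db when the session is first built, with the host and port values carried over from the snippet above:

    from pyspark.sql import SparkSession

    # spark.redis.db selects the Redis logical database used for reads and writes.
    spark = SparkSession.builder \
        .appName("myApp") \
        .config("spark.redis.host", "redis-cache") \
        .config("spark.redis.port", "6379") \
        .config("spark.redis.db", "3") \
        .getOrCreate()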

DecisionSystems commented 4 years ago

Yes. Clearly the documentation needs to be cleaned up, perhaps by removing the references to the dbNum parameter.

I was also able to change the Redis db context for writes within the same Spark session using pyspark.

It turns out that you must redefine the spark session using the .getOrCreate() method. As per the pyspark doc: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html “In case an existing SparkSession is returned, the config options specified in this builder will be applied to the existing SparkSession.”
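
A minimal sketch of that pattern, assuming the session and DataFrame from the snippet above are still live:

    # Re-invoking the builder returns the existing SparkSession and applies the
    # new config option to it, so the subsequent write targets Redis db 3.
    spark = SparkSession.builder \
        .config("spark.redis.db", 3) \
        .getOrCreate()

    df.write.format("org.apache.spark.sql.redis") \
        .option("table", "recx:*") \
        .option("key.column", "_id") \
        .save()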

This is exactly what I needed to do. After that, it worked like a charm. So, I suppose the only issue here is the incorrect reference to dbNum.

Thanks!

fe2s commented 4 years ago

Docs updated in https://github.com/RedisLabs/spark-redis/pull/259