RedisLabs / spark-redis

A connector for Spark that allows reading and writing to/from Redis cluster
BSD 3-Clause "New" or "Revised" License

Read getting stuck at stage 0 #366

Open arturzangiev opened 1 year ago

arturzangiev commented 1 year ago

I am trying to read a DataFrame from a Redis instance on AWS, but the job gets stuck at stage 0.

[Stage 0:> (0 + 1) / 1]

self.__spark = SparkSession.builder\
    .config("spark.jars.packages", "com.redislabs:spark-redis_2.12:3.1.0")\
    .config("spark.redis.host", "AWS-HOST")\
    .config("spark.redis.port", "6379")\
    .getOrCreate()

def __read_redis_keys(self) -> DataFrame:
    df = self.__spark.read.format("org.apache.spark.sql.redis")\
        .option("keys.pattern", "SOME_PATH*")\
        .option("infer.schema", True)\
        .load()
    return df

Environment: Spark 3.3.1, Scala 2.12.15, Java 17.0.1, Python 3.8.14, PySpark 3.3.1, MacBook M1

arturzangiev commented 1 year ago

I managed to figure it out. It is clearly a networking issue related to AWS ElastiCache: once I deployed to EMR, the job executed successfully. What I can't figure out now is why I can't run it locally, since I am on the VPN and can access ElastiCache fine with redis-cli. It looks like Spark, when running locally, can't assign the IP correctly.
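One way to narrow this down might be to check, from the same Python environment that launches Spark, whether a plain TCP connection to the Redis endpoint can be opened at all. The sketch below is only an illustration: `can_connect` is a hypothetical helper (not part of spark-redis or PySpark), and "AWS-HOST" is the same placeholder used in the config above. It tests basic reachability only and says nothing about how the connector itself resolves or connects to nodes.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical diagnostic helper; it only checks that the socket can be
    opened and does not speak the Redis protocol.
    """
    try:
        # create_connection resolves the host name and attempts the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts
        return False


# Example call (placeholder host from the config above, assumed port 6379):
# print(can_connect("AWS-HOST", 6379))
```

If this returns True locally while the Spark job still hangs, the problem is probably not basic reachability of that one endpoint, and the addresses the job actually tries to reach would be the next thing to inspect.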