About RDD immutability

bithw1 commented 5 years ago

Hi @bartosz25

I have a basic question about one of the RDD's characteristics: immutability. I have a simple code snippet:

val rdd = spark.read.jdbc("...").rdd // read a DB table as an RDD
rdd.count // (1)
// here, insert some data into the DB table before counting again
rdd.count // (2)

The first count and the second count are not the same, but they come from the same RDD. That is, immutability doesn't seem to hold in this case, so I no longer understand what immutability really means. What do you think about RDD immutability?

bartosz25 commented 5 years ago

Hi @bithw1

If you call rdd.cache before (1), both counts will normally return the same result, because cache is lazy and is only filled by the first action that runs after it. Otherwise the RDD is recomputed on every action.

Simply put, immutability concerns the RDD, not the data source. If you apply a transformation (map, filter, ...), it creates a new RDD every time and the source RDD stays the same. But as you noticed in your test case, immutability doesn't apply to the physical data, unless you cache it.
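To illustrate the point about transformations, here is a minimal sketch. It assumes a local SparkSession and a small in-memory collection instead of the JDBC table; the object and method names are made up for the example.

```scala
import org.apache.spark.sql.SparkSession

object TransformationDemo {
  // Returns the contents of the source RDD and of two RDDs derived from it.
  def run(spark: SparkSession): (List[Int], List[Int], List[Int]) = {
    // Assumption: an in-memory collection stands in for the real data source.
    val source = spark.sparkContext.parallelize(Seq(1, 2, 3, 4))

    // map and filter each return a new RDD object; `source` itself is not modified.
    val doubled = source.map(_ * 2)
    val evens = source.filter(_ % 2 == 0)
    require((doubled ne source) && (evens ne source), "transformations should return new RDDs")

    (source.collect().toList, doubled.collect().toList, evens.collect().toList)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("transformation-demo")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

After both transformations, the source RDD still collects to 1, 2, 3, 4. Only the derived RDDs hold the mapped and filtered values.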

To give more insight, you can:

enable query logging in your SQL data source and check which queries Spark issues
add a cache call before the first count and compare the two results
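The second experiment can be sketched without a database. This is only an illustration under assumptions: a temporary file plays the role of the DB table, and readRows, tableRdd and the object name are hypothetical helpers, not part of Spark. Because cache is lazy, the sketch marks the RDD as cached before the first count so that the first action fills the cache.

```scala
import java.nio.file.{Files, StandardOpenOption}
import scala.io.Source
import org.apache.spark.sql.SparkSession

object RecomputeVsCacheDemo {
  // Hypothetical stand-in for querying the table: reads the file each time it is called.
  def readRows(path: String): List[String] = {
    val src = Source.fromFile(path)
    try src.getLines().toList finally src.close()
  }

  // Returns (uncached count 1, uncached count 2, cached count 1, cached count 2).
  def run(spark: SparkSession): (Long, Long, Long, Long) = {
    // Assumption: a local temp file with three rows stands in for the DB table.
    val table = Files.createTempFile("fake_table", ".txt")
    Files.write(table, "a\nb\nc\n".getBytes("UTF-8"))
    val path = table.toString

    val sc = spark.sparkContext
    // The file is read inside the closure, so it is read whenever the RDD is computed.
    def tableRdd() = sc.parallelize(Seq(0), numSlices = 1).flatMap(_ => readRows(path))

    val uncached = tableRdd()
    val cached = tableRdd().cache() // lazy: only marks the RDD for caching

    val uncached1 = uncached.count() // (1) reads the file
    val cached1 = cached.count()     // reads the file and fills the cache

    // "Insert a row" into the stand-in table between the two counts.
    Files.write(table, "d\n".getBytes("UTF-8"), StandardOpenOption.APPEND)

    val uncached2 = uncached.count() // (2) recomputed, so the file is read again
    val cached2 = cached.count()     // served from the cache filled earlier

    (uncached1, uncached2, cached1, cached2)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("recompute-vs-cache-demo")
      .getOrCreate()
    try println(run(spark)) finally spark.stop()
  }
}
```

In local mode with data this small, run should return 3 and 4 for the uncached RDD and 3 and 3 for the cached one. Caching is best effort, though: if a cached partition is evicted, Spark recomputes it from the source and the cached count can change as well.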

Hope it helps.

Best regards, Bartosz.

bithw1 commented 5 years ago

Thanks @bartosz25 for your great explanation; it is clear and makes sense to me. In my case, each count action fires a separate query against the table, so the two counts differ. I had been stuck on the difference between the physical data source and the RDD.

bartosz25 / spark-scala-playground

About RDD immutability #5