bartosz25 / spark-scala-playground

Sample processing code using Spark 2.1+ and Scala

RDD cache eviction policy #8

Closed bithw1 closed 5 years ago

bithw1 commented 5 years ago

Hi @bartosz25 I am reading your post http://www.waitingforcode.com/apache-spark/cache-in-spark/read

It has the following text

When cache is stored only on memory, it has a certain amount of reserved space. When this space is full and we want to cache more RDDs, the oldest ones are evicted according to LRU (Least Recently Used) policy.

I think RDD eviction is not using an LRU policy. Given a sequence as follows:

cache RDD1
//after some time
cache RDD2
//after some time
use RDD1
//after some time
cache RDD3

When caching RDD3, let's assume RDD1 and RDD2 are both candidates for eviction to make room for RDD3.

If an LRU policy were used, RDD2 would be evicted, since RDD1 was used more recently.

But I think RDD1 will be evicted.

I think the eviction policy is in MemoryStore#evictBlocksToFreeSpace: it finds the RDDs to evict by iterating over MemoryStore#entries. entries is a LinkedHashMap, which here works like a queue, so the first RDD put into entries is the first one evicted, which in this case is RDD1.
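The distinction at issue here can be reproduced with `java.util.LinkedHashMap` itself, since that is the class backing `MemoryStore#entries`. A minimal standalone sketch (the `RDD1`/`RDD2` keys below are just illustrative names from the scenario above, not Spark APIs); whether a `LinkedHashMap` behaves as an insertion-order queue or as a true access-order LRU depends on the `accessOrder` flag passed to its three-argument constructor:

```java
import java.util.LinkedHashMap;

public class LruDemo {
    public static void main(String[] args) {
        // accessOrder = false (the default for "new LinkedHashMap<>()"):
        // iteration order is pure insertion order, so a get() does NOT
        // refresh an entry's position.
        LinkedHashMap<String, String> insertionOrder =
            new LinkedHashMap<>(16, 0.75f, false);
        insertionOrder.put("RDD1", "block1");
        insertionOrder.put("RDD2", "block2");
        insertionOrder.get("RDD1"); // "use RDD1" from the scenario
        // The head of the iteration order is the first eviction candidate.
        System.out.println(insertionOrder.keySet().iterator().next()); // RDD1

        // accessOrder = true: a get() moves the entry to the tail, so the
        // head is the least *recently used* entry - classic LRU behaviour.
        LinkedHashMap<String, String> accessOrder =
            new LinkedHashMap<>(16, 0.75f, true);
        accessOrder.put("RDD1", "block1");
        accessOrder.put("RDD2", "block2");
        accessOrder.get("RDD1"); // moves RDD1 to the tail
        System.out.println(accessOrder.keySet().iterator().next()); // RDD2
    }
}
```

So under insertion order the first candidate is RDD1 (the oldest insertion), while under access order it is RDD2 (the least recently touched); which behaviour Spark exhibits comes down to which constructor arguments the MemoryStore passes.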

Not sure I have understood correctly, @bartosz25 .

bartosz25 commented 5 years ago

Hi,

IMO your intuition is correct - RDD1 will be evicted. Wikipedia's definition describes the policy Spark implements: https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_recently_used_(LRU) and the official documentation confirms it: https://spark.apache.org/docs/latest/rdd-programming-guide.html#removing-data

Also, I think everything depends on the implementation details: Ehcache's LRU policy, for instance, effectively updates the retrieval time on access too: http://www.ehcache.org/documentation/2.7/apis/cache-eviction-algorithms.html

Best regards, Bartosz.

bithw1 commented 5 years ago

Thanks @bartosz25 for the answer, I understand now, although I still think Spark is not using LRU :-)