Closed — bithw1 closed this issue 5 years ago
Hi,
IMO your intuition is correct: RDD1 will be evicted. The Wikipedia definition describes the LRU policy Spark implements: https://en.wikipedia.org/wiki/Cache_replacement_policies#Least_recently_used_(LRU) and the official documentation confirms it: https://spark.apache.org/docs/latest/rdd-programming-guide.html#removing-data
Also, I think a lot depends on the implementation details, because for instance Ehcache's LRU policy also updates an entry's recency on retrieval: http://www.ehcache.org/documentation/2.7/apis/cache-eviction-algorithms.html
Best regards, Bartosz.
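To make the LRU behaviour referenced above concrete, here is a minimal sketch of an LRU cache built on `java.util.LinkedHashMap`. This is an illustration of the general policy, not Spark's actual code: the `LruCache` class name and capacity of 2 are my own choices. Passing `accessOrder = true` to the constructor makes `get()` refresh an entry's recency, matching the retrieval-updates-recency behaviour described for Ehcache:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal LRU cache sketch (hypothetical class, not Spark code).
// accessOrder = true means get() moves an entry to the "most recent" end,
// so the eldest entry is always the least recently *used* one.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // true = order entries by access, not insertion
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict least recently used when over capacity
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("RDD1", 1);
        cache.put("RDD2", 2);
        cache.get("RDD1");      // RDD1 becomes most recently used
        cache.put("RDD3", 3);   // over capacity: RDD2 (LRU) is evicted
        System.out.println(cache.keySet()); // RDD1 and RDD3 remain
    }
}
```

Under a pure LRU policy like this one, re-reading RDD1 before caching RDD3 would protect RDD1 and evict RDD2 instead.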
Thanks @bartosz25 for the answer, I understand now, although I still think Spark is not using LRU :-)
Hi @bartosz25, I am reading your post http://www.waitingforcode.com/apache-spark/cache-in-spark/read
It contains the following text:
I think RDD eviction is not using an LRU policy. Given a sequence as follows:
When caching RDD3, let's assume RDD1 and RDD2 are both candidates for eviction to make room for RDD3.
If an LRU policy is used, then RDD2 will be evicted.
But I think RDD1 will be evicted.
I think the eviction policy lives in MemoryStore#evictBlocksToFreeSpace: it finds the RDDs to evict by iterating over MemoryStore#entries. Since entries is a LinkedHashMap, which works like a queue, the first RDD put into entries will be evicted first, which here is RDD1.
Not sure I have understood correctly, @bartosz25.
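The crux of the question above is the iteration order of `LinkedHashMap`. It supports two modes, chosen by the third constructor argument: insertion order (the default, where the map behaves like the queue described above) and access order (where `get()` reorders entries, giving LRU semantics). A small sketch of both, using the same RDD1/RDD2 scenario as an assumed access pattern:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class EvictionOrderDemo {
    // Returns the first key an eviction loop iterating the map would see.
    static String evictionCandidate(boolean accessOrder) {
        Map<String, Integer> entries = new LinkedHashMap<>(16, 0.75f, accessOrder);
        entries.put("RDD1", 1);
        entries.put("RDD2", 2);
        entries.get("RDD1"); // RDD1 is used again after RDD2 was cached
        return entries.keySet().iterator().next();
    }

    public static void main(String[] args) {
        // Insertion order: get() does not reorder, so RDD1 is still first.
        System.out.println(evictionCandidate(false)); // RDD1
        // Access order: get() moves RDD1 to the tail, so RDD2 is first.
        System.out.println(evictionCandidate(true));  // RDD2
    }
}
```

So whether evictBlocksToFreeSpace behaves like a FIFO queue or like LRU comes down entirely to how Spark constructs that LinkedHashMap, which is worth checking in the MemoryStore source for the Spark version in question.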