elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Cache#computeIfAbsent can lead to deadlocks #14090

Closed jasontedor closed 9 years ago

jasontedor commented 9 years ago

The current implementation of Cache#computeIfAbsent can lead to deadlocks in situations where dependent key loading occurs. This is because of the locks that are taken to ensure that the loader is invoked at most once per key. In particular, consider two threads t1 and t2 invoking this method for keys k1 and k2, which trigger dependent calls to Cache#computeIfAbsent for keys kd1 and kd2, respectively. If k1 and kd2 are in the same segment, and k2 and kd1 are in the same segment, then:

  1. t1 locks the segment for k1
  2. t2 locks the segment for k2
  3. t1 blocks waiting for the lock for the segment for kd1
  4. t2 blocks waiting for the lock for the segment for kd2

The result is a deadlock. This unfortunate situation surfaced in a failed build, producing the stack traces below.

"elasticsearch[node_s0][warmer][T#5]" ID=19141 WAITING on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@2255a366 owned by "elasticsearch[node_s0][warmer][T#4]" ID=19139
    at sun.misc.Unsafe.park(Native Method)
    - waiting on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@2255a366
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
    at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
    at org.elasticsearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:55)
    at org.elasticsearch.common.cache.Cache$CacheSegment.get(Cache.java:187)
    at org.elasticsearch.common.cache.Cache.get(Cache.java:279)
    at org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:300)
    at org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.load(IndicesFieldDataCache.java:150)
    at org.elasticsearch.index.fielddata.plain.AbstractIndexFieldData.load(AbstractIndexFieldData.java:80)
    at org.elasticsearch.index.fielddata.ordinals.GlobalOrdinalsBuilder.build(GlobalOrdinalsBuilder.java:52)
    at org.elasticsearch.index.fielddata.plain.AbstractIndexOrdinalsFieldData.localGlobalDirect(AbstractIndexOrdinalsFieldData.java:80)
    at org.elasticsearch.index.fielddata.plain.AbstractIndexOrdinalsFieldData.localGlobalDirect(AbstractIndexOrdinalsFieldData.java:41)
    at org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.lambda$load$27(IndicesFieldDataCache.java:179)
    at org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache$$Lambda$366/28099691.load(Unknown Source)
    at org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:311)
    at org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.load(IndicesFieldDataCache.java:174)
    at org.elasticsearch.index.fielddata.plain.AbstractIndexOrdinalsFieldData.loadGlobal(AbstractIndexOrdinalsFieldData.java:68)
    at org.elasticsearch.index.fielddata.plain.AbstractIndexOrdinalsFieldData.loadGlobal(AbstractIndexOrdinalsFieldData.java:41)
    at org.elasticsearch.search.SearchService$FieldDataWarmer$3.run(SearchService.java:1019)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
    Locked synchronizers:
    - java.util.concurrent.ThreadPoolExecutor$Worker@41b4471c
    - java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@23fdb477

"elasticsearch[node_s0][warmer][T#4]" ID=19139 WAITING on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@23fdb477 owned by "elasticsearch[node_s0][warmer][T#5]" ID=19141
    at sun.misc.Unsafe.park(Native Method)
    - waiting on java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@23fdb477
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:967)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1283)
    at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:727)
    at org.elasticsearch.common.util.concurrent.ReleasableLock.acquire(ReleasableLock.java:55)
    at org.elasticsearch.common.cache.Cache$CacheSegment.get(Cache.java:187)
    at org.elasticsearch.common.cache.Cache.get(Cache.java:279)
    at org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:300)
    at org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.load(IndicesFieldDataCache.java:150)
    at org.elasticsearch.index.fielddata.plain.AbstractIndexFieldData.load(AbstractIndexFieldData.java:80)
    at org.elasticsearch.index.fielddata.ordinals.GlobalOrdinalsBuilder.build(GlobalOrdinalsBuilder.java:52)
    at org.elasticsearch.index.fielddata.plain.AbstractIndexOrdinalsFieldData.localGlobalDirect(AbstractIndexOrdinalsFieldData.java:80)
    at org.elasticsearch.index.fielddata.plain.AbstractIndexOrdinalsFieldData.localGlobalDirect(AbstractIndexOrdinalsFieldData.java:41)
    at org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.lambda$load$27(IndicesFieldDataCache.java:179)
    at org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache$$Lambda$366/28099691.load(Unknown Source)
    at org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:311)
    at org.elasticsearch.indices.fielddata.cache.IndicesFieldDataCache$IndexFieldCache.load(IndicesFieldDataCache.java:174)
    at org.elasticsearch.index.fielddata.plain.AbstractIndexOrdinalsFieldData.loadGlobal(AbstractIndexOrdinalsFieldData.java:68)
    at org.elasticsearch.index.fielddata.plain.AbstractIndexOrdinalsFieldData.loadGlobal(AbstractIndexOrdinalsFieldData.java:41)
    at org.elasticsearch.search.SearchService$FieldDataWarmer$3.run(SearchService.java:1019)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
    Locked synchronizers:
    - java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync@2255a366
    - java.util.concurrent.ThreadPoolExecutor$Worker@42f5ef87
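
To make the interleaving concrete, here is a minimal standalone sketch (illustrative only, not Elasticsearch code) that reproduces the same lock ordering: two ReentrantReadWriteLocks stand in for the two cache segments, each thread takes its own segment's write lock (as the loader runs under it), and then tries to read-lock the other segment, as a dependent computeIfAbsent call would. Run as-is, both joins time out and both threads remain blocked.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.locks.ReentrantReadWriteLock;

    // Illustrative reproduction of the reported interleaving; not Elasticsearch code.
    public class SegmentDeadlockSketch {
        static final ReentrantReadWriteLock segment1 = new ReentrantReadWriteLock(); // holds k1 and kd2
        static final ReentrantReadWriteLock segment2 = new ReentrantReadWriteLock(); // holds k2 and kd1
        static final CountDownLatch bothHoldFirstLock = new CountDownLatch(2);

        public static void main(String[] args) throws InterruptedException {
            Thread t1 = new Thread(() -> {
                segment1.writeLock().lock(); // step 1: t1 locks the segment for k1
                try {
                    arrive();
                    segment2.readLock().lock(); // step 3: blocks on segment for kd1, write-held by t2
                    segment2.readLock().unlock();
                } finally {
                    segment1.writeLock().unlock();
                }
            }, "t1");

            Thread t2 = new Thread(() -> {
                segment2.writeLock().lock(); // step 2: t2 locks the segment for k2
                try {
                    arrive();
                    segment1.readLock().lock(); // step 4: blocks on segment for kd2, write-held by t1
                    segment1.readLock().unlock();
                } finally {
                    segment2.writeLock().unlock();
                }
            }, "t2");

            t1.start();
            t2.start();
            t1.join(2000); // both joins time out: the threads are deadlocked
            t2.join(2000);
            System.out.println("t1 alive: " + t1.isAlive() + ", t2 alive: " + t2.isAlive());
        }

        // Ensure both threads hold their first (write) lock before either proceeds.
        static void arrive() {
            bothHoldFirstLock.countDown();
            try {
                bothHoldFirstLock.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
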
jasontedor commented 9 years ago

There are two pull requests open to address this issue. The first is #14068, which relaxes the constraint that the loader passed to Cache#computeIfAbsent is invoked at most once per key. The second is #14091, which changes the synchronization mechanism to be the key itself, so that loading does not occur under the segment lock.

Only one of these two pull requests should be merged into master, but given the feedback on #14068 from @jpountz, I wanted to explore a different approach to solving the deadlock issue.
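
For illustration, here is a rough sketch of the per-key synchronization idea behind #14091, assuming a future-per-key scheme; the class name PerKeyLoadingSketch and all details are hypothetical and not the PR's actual code. The map insertion is the only synchronized step, so the loader runs with no segment lock held, dependent computeIfAbsent calls cannot deadlock on segment locks, and the loader is still invoked at most once per key.

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutionException;
    import java.util.function.Function;

    // Hypothetical illustration of per-key synchronization; not the code in #14091.
    public class PerKeyLoadingSketch<K, V> {
        private final ConcurrentHashMap<K, CompletableFuture<V>> map = new ConcurrentHashMap<>();

        public V computeIfAbsent(K key, Function<K, V> loader)
                throws ExecutionException, InterruptedException {
            CompletableFuture<V> future = map.get(key);
            if (future == null) {
                CompletableFuture<V> created = new CompletableFuture<>();
                future = map.putIfAbsent(key, created); // only a brief internal map lock
                if (future == null) {
                    // This thread won the race; run the loader with no lock held, so a
                    // dependent computeIfAbsent for another key cannot deadlock.
                    try {
                        created.complete(loader.apply(key));
                    } catch (RuntimeException e) {
                        map.remove(key, created); // do not cache failed loads
                        created.completeExceptionally(e);
                    }
                    future = created;
                }
            }
            // Other callers for the same key block on its future, not on a segment lock.
            return future.get();
        }
    }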