Terracotta-OSS / offheap-store

A library that provides a set of map and cache implementations that store data outside of the normal Java heap
Apache License 2.0
90 stars 46 forks source link

NullPointer #64

Open iainstarks opened 4 years ago

iainstarks commented 4 years ago

We are using clutered ehcache 3.8.1 and after some time ( about 6 hours and under some load ) we see the exception below. It appears to then lose cache entries. The exception then repeats roughly every second.

Prior to that there was a single warning - which is also pasted below some 15 mins before the repeating error. Earlier still there was a single ERROR, again I've pasted this below.

Any suggestions of what the cause might be greatly appreciated.

Repeating error:

2019-12-06 22:41:55,843 [WorkerThread(request_processor_stage, 2, 2)] ERROR org.ehcache.clustered.server.store.ClusterTierActiveEntity - Unexpected exception raised during operation: APPEND
java.lang.NullPointerException: null
at org.terracotta.offheapstore.paging.OffHeapStorageArea.readInt(OffHeapStorageArea.java:173)
at org.ehcache.clustered.server.offheap.OffHeapChainStorageEngine.isHead(OffHeapChainStorageEngine.java:776)
at org.ehcache.clustered.server.offheap.OffHeapChainStorageEngine.findHead(OffHeapChainStorageEngine.java:769)
at org.ehcache.clustered.server.offheap.OffHeapChainStorageEngine.access$1300(OffHeapChainStorageEngine.java:42)
at org.ehcache.clustered.server.offheap.OffHeapChainStorageEngine$StorageOwner.evictAtAddress(OffHeapChainStorageEngine.java:783)
at org.terracotta.offheapstore.paging.OffHeapStorageArea.release(OffHeapStorageArea.java:579)
at org.terracotta.offheapstore.paging.OffHeapStorageArea.shrink(OffHeapStorageArea.java:681)
at org.ehcache.clustered.server.offheap.OffHeapChainStorageEngine.shrink(OffHeapChainStorageEngine.java:340)
at org.terracotta.offheapstore.AbstractLockedOffHeapHashMap.shrink(AbstractLockedOffHeapHashMap.java:500)
at org.ehcache.clustered.server.offheap.OffHeapChainMap.shrink(OffHeapChainMap.java:345)
at org.ehcache.clustered.server.offheap.OffHeapServerStore.tryShrinkOthers(OffHeapServerStore.java:242)
at org.ehcache.clustered.server.offheap.OffHeapServerStore.handleOversizeMappingException(OffHeapServerStore.java:212)
at org.ehcache.clustered.server.offheap.OffHeapServerStore.consumeOversizeMappingException(OffHeapServerStore.java:196)
at org.ehcache.clustered.server.offheap.OffHeapServerStore.append(OffHeapServerStore.java:124)
at org.ehcache.clustered.server.ServerStoreImpl.append(ServerStoreImpl.java:86)
at org.ehcache.clustered.server.store.ClusterTierActiveEntity.invokeServerStoreOperation(ClusterTierActiveEntity.java:425)
at org.ehcache.clustered.server.store.ClusterTierActiveEntity.invokeActiveInternal(ClusterTierActiveEntity.java:329)
at org.terracotta.client.message.tracker.OOOMessageHandlerImpl.invoke(OOOMessageHandlerImpl.java:75)
at org.ehcache.clustered.server.store.ClusterTierActiveEntity.invokeActive(ClusterTierActiveEntity.java:317)
at org.ehcache.clustered.server.store.ClusterTierActiveEntity.invokeActive(ClusterTierActiveEntity.java:134)
at com.tc.objectserver.entity.ManagedEntityImpl.performAction(ManagedEntityImpl.java:879)
at com.tc.objectserver.entity.ManagedEntityImpl.lambda$invoke$12(ManagedEntityImpl.java:591)
at java.util.Optional.ifPresent(Optional.java:159)
at com.tc.objectserver.entity.ManagedEntityImpl.invoke(ManagedEntityImpl.java:591)
at com.tc.objectserver.entity.ManagedEntityImpl.lambda$processInvokeRequest$5(ManagedEntityImpl.java:376)
at com.tc.objectserver.entity.ManagedEntityImpl$SchedulingRunnable.accept(ManagedEntityImpl.java:1311)
at com.tc.objectserver.entity.ManagedEntityImpl$SchedulingRunnable.accept(ManagedEntityImpl.java:1247)
at com.tc.objectserver.entity.RequestProcessor$EntityRequest.invoke(RequestProcessor.java:182)
at com.tc.objectserver.entity.RequestProcessor$EntityRequest.run(RequestProcessor.java:172)
at com.tc.objectserver.entity.RequestProcessorHandler.handleEvent(RequestProcessorHandler.java:28)
at com.tc.objectserver.entity.RequestProcessorHandler.handleEvent(RequestProcessorHandler.java:24)
at com.tc.async.impl.StageImpl.lambda$null$1(StageImpl.java:115)
at com.tc.async.impl.MonitoringEventCreator.lambda$createEvent$0(MonitoringEventCreator.java:77)
at com.tc.async.impl.StageImpl$WorkerThread.run(StageImpl.java:307)

Warning:

2019-12-06 22:26:29,634 [WorkerThread(request_processor_stage, 3, 3)] WARN org.terracotta.offheapstore.OffHeapHashMap - Expanding reprobe sequence from 512 slots to 1024 slots [load-factor=0.48480225]

Single ERROR:

2019-12-06 04:53:40,671 [WorkerThread(request_processor_stage, 4, 4)] ERROR org.ehcache.clustered.server.store.ClusterTierActiveEntity - Unexpected exception raised during operation: ITERATOR_OPEN
java.lang.NullPointerException: null
at org.terracotta.offheapstore.paging.OffHeapStorageArea.readInt(OffHeapStorageArea.java:173)
at org.ehcache.clustered.server.offheap.OffHeapChainStorageEngine.readElementLength(OffHeapChainStorageEngine.java:162)
at org.ehcache.clustered.server.offheap.OffHeapChainStorageEngine.access$700(OffHeapChainStorageEngine.java:42)
at org.ehcache.clustered.server.offheap.OffHeapChainStorageEngine$AttachedInternalChain.readElementBuffer(OffHeapChainStorageEngine.java:687)
at org.ehcache.clustered.server.offheap.OffHeapChainStorageEngine$AttachedInternalChain.detach(OffHeapChainStorageEngine.java:473)
at org.ehcache.clustered.server.offheap.OffHeapChainMap$HeadMap$1.create(OffHeapChainMap.java:368)
at org.ehcache.clustered.server.offheap.OffHeapChainMap$HeadMap$1.create(OffHeapChainMap.java:362)
at org.terracotta.offheapstore.OffHeapHashMap$HashIterator.(OffHeapHashMap.java:1390)
at org.terracotta.offheapstore.OffHeapHashMap$EntryIterator.(OffHeapHashMap.java:1525)
at org.terracotta.offheapstore.AbstractLockedOffHeapHashMap$LockedEntryIterator.(AbstractLockedOffHeapHashMap.java:385)
at org.ehcache.clustered.server.offheap.OffHeapChainMap$HeadMap$1.(OffHeapChainMap.java:362)
at org.ehcache.clustered.server.offheap.OffHeapChainMap$HeadMap.detachedEntryIterator(OffHeapChainMap.java:362)
at org.ehcache.clustered.server.offheap.OffHeapChainMap.iterator(OffHeapChainMap.java:243)
at org.ehcache.clustered.server.offheap.OffHeapServerStore$1.getNextIterator(OffHeapServerStore.java:374)
at org.ehcache.clustered.server.offheap.OffHeapServerStore$AggregateIterator.hasNext(OffHeapServerStore.java:410)
at org.ehcache.clustered.server.store.ClusterTierActiveEntity.iteratorBatch(ClusterTierActiveEntity.java:616)
at org.ehcache.clustered.server.store.ClusterTierActiveEntity.invokeServerStoreOperation(ClusterTierActiveEntity.java:518)
at org.ehcache.clustered.server.store.ClusterTierActiveEntity.invokeActiveInternal(ClusterTierActiveEntity.java:329)
at org.terracotta.client.message.tracker.OOOMessageHandlerImpl.invoke(OOOMessageHandlerImpl.java:79)
at org.ehcache.clustered.server.store.ClusterTierActiveEntity.invokeActive(ClusterTierActiveEntity.java:317)
at org.ehcache.clustered.server.store.ClusterTierActiveEntity.invokeActive(ClusterTierActiveEntity.java:134)
at com.tc.objectserver.entity.ManagedEntityImpl.performAction(ManagedEntityImpl.java:879)
at com.tc.objectserver.entity.ManagedEntityImpl.lambda$invoke$12(ManagedEntityImpl.java:591)
at java.util.Optional.ifPresent(Optional.java:159)
at com.tc.objectserver.entity.ManagedEntityImpl.invoke(ManagedEntityImpl.java:591)
at com.tc.objectserver.entity.ManagedEntityImpl.lambda$processInvokeRequest$5(ManagedEntityImpl.java:376)
at com.tc.objectserver.entity.ManagedEntityImpl$SchedulingRunnable.accept(ManagedEntityImpl.java:1311)
at com.tc.objectserver.entity.ManagedEntityImpl$SchedulingRunnable.accept(ManagedEntityImpl.java:1247)
at com.tc.objectserver.entity.RequestProcessor$EntityRequest.invoke(RequestProcessor.java:182)
at com.tc.objectserver.entity.RequestProcessor$EntityRequest.run(RequestProcessor.java:172)
at com.tc.objectserver.entity.RequestProcessorHandler.handleEvent(RequestProcessorHandler.java:28)
at com.tc.objectserver.entity.RequestProcessorHandler.handleEvent(RequestProcessorHandler.java:24)
at com.tc.async.impl.StageImpl.lambda$null$1(StageImpl.java:115)
at com.tc.async.impl.MonitoringEventCreator.lambda$createEvent$0(MonitoringEventCreator.java:77)
at com.tc.async.impl.StageImpl$WorkerThread.run(StageImpl.java:307)

Terracotta is configured with a single off-heap resource:

<ohr:offheap-resources>
<ohr:resource name="offheap" unit="MB">2048</ohr:resource>
</ohr:offheap-resources>

and the cache being exercised:

<ehcache:cache alias="mycaache">
<ehcache:key-type copier="org.ehcache.impl.copy.SerializingCopier">java.lang.String</ehcache:key-type>
<ehcache:value-type copier="org.ehcache.impl.copy.SerializingCopier">MyCacheValue</ehcache:value-type>
<ehcache:expiry>
  <ehcache:tti unit="days">1</ehcache:tti>
</ehcache:expiry>

<ehcache:resources>
  <ehcache:heap unit="MB">100</ehcache:heap>
  <ehcache:offheap unit="MB">150</ehcache:offheap>
  <terracotta:clustered-dedicated unit="MB">200</terracotta:clustered-dedicated>
</ehcache:resources>
<terracotta:clustered-store consistency="strong"/>

</ehcache:cache>
chrisdennis commented 4 years ago

@iainstarks I can certainly see the bug that caused the initial error. The second error is 'harmless' although it may be a symptom of others things happening. I'm not fully confident that the bug that caused the initial error is also the cause of the subsequent exceptions however.

If you could describe more about how you use the cache that might help. In particular:

If you could upload a full set of logs somewhere so I can look through them that might be useful too. In the meantime I can build you a copy of Ehcache 3 with this first fix applied, and we can cross out fingers that I'm missing something and this does fix both problems.

iainstarks commented 4 years ago

Hi Chris, Thanks for the response. The cache holds some location info, so the key is a String and the value is a Pojo with the following content:public class LocationCheck implements Serializable { /* */ private static final long serialVersionUID = 3033653990506158489L; private final LocationCheckType type; private final String alias; private final String region; private final long lastChecked; private final String result; private final String lat; private final String lng; private final long distanceInMiles; private final boolean passed; private final int delay; private final StatusInfo statusInfo;

LocationCheckType is an enum StatusInfo a wrapper around a List of ErrorInfo - which a class with two Strings

The list will usually be empty or have a single entry. Not very big.

I think a log will be small enough to email. I'll try that.

iainstarks commented 4 years ago

Drop me an email to iainstarks@gmail.com and I'll send you the logs.

As for operations, we do a put, get a few times and then most likely a remove. We might repeat the put if the info changes. It has a tti of one day.

I would guess there would be a few thousand entries.

There are a few other caches, but I would think that one is the likely culprit.

chrisdennis commented 4 years ago

I think I know what the bigger issue is and have a fix lined up. More accurately I've found a real issue and it might be the cause of your bug. Do you have a testing environment we can try a build out in to see if this fixes the issue for you?

iainstarks commented 4 years ago

Replied by email, but just to re-iterate: no this issue has only appeared for us on a production platform. We've tried fairly hard to reproduce it elsewhere, with load tests, and general testing and it has also not occurred on different production environments. If you have some pointers as to what might cause it and how we might reproduce then that would be great. I'd really like to be able to do that so we could have some confidence that the issue is fixed and won't happen again. We're a little backed into a corner otherwise as you can imagine.