Open charvolant opened 8 years ago
The hang is possibly the same as that in https://github.com/AtlasOfLivingAustralia/biocache-store/issues/144
In the past I have seen the Lucene lock-obtain failure when a thread crashes for any reason: the exception isn't propagated, just silently logged and ignored, even though the index is left corrupted or incomplete. I filed issue #124 for a similar case to this.
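To illustrate the failure mode being described, here is a minimal sketch (not the project's actual code; names are illustrative) of how a worker thread's crash can be rethrown in the driver via `Future.get()`, instead of dying silently inside the thread and leaving the index incomplete:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FailFastIndexing {
    // Runs each task and rethrows the first worker failure, so the job
    // aborts loudly instead of quietly producing a partial index.
    static void runAll(List<Callable<Void>> tasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Void>> futures = new ArrayList<>();
            for (Callable<Void> t : tasks) {
                futures.add(pool.submit(t));
            }
            for (Future<Void> f : futures) {
                try {
                    f.get(); // rethrows any worker exception, wrapped
                } catch (ExecutionException e) {
                    throw new RuntimeException("indexing worker failed", e.getCause());
                }
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<Callable<Void>> tasks = new ArrayList<>();
        tasks.add(() -> null);                                         // healthy worker
        tasks.add(() -> { throw new OutOfMemoryError("simulated"); }); // crashing worker
        try {
            runAll(tasks);
            System.out.println("completed");
        } catch (Throwable t) {
            System.out.println("job aborted: " + t.getCause());
        }
    }
}
```

With this pattern an `OutOfMemoryError` in a worker surfaces in the driver, rather than being swallowed by the thread that hit it.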
So possibly an OutOfMemoryError? That would fit it appearing when more indexing is required.
Yes, OOM is what I originally found, but connection errors like #144 may also trigger it if they occur inside the threads doing the indexing. The log file should show what the base cause was, but it will be buried somewhere in the middle, because the exception doesn't crash the job early as it should.
Ah-ha:

```
Exception in thread "Thread-5" java.lang.RuntimeException: exception while unregistering MBean, com.scale7.cassandra.pelops.pool:type=PooledNode-occ-bie-aws.ala.org.au
	at org.scale7.cassandra.pelops.JmxMBeanManager.unregisterMBean(JmxMBeanManager.java:78)
	at org.scale7.cassandra.pelops.pool.PooledNode.<init>(PooledNode.java:68)
	at org.scale7.cassandra.pelops.pool.CommonsBackedPool.addNode(CommonsBackedPool.java:415)
	at org.scale7.cassandra.pelops.pool.CommonsBackedPool.<init>(CommonsBackedPool.java:137)
	at org.scale7.cassandra.pelops.pool.CommonsBackedPool.<init>(CommonsBackedPool.java:88)
	at org.scale7.cassandra.pelops.Pelops.addPool(Pelops.java:62)
	at au.org.ala.biocache.persistence.CassandraPersistenceManager.initialise(CassandraPersistenceManager.scala:55)
	at au.org.ala.biocache.persistence.CassandraPersistenceManager.getColumnsFromRowsWithRetries(CassandraPersistenceManager.scala:380)
	at au.org.ala.biocache.persistence.CassandraPersistenceManager.pageOver(CassandraPersistenceManager.scala:346)
	at au.org.ala.biocache.persistence.CassandraPersistenceManager.pageOverAll(CassandraPersistenceManager.scala:470)
	at au.org.ala.biocache.index.IndexRunner.run(IndexRecordMultiThreaded.scala:466)
	at java.lang.Thread.run(Thread.java:745)
Caused by: javax.management.InstanceNotFoundException: com.scale7.cassandra.pelops.pool:type=PooledNode-occ-bie-aws.ala.org.au
	at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1095)
	at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.exclusiveUnregisterMBean(DefaultMBeanServerInterceptor.java:427)
	at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.unregisterMBean(DefaultMBeanServerInterceptor.java:415)
	at com.sun.jmx.mbeanserver.JmxMBeanServer.unregisterMBean(JmxMBeanServer.java:546)
	at org.scale7.cassandra.pelops.JmxMBeanManager.unregisterMBean(JmxMBeanManager.java:75)
	... 11 more
```
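The root cause above is an `InstanceNotFoundException`: the Pelops pool tries to unregister an MBean that was never registered (or was already removed), and turns that into a fatal `RuntimeException`. As an illustration only (Pelops is third-party code, and the names here are hypothetical), a defensive unregister would check registration first and demote the failure to a log line:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class SafeUnregister {
    // Unregister an MBean only if it is actually registered, so a missing
    // instance is logged instead of escaping as InstanceNotFoundException.
    static void unregisterQuietly(MBeanServer mbs, ObjectName name) {
        try {
            if (mbs.isRegistered(name)) {
                mbs.unregisterMBean(name);
            } else {
                System.out.println("MBean not registered, skipping: " + name);
            }
        } catch (Exception e) {
            System.out.println("failed to unregister " + name + ": " + e);
        }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        // hypothetical name, mirroring the shape of the one in the trace
        ObjectName name = new ObjectName(
            "com.example.pool:type=PooledNode-some-host.example.org");
        unregisterQuietly(mbs, name); // never registered: no exception thrown
    }
}
```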
That looks like it fits the behaviour I saw: it is being pushed through as a RuntimeException, which may be escaping through the checked-Exception case at:
Actually, that looks like it may have happened just before the try block, further up in CassandraPersistenceManager.
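For anyone following along, the "escaping through the checked Exception case" point is easy to reproduce in miniature. In a catch clause that names only a checked exception type, an unchecked `RuntimeException` bypasses the handler entirely (the class and method names below are illustrative, not from biocache-store):

```java
import java.io.IOException;

public class EscapeDemo {
    // Declares the checked IOException, but actually fails unchecked,
    // like the MBean failure being wrapped in a RuntimeException.
    static void doWork() throws IOException {
        throw new RuntimeException("exception while unregistering MBean");
    }

    public static void main(String[] args) {
        try {
            try {
                doWork();
            } catch (IOException e) {
                // only the checked exception is handled here; the
                // RuntimeException from doWork() skips this catch
                System.out.println("handled checked: " + e.getMessage());
            }
        } catch (RuntimeException e) {
            System.out.println("escaped as unchecked: " + e.getMessage());
        }
    }
}
```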
https://github.com/AtlasOfLivingAustralia/biocache-store/commit/349791afaa15c8e3c22a288f8a686304b967ed6c adds more logging and makes retries more robust. Hopefully this helps.
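The general shape of "more logging and more robust retries" (a sketch of the pattern only, not the code in that commit) is a retry loop that logs every failed attempt and rethrows the last exception, so transient connection errors are retried but a persistent fault still fails the job visibly:

```java
import java.util.concurrent.Callable;

public class Retry {
    // Retry an operation a fixed number of times, logging each failure and
    // rethrowing the last one so the job fails loudly instead of hanging.
    static <T> T withRetries(int attempts, Callable<T> op) throws Exception {
        Exception last = null;
        for (int i = 1; i <= attempts; i++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                System.out.println("attempt " + i + " failed: " + e.getMessage());
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // simulated operation that succeeds on the third attempt
        String result = withRetries(5, () -> {
            calls[0]++;
            if (calls[0] < 3) throw new RuntimeException("transient connection error");
            return "ok";
        });
        System.out.println("result=" + result + " after " + calls[0] + " attempts");
    }
}
```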
Can this be closed, @charvolant @ansell?
The change may prevent the exception from escaping by suppressing it, but would that then cause a partial index to be created?
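One way to suppress per-record exceptions without silently publishing a partial index is to record any worker failure in a shared flag and skip the final commit when it is set. A minimal sketch of that idea (assumed pattern, not the project's code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class GuardedCommit {
    public static void main(String[] args) {
        AtomicBoolean workerFailed = new AtomicBoolean(false);

        // simulated workers: failures are recorded, not just logged
        Runnable good = () -> { /* indexed a batch successfully */ };
        Runnable bad = () -> { workerFailed.set(true); };
        good.run();
        bad.run();

        // refuse the final index commit if anything failed, so a
        // partial index is never published as complete
        if (workerFailed.get()) {
            System.out.println("worker failure recorded; aborting index commit");
        } else {
            System.out.println("committing index");
        }
    }
}
```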
The bulk-indexer seems to sometimes stop indexing before
or
At this point the re-process hangs