Open charvolant opened 8 years ago
The hang is possibly the same as that in https://github.com/AtlasOfLivingAustralia/biocache-store/issues/144
In the past I have seen the Lucene lock-obtain failure when a thread crashes for any reason: the exception isn't propagated, just silently logged and ignored, even though the index is left corrupted or incomplete. I filed issue #124 for a similar case to this.
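To illustrate the failure mode being described, here is a minimal sketch (not the project's actual code; names are illustrative) of how a worker thread's crash can be rethrown in the driver via `Future.get()`, instead of dying silently inside the thread and leaving the index incomplete:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class FailFastIndexing {
    // Runs each task and rethrows the first worker failure, so the job
    // aborts loudly instead of quietly producing a partial index.
    static void runAll(List<Callable<Void>> tasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Void>> futures = new ArrayList<>();
            for (Callable<Void> t : tasks) {
                futures.add(pool.submit(t));
            }
            for (Future<Void> f : futures) {
                try {
                    f.get(); // rethrows any worker exception, wrapped
                } catch (ExecutionException e) {
                    throw new RuntimeException("indexing worker failed", e.getCause());
                }
            }
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        List<Callable<Void>> tasks = new ArrayList<>();
        tasks.add(() -> null);                                         // healthy worker
        tasks.add(() -> { throw new OutOfMemoryError("simulated"); }); // crashing worker
        try {
            runAll(tasks);
            System.out.println("completed");
        } catch (Throwable t) {
            System.out.println("job aborted: " + t.getCause());
        }
    }
}
```

With this pattern an `OutOfMemoryError` in a worker surfaces in the driver, rather than being swallowed by the thread that hit it.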
So possibly an OutOfMemoryError? That would fit it appearing when more indexing is required.
Yes, OOM is what I originally found, but connection errors like #144 may also trigger it if they occur inside the threads doing the indexing. The log file should show what the base cause was, but it will be buried somewhere in the middle, because the exception doesn't crash the job early as it should.
Ah-ha:

```
Exception in thread "Thread-5" java.lang.RuntimeException: exception while unregistering MBean, com.scale7.cassandra.pelops.pool:type=PooledNode-occ-bie-aws.ala.org.au
	at org.scale7.cassandra.pelops.JmxMBeanManager.unregisterMBean(JmxMBeanManager.java:78)
	at org.scale7.cassandra.pelops.pool.PooledNode.<init>(PooledNode.java:68)
	at org.scale7.cassandra.pelops.pool.CommonsBackedPool.addNode(CommonsBackedPool.java:415)
	at org.scale7.cassandra.pelops.pool.CommonsBackedPool.<init>(CommonsBackedPool.java:137)
	at org.scale7.cassandra.pelops.pool.CommonsBackedPool.<init>(CommonsBackedPool.java:88)
	at org.scale7.cassandra.pelops.Pelops.addPool(Pelops.java:62)
	at au.org.ala.biocache.persistence.CassandraPersistenceManager.initialise(CassandraPersistenceManager.scala:55)
	at au.org.ala.biocache.persistence.CassandraPersistenceManager.getColumnsFromRowsWithRetries(CassandraPersistenceManager.scala:380)
	at au.org.ala.biocache.persistence.CassandraPersistenceManager.pageOver(CassandraPersistenceManager.scala:346)
	at au.org.ala.biocache.persistence.CassandraPersistenceManager.pageOverAll(CassandraPersistenceManager.scala:470)
	at au.org.ala.biocache.index.IndexRunner.run(IndexRecordMultiThreaded.scala:466)
	at java.lang.Thread.run(Thread.java:745)
Caused by: javax.management.InstanceNotFoundException: com.scale7.cassandra.pelops.pool:type=PooledNode-occ-bie-aws.ala.org.au
	at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.getMBean(DefaultMBeanServerInterceptor.java:1095)
	at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.exclusiveUnregisterMBean(DefaultMBeanServerInterceptor.java:427)
	at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.unregisterMBean(DefaultMBeanServerInterceptor.java:415)
	at com.sun.jmx.mbeanserver.JmxMBeanServer.unregisterMBean(JmxMBeanServer.java:546)
	at org.scale7.cassandra.pelops.JmxMBeanManager.unregisterMBean(JmxMBeanManager.java:75)
	... 11 more
```
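The root cause above is an `InstanceNotFoundException`: the Pelops pool tries to unregister an MBean that was never registered (or was already removed), and turns that into a fatal `RuntimeException`. As an illustration only (Pelops is third-party code, and the names here are hypothetical), a defensive unregister would check registration first and demote the failure to a log line:

```java
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class SafeUnregister {
    // Unregister an MBean only if it is actually registered, so a missing
    // instance is logged instead of escaping as InstanceNotFoundException.
    static void unregisterQuietly(MBeanServer mbs, ObjectName name) {
        try {
            if (mbs.isRegistered(name)) {
                mbs.unregisterMBean(name);
            } else {
                System.out.println("MBean not registered, skipping: " + name);
            }
        } catch (Exception e) {
            System.out.println("failed to unregister " + name + ": " + e);
        }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer mbs = ManagementFactory.getPlatformMBeanServer();
        // hypothetical name, mirroring the shape of the one in the trace
        ObjectName name = new ObjectName(
            "com.example.pool:type=PooledNode-some-host.example.org");
        unregisterQuietly(mbs, name); // never registered: no exception thrown
    }
}
```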
That looks like it fits the behaviour I saw: it is being pushed through as a RuntimeException, which may be escaping through the checked-Exception case at:
Actually, that looks like it may have happened just before the try block, further up in CassandraPersistenceManager.
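For anyone following along, the "escaping through the checked Exception case" point is easy to reproduce in miniature. In a catch clause that names only a checked exception type, an unchecked `RuntimeException` bypasses the handler entirely (the class and method names below are illustrative, not from biocache-store):

```java
import java.io.IOException;

public class EscapeDemo {
    // Declares the checked IOException, but actually fails unchecked,
    // like the MBean failure being wrapped in a RuntimeException.
    static void doWork() throws IOException {
        throw new RuntimeException("exception while unregistering MBean");
    }

    public static void main(String[] args) {
        try {
            try {
                doWork();
            } catch (IOException e) {
                // only the checked exception is handled here; the
                // RuntimeException from doWork() skips this catch
                System.out.println("handled checked: " + e.getMessage());
            }
        } catch (RuntimeException e) {
            System.out.println("escaped as unchecked: " + e.getMessage());
        }
    }
}
```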
https://github.com/AtlasOfLivingAustralia/biocache-store/commit/349791afaa15c8e3c22a288f8a686304b967ed6c adds more logging and makes retries more robust. Hopefully this helps.
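The general shape of "more logging and more robust retries" (a sketch of the pattern only, not the code in that commit) is a retry loop that logs every failed attempt and rethrows the last exception, so transient connection errors are retried but a persistent fault still fails the job visibly:

```java
import java.util.concurrent.Callable;

public class Retry {
    // Retry an operation a fixed number of times, logging each failure and
    // rethrowing the last one so the job fails loudly instead of hanging.
    static <T> T withRetries(int attempts, Callable<T> op) throws Exception {
        Exception last = null;
        for (int i = 1; i <= attempts; i++) {
            try {
                return op.call();
            } catch (Exception e) {
                last = e;
                System.out.println("attempt " + i + " failed: " + e.getMessage());
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};
        // simulated operation that succeeds on the third attempt
        String result = withRetries(5, () -> {
            calls[0]++;
            if (calls[0] < 3) throw new RuntimeException("transient connection error");
            return "ok";
        });
        System.out.println("result=" + result + " after " + calls[0] + " attempts");
    }
}
```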
Can this be closed, @charvolant @ansell?
The change may prevent the exception from escaping by suppressing it, but would that then cause a partial index to be created?
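One way to suppress per-record exceptions without silently publishing a partial index is to record any worker failure in a shared flag and skip the final commit when it is set. A minimal sketch of that idea (assumed pattern, not the project's code):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class GuardedCommit {
    public static void main(String[] args) {
        AtomicBoolean workerFailed = new AtomicBoolean(false);

        // simulated workers: failures are recorded, not just logged
        Runnable good = () -> { /* indexed a batch successfully */ };
        Runnable bad = () -> { workerFailed.set(true); };
        good.run();
        bad.run();

        // refuse the final index commit if anything failed, so a
        // partial index is never published as complete
        if (workerFailed.get()) {
            System.out.println("worker failure recorded; aborting index commit");
        } else {
            System.out.println("committing index");
        }
    }
}
```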
The bulk-indexer seems to sometimes stop indexing before
or
At this point the re-process hangs