elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

NullPointerException in TransportShardBulkAction #4224

Closed spinscale closed 10 years ago

spinscale commented 10 years ago

This happened on elasticsearch 0.90.6 with JVM 1.7.0_25

Found this in the logs; I cannot tell what triggered it. The only thing I know is that there were lots of index/search operations going on and there seems to have been some cluster instability.

[2013-XX-YY 08:45:22,518][DEBUG][action.bulk              ] [myNode] [logstash-2013.XX.YY][3], node[vXE6ojncQUG-foPsMjVY_w], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.bulk.BulkShardRequest@28c7fa4f]
java.lang.NullPointerException
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shards(TransportShardBulkAction.java:138)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shards(TransportShardBulkAction.java:75)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performReplicas(TransportShardReplicationOperationAction.java:610)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:557)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:426)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)

avleen commented 10 years ago

This just bit us tonight, too, somewhat out of the blue.

Immediately before this I see:

[2013-11-26 02:41:52,659][INFO ][discovery.zen            ] [myDataNode] master_left [[logstash01.ny4.etsy.com][swmFFvkEQHaBDyYtBSeSbA][inet[/ip.add.re.ss:9300]]{tag=archive, data=false, master=true}], reason [do not exists on master, act as master failure]

The master, and other nodes in the cluster, were just fine.

About 8 seconds later, it found the master again.

DenisUspenskiy commented 10 years ago

Hello, I also have the same problem. The exception stack trace is below:

[2013-12-24 14:56:53,082][DEBUG][action.bulk] [Trinity] [agentsmith][10], node[gYnN_-hxQly-2bctN2IVkg], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.bulk.BulkShardRequest@37daa067]
java.lang.NullPointerException
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shards(TransportShardBulkAction.java:138)
    at org.elasticsearch.action.bulk.TransportShardBulkAction.shards(TransportShardBulkAction.java:75)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performReplicas(TransportShardReplicationOperationAction.java:610)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:557)
    at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:426)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:724)

spinscale commented 10 years ago

@DenisUspenskiy did you have some cluster instability as well? Did you have master re-elections around that time? Can you reproduce it?

spinscale commented 10 years ago

Clinton managed to reproduce it in #4693:

This is reproducible by deleting an index, not waiting for the response, then trying to bulk index into that index (i.e. the requests were run in parallel):
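
A rough sketch of that sequence (not the original reproduction from #4693), assuming a disposable test node at `http://localhost:9200` and an index that already exists. The index name `npe-repro`, the `doc` type and the helper names are all hypothetical, and the `_type` field in the action line only applies to older versions such as the ones discussed here. The `send` parameter exists so the race can be exercised without a live cluster.

```python
import json
import threading
import urllib.error
import urllib.request

BASE_URL = "http://localhost:9200"  # assumption: a local, disposable test node
INDEX = "npe-repro"                 # hypothetical index name


def build_bulk_body(index, count):
    """Build an NDJSON bulk payload: one action line plus one source line per doc."""
    lines = []
    for i in range(count):
        lines.append(json.dumps({"index": {"_index": index, "_type": "doc", "_id": str(i)}}))
        lines.append(json.dumps({"field": i}))
    return "\n".join(lines) + "\n"  # the bulk API expects a trailing newline


def http_send(method, path, body=None):
    """Send one request to the node and return the HTTP status code."""
    data = body.encode("utf-8") if body is not None else None
    req = urllib.request.Request(BASE_URL + path, data=data, method=method)
    if data is not None:
        req.add_header("Content-Type", "application/x-ndjson")
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def race_delete_and_bulk(send=http_send, index=INDEX, docs=100):
    """Fire the delete and the bulk request concurrently; return both outcomes."""
    results = {}

    def run(name, method, path, body=None):
        try:
            results[name] = send(method, path, body)
        except Exception as exc:  # e.g. connection refused
            results[name] = exc

    threads = [
        threading.Thread(target=run, args=("delete", "DELETE", "/" + index)),
        threading.Thread(target=run, args=("bulk", "POST", "/_bulk", build_bulk_body(index, docs))),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because the outcome depends on timing, a single call to `race_delete_and_bulk()` may not hit the window; recreating the index and calling it in a loop while watching the node's log is more likely to show the trace.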

ashpynov commented 10 years ago

Same call stack here. It showed up in the log on a data node during bulk indexing while the master was being restarted. After this, some shards of the affected index on the data node became Unassigned and stayed that way, while the others are OK. (The index allocation rule targets only the data node; no replicas, 10 shards.) Restarting the data node does not help; only dropping the index does. Version is 0.90.7.

cdmicacc commented 10 years ago

I think we're seeing this, as well. We get it when we close an index (using ElasticSearch 1.1.1): I have a process that is reindexing to a new index using the bulk API. While that is happening, my live system is still writing to the old index. Eventually, the reindex completes and the alias is changed so that writes are directed to the new index. Just after that, I close the old index. ElasticSearch's logs get filled with this for a short time (presumably while the threadpools drain):

[2014-06-17 17:28:30,986][INFO ][cluster.metadata         ] [es11] closing indices [[idx-2014-06-12-11]]
[2014-06-17 17:29:02,154][DEBUG][action.bulk              ] [es11] [idx-2014-06-12-11][4], node[1r5Z85J8TM2e1Tp3KO3NAA], [P], s[STARTED]: Failed to execute [org.elasticsearch.action.bulk.BulkShardRequest@7e39e699]
java.lang.NullPointerException
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shards(TransportShardBulkAction.java:139)
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shards(TransportShardBulkAction.java:76)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performReplicas(TransportShardReplicationOperationAction.java:610)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction.performOnPrimary(TransportShardReplicationOperationAction.java:557)
        at org.elasticsearch.action.support.replication.TransportShardReplicationOperationAction$AsyncShardOperationAction$1.run(TransportShardReplicationOperationAction.java:426)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

clintongormley commented 10 years ago

This NPE has been fixed in recent versions. The bulk API can still fail briefly with an index-does-not-exist exception, but this should be fixed by #6790.

Closing
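
For clusters where the bulk API can still fail briefly in this way, one client-side option is to retry the request a few times. This is a minimal sketch of that idea, not something proposed in this thread: `bulk_with_retry`, `send_bulk` and `is_transient` are hypothetical names, and deciding which errors count as transient (for example, an index-missing error) is left to the caller.

```python
import time


def bulk_with_retry(send_bulk, is_transient, attempts=3, delay=0.5, sleep=time.sleep):
    """Call send_bulk() until it succeeds or fails with a non-transient error."""
    last_error = None
    for attempt in range(attempts):
        try:
            return send_bulk()
        except Exception as exc:
            if not is_transient(exc):
                raise  # anything else is reported immediately
            last_error = exc
            if attempt < attempts - 1:
                sleep(delay * (2 ** attempt))  # exponential backoff between tries
    raise last_error
```

Retrying only makes sense when the index is expected to exist again shortly; after a deliberate delete or close, the error is the correct answer and should be surfaced.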