elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.94k stars 24.74k forks source link

non fatal npe in warmer #8140

Closed jillesvangurp closed 10 years ago

jillesvangurp commented 10 years ago

I've seen the stacktrace below in our integration tests a couple of time. We're starting elasticsearch as an embedded node. The error appears to be non fatal since our integration tests pass anyway. I've seen this stacktrace twice in the past week but can't reproduce it reliably.

We are running our maven tests concurrently and in randomized order, so there are a lot of integration tests hitting our elasticsearch node all at once right after it starts and reports a green status.

Using elasticsearch 1.4.0 Beta1

17-10-2014T15:56:40+0200 W warmer - [test-node-gstJI] [inbot_users_v27][2] failed to load random access for [_type:usercontact] org.elasticsearch.common.util.concurrent.UncheckedExecutionException: java.lang.NullPointerException at org.elasticsearch.common.cache.LocalCache$Segment.get(LocalCache.java:2203) ~[elasticsearch-1.4.0.Beta1.jar:na] at org.elasticsearch.common.cache.LocalCache.get(LocalCache.java:3937) ~[elasticsearch-1.4.0.Beta1.jar:na] at org.elasticsearch.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4739) ~[elasticsearch-1.4.0.Beta1.jar:na] at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache.getAndLoadIfNotPresent(FixedBitSetFilterCache.java:132) ~[elasticsearch-1.4.0.Beta1.jar:na] at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache.access$100(FixedBitSetFilterCache.java:75) ~[elasticsearch-1.4.0.Beta1.jar:na] at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache$FixedBitSetFilterWarmer$1.run(FixedBitSetFilterCache.java:284) ~[elasticsearch-1.4.0.Beta1.jar:na] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [na:1.8.0] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0] at java.lang.Thread.run(Thread.java:744) [na:1.8.0] Caused by: java.lang.NullPointerException: null at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache$2.call(FixedBitSetFilterCache.java:157) ~[elasticsearch-1.4.0.Beta1.jar:na] at org.elasticsearch.index.cache.fixedbitset.FixedBitSetFilterCache$2.call(FixedBitSetFilterCache.java:132) ~[elasticsearch-1.4.0.Beta1.jar:na] at org.elasticsearch.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4742) ~[elasticsearch-1.4.0.Beta1.jar:na] at org.elasticsearch.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3527) ~[elasticsearch-1.4.0.Beta1.jar:na] at org.elasticsearch.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2319) ~[elasticsearch-1.4.0.Beta1.jar:na] at org.elasticsearch.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[elasticsearch-1.4.0.Beta1.jar:na] at org.elasticsearch.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[elasticsearch-1.4.0.Beta1.jar:na] ... 8 common frames omitted

clintongormley commented 10 years ago

thanks for reporting @jillesvangurp - we'll take a look

martijnvg commented 10 years ago

@jillesvangurp I think I found a small issue where there is a small window of time that a field is unset, but the warmer that needs is running. In order to confirm this can you share how you start the embedded node and run your test? (for example you wait for green status before running?) Sharing code snippets how the node is brought up before the tests run would even be more helpful.

jillesvangurp commented 10 years ago

Sure no problem:

            String defaultIndexDirectory = "target/data-"+UUID.randomUUID().toString();
            String indexDir = config.getString("estestserver.indexdir",defaultIndexDirectory);
            String logDir = config.getString("estestserver.logdir",defaultIndexDirectory + "/logs");
            String esPort = config.getString("estestserver.port","9299");
            File file = new File(defaultIndexDirectory);
            file.mkdirs();
            LOG.info("using " + file.getAbsolutePath() + " for es data and logs");
            Settings settings = ImmutableSettings.settingsBuilder()
                    .put("name", "test-node-"+RandomStringUtils.randomAlphabetic(5))
                    .put("cluster.name", "linko-dev-cluster-"+RandomStringUtils.randomAlphabetic(5))
                    .put("index.gateway.type", "none")
                    .put("gateway.type", "none")
                    .put("discovery.zen.ping.multicast.ping.enabled", "false")
                    .put("discovery.zen.ping.multicast.enabled", "false")
                    .put("path.data", indexDir)
                    .put("path.logs", logDir)
                    .put("foreground", "true")
                    .put("http.port", esPort)
                    .put("http.cors.enabled", "true")
                    .put("http.cors.allow-origin","/https?:\\/\\/(localhost|kibana.*\\.linko\\.io)(:[0-9]+)?/")
                    .build();

            LOG.info(settings.toDelimitedString(';'));

            NodeBuilder nodeBuilder = NodeBuilder.nodeBuilder()
                    .settings(settings)
                    .loadConfigSettings(false);
            node = nodeBuilder
                    .build();

            // register a shutdown hook
            Runtime.getRuntime().addShutdownHook(new Thread() {
                @Override
                public void run() {
                    node.close();
                }
            });
            node.start();

            // wait until the shards are ready
            node.client().admin().cluster().prepareHealth().setWaitForGreenStatus().execute().actionGet();
jillesvangurp commented 10 years ago

One additional bit of information that I just realized may be relevant here is that we have a parent child relation between user and usercontact. So, the exception is happening when it is doing something with the child type. For reference, here's a gist with the full mapping for the index: https://gist.github.com/jillesvangurp/d0cd29573b876f9cc4d3

jillesvangurp commented 10 years ago

Also, we're using testNg and surefire. The elasticsearch node is started in a @BeforeSuite. We use a very large threadcount and randomized order to surface any issues related to inter test dependencies and stability of our system. This pretty much means all our integration test classes are starting at the same time. We generally use randomized test data and there are a lot of calls to /_refresh to ensure indices are committed in each tests.

<plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <configuration>
                    <parallel>classes</parallel>
                    <threadCount>50</threadCount>
                    <runOrder>random</runOrder>
                    <argLine>-Xms1024m -Xmx2048m</argLine>
                    <properties>
                        <property>
                            <name>listener</name>
                            <value>io.linko.ng.testutil.TestProgressLogger</value>
                        </property>
                    </properties>
                </configuration>
            </plugin>
martijnvg commented 10 years ago

Thanks for the provided information @jillesvangurp

I opened: #8168 for this issue. Are you able to verify if the non fatal NPE doesn't occur any more with this fix in your test infrastructure?

jillesvangurp commented 10 years ago

No problem. Unfortunately, most of our builds don't trigger this exception; so it is a bit hard for me to confirm. I've only spotted it twice out of dozens of test runs over the past week. If you issue another beta, I'll be able to depend on that at least.