elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[CI] Failure in ml.integration.RegressionIT.testStopAndRestart #47612

Closed talevy closed 4 years ago

talevy commented 5 years ago

Something has triggered this test to fail 14 times in CI, starting on October 2.

Reproduction step:

./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.RegressionIT.testStopAndRestart" -Dtests.seed=6C78F050FC5C2A2

Failed with stacktrace:

java.lang.AssertionError: 

Expected: <stopped>
     but: was <failed>

[2019-10-04T16:15:56,253][ERROR][o.e.x.m.i.RegressionIT   ] [testStopAndRestart] Failed to stop data frame analytics jobs; trying force
org.elasticsearch.ElasticsearchStatusException: cannot close data frame analytics [regression_stop_and_restart] because it failed, use force stop instead
    at org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.conflictStatusException(ExceptionsHelper.java:59) ~[x-pack-core-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.xpack.ml.action.TransportStopDataFrameAnalyticsAction.findAnalyticsToStop(TransportStopDataFrameAnalyticsAction.java:127) ~[x-pack-ml-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.xpack.ml.action.TransportStopDataFrameAnalyticsAction.lambda$doExecute$1(TransportStopDataFrameAnalyticsAction.java:102) ~[x-pack-ml-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.xpack.ml.action.TransportStopDataFrameAnalyticsAction.lambda$expandIds$4(TransportStopDataFrameAnalyticsAction.java:173) ~[x-pack-ml-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
    at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]

elasticmachine commented 5 years ago

Pinging @elastic/ml-core (:ml)

dimitris-athanasiou commented 5 years ago

@talevy Could you please add links to CI and the build scans?

droberts195 commented 5 years ago

Here’s one example:

Jenkins: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+g1gc/362/console
Scan: https://gradle-enterprise.elastic.co/s/bdbipsekpfijw

droberts195 commented 5 years ago

Looks like it’s due to:

18:45:24 » ERROR][o.e.x.m.p.AbstractNativeProcess] [integTest-1] [regression_stop_and_restart] analytics process stopped unexpectedly: Input error: expected no more than '232' rows but got '350' rows. Please report this problem.

dimitris-athanasiou commented 5 years ago

This is really unexpected. I'll look into how this could happen.

droberts195 commented 5 years ago

It just failed again, but on a different assertion and with no message about the C++ process exiting early (unless I missed it, which is possible as I looked on a phone).

Details are:

dimitris-athanasiou commented 5 years ago

I'll mute it for now.

dimitris-athanasiou commented 5 years ago

I have understood why these failures may happen.

If the stop occurs right after reindexing finishes but before we refresh the destination index, the refresh never happens. The job is then started again and jumps straight into the analyzing state, while some of the data is still not searchable. This is why the process is started expecting X rows (where X is lower than the actual number of docs) and ends up receiving more than X.

The fix is quite simple. There is no reason to perform the refresh at that point in the code. We can instead do it in AnalyticsProcessManager before starting the process, which ensures the dest index is fully searchable before the job starts running. I'm preparing a PR for it.
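
To make the race described above concrete, here is a minimal, self-contained sketch. It does not use Elasticsearch or the real AnalyticsProcessManager: `ToyIndex`, `runAnalysis` and `partiallyRefreshedIndex` are hypothetical names invented for illustration. The 232/350 split is borrowed from the error message earlier in this thread; how that split arose in the real run is an assumption. The sketch only models the idea that docs become visible on refresh, and that counting rows before a refresh can under-report what is later streamed.

```java
import java.util.ArrayList;
import java.util.List;

public class RefreshBeforeAnalysisSketch {

    /** Hypothetical stand-in for a destination index: docs are searchable only after refresh(). */
    static final class ToyIndex {
        private final List<String> searchable = new ArrayList<>();
        private final List<String> pending = new ArrayList<>();

        void index(String doc) {
            pending.add(doc);
        }

        void refresh() {
            searchable.addAll(pending);
            pending.clear();
        }

        int searchableCount() {
            return searchable.size();
        }
    }

    /**
     * Models one analysis run: count the rows, then stream them.
     * A refresh is assumed to land between the two steps, as a background refresh could.
     */
    static String runAnalysis(ToyIndex dest, boolean refreshFirst) {
        if (refreshFirst) {
            dest.refresh(); // the proposed fix: make everything searchable before counting
        }
        int expectedRows = dest.searchableCount();
        dest.refresh(); // simulated background refresh between counting and streaming
        int actualRows = dest.searchableCount();
        if (actualRows > expectedRows) {
            return "failed: expected no more than " + expectedRows + " rows but got " + actualRows;
        }
        return "ok: " + actualRows + " rows";
    }

    /** Builds an index where 232 of 350 docs are searchable and the rest await a refresh. */
    static ToyIndex partiallyRefreshedIndex() {
        ToyIndex dest = new ToyIndex();
        for (int i = 0; i < 232; i++) {
            dest.index("doc-" + i);
        }
        dest.refresh();
        for (int i = 232; i < 350; i++) {
            dest.index("doc-" + i);
        }
        return dest;
    }

    public static void main(String[] args) {
        System.out.println(runAnalysis(partiallyRefreshedIndex(), false));
        System.out.println(runAnalysis(partiallyRefreshedIndex(), true));
    }
}
```

Without the up-front refresh the toy run reports the same shape of error as the CI log; with it, the counted and streamed row numbers agree.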

DaveCTurner commented 5 years ago

This still seems to be failing, in master, 7.x and 7.5:

dimitris-athanasiou commented 5 years ago

This seems to be a different failure:

org.elasticsearch.ElasticsearchException: Failed to launch data frame analytics memory usage estimation process for job regression_stop_and_restart
Caused by: java.io.FileNotFoundException: /dev/shm/elastic+elasticsearch+7.5+matrix-java-periodic/ES_BUILD_JAVA/openjdk12/ES_RUNTIME_JAVA/corretto11/nodes/general-purpose/x-pack/plugin/ml/qa/native-multi-node-tests/build/testclusters/integTest-1/tmp/data_frame_analyzer_regression_stop_and_restart_output_360144 (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:219)
at java.io.FileInputStream.<init>(FileInputStream.java:157)
at java.io.FileInputStream.<init>(FileInputStream.java:112)
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:288)
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:277)
at java.security.AccessController.doPrivileged(Native Method)
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:130)
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:97)
at org.elasticsearch.xpack.ml.process.ProcessPipes.connectStreams(ProcessPipes.java:141)
at org.elasticsearch.xpack.ml.dataframe.process.NativeMemoryUsageEstimationProcessFactory.createNativeProcess(NativeMemoryUsageEstimationProcessFactory.java:100)
at org.elasticsearch.xpack.ml.dataframe.process.NativeMemoryUsageEstimationProcessFactory.createAnalyticsProcess(NativeMemoryUsageEstimationProcessFactory.java:67)
at org.elasticsearch.xpack.ml.dataframe.process.NativeMemoryUsageEstimationProcessFactory.createAnalyticsProcess(NativeMemoryUsageEstimationProcessFactory.java:34)
at org.elasticsearch.xpack.ml.dataframe.process.MemoryUsageEstimationProcessManager.runJob(MemoryUsageEstimationProcessManager.java:77)
at org.elasticsearch.xpack.ml.dataframe.process.MemoryUsageEstimationProcessManager.lambda$runJobAsync$0(MemoryUsageEstimationProcessManager.java:47)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:834)

@przemekwitek Could you take a look at this?

przemekwitek commented 5 years ago

> @przemekwitek Could you take a look at this?

Sure

dnhatn commented 4 years ago

Another instance https://gradle-enterprise.elastic.co/s/dakhdcri2celg/tests/zcquf3hc3eoda-q3ovxh6ju37ic

przemekwitek commented 4 years ago

> Another instance https://gradle-enterprise.elastic.co/s/dakhdcri2celg/tests/zcquf3hc3eoda-q3ovxh6ju37ic

This failure is likely unrelated as it affects outlier detection analysis, not regression analysis.

przemekwitek commented 4 years ago

A number of fixes were applied and the last CI failure of this test was a week ago. https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-30d,mode:quick,to:now))&_a=(columns:!(_source),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:e58bf320-7efd-11e8-bf69-63c8ef516157,key:branch,negate:!t,params:(query:move-jobs,type:phrase),type:phrase,value:move-jobs),query:(match:(branch:(query:move-jobs,type:phrase))))),index:e58bf320-7efd-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:'%22RegressionIT%20testStopAndRestart%22'),sort:!(time,desc))

Closing this issue for now.