Closed talevy closed 4 years ago
Pinging @elastic/ml-core (:ml)
@talevy Could you please add links to CI and the build scans?
Here’s one example:
Jenkins: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+g1gc/362/console
Scan: https://gradle-enterprise.elastic.co/s/bdbipsekpfijw
Looks like it’s due to:
18:45:24 » ERROR][o.e.x.m.p.AbstractNativeProcess] [integTest-1] [regression_stop_and_restart] analytics process stopped unexpectedly: Input error: expected no more than '232' rows but got '350' rows. Please report this problem.
This is really unexpected. I'll look into how this could happen.
It just failed again, but on a different assertion and no message about the C++ process exiting early (unless I missed it - possible as I looked on a phone).
Details are:
I'll mute it for now.
I have understood why these failures may happen.
If the job is stopped right after reindexing finishes but before we refresh the destination index, the refresh never happens. The job is then restarted and jumps straight into the analyzing state while some of the data is still not searchable. This is why the process starts out expecting X rows (where X is lower than the actual number of docs) and ends up receiving more than X.
The fix is actually quite simple. There is no reason to perform the refresh at that point in the code. We can instead do it in AnalyticsProcessManager before starting the process, which ensures the dest index is fully searchable before the job starts running. I'm preparing a PR for it.
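To illustrate the race described above, here is a minimal sketch using a toy in-memory index. It is not Elasticsearch code: `ToyIndex`, `reindexThenStopBeforeRefresh` and `expectedRows` are hypothetical names made up for this example, and the 232/350 figures are simply borrowed from the error message earlier in this thread. It only shows why counting rows before a refresh under-reports what the process will later receive.

```java
import java.util.ArrayList;
import java.util.List;

class RefreshBeforeAnalyzeSketch {

    /** Toy index (hypothetical): documents become searchable only after refresh(). */
    static class ToyIndex {
        private final List<String> indexed = new ArrayList<>(); // every doc written so far
        private int searchableCount = 0;                        // docs visible to search

        void index(String doc) {
            indexed.add(doc);
        }

        /** Makes every document written so far visible to search. */
        void refresh() {
            searchableCount = indexed.size();
        }

        /** Number of documents a search would see right now. */
        int searchableDocs() {
            return searchableCount;
        }

        /** Number of documents actually written, searchable or not. */
        int totalDocs() {
            return indexed.size();
        }
    }

    /**
     * Simulates reindexing that is stopped after all docs are written but
     * before the final refresh: only the first `searchable` docs are visible.
     */
    static ToyIndex reindexThenStopBeforeRefresh(int searchable, int total) {
        ToyIndex dest = new ToyIndex();
        for (int i = 0; i < searchable; i++) {
            dest.index("doc-" + i);
        }
        dest.refresh(); // an earlier refresh made the first batch visible
        for (int i = searchable; i < total; i++) {
            dest.index("doc-" + i);
        }
        // The job is stopped here, so the final refresh never runs.
        return dest;
    }

    /** Row count the process would be told to expect when the job (re)starts. */
    static int expectedRows(ToyIndex dest, boolean refreshFirst) {
        if (refreshFirst) {
            dest.refresh(); // the proposed fix: refresh before starting the process
        }
        return dest.searchableDocs();
    }

    public static void main(String[] args) {
        ToyIndex stale = reindexThenStopBeforeRefresh(232, 350);
        System.out.println("no refresh: expect " + expectedRows(stale, false)
                + " rows, index holds " + stale.totalDocs());

        ToyIndex fresh = reindexThenStopBeforeRefresh(232, 350);
        System.out.println("refresh first: expect " + expectedRows(fresh, true)
                + " rows, index holds " + fresh.totalDocs());
    }
}
```

In the toy model, refreshing right before counting makes the expected row count match what the index actually holds, which is the effect the proposed change is meant to have.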
This still seems to be failing in master, 7.x and 7.5:
This seems to be a different failure:
org.elasticsearch.ElasticsearchException: Failed to launch data frame analytics memory usage estimation process for job regression_stop_and_restart
Caused by: java.io.FileNotFoundException: /dev/shm/elastic+elasticsearch+7.5+matrix-java-periodic/ES_BUILD_JAVA/openjdk12/ES_RUNTIME_JAVA/corretto11/nodes/general-purpose/x-pack/plugin/ml/qa/native-multi-node-tests/build/testclusters/integTest-1/tmp/data_frame_analyzer_regression_stop_and_restart_output_360144 (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
at java.io.FileInputStream.open(FileInputStream.java:219)
at java.io.FileInputStream.<init>(FileInputStream.java:157)
at java.io.FileInputStream.<init>(FileInputStream.java:112)
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:288)
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper$PrivilegedInputPipeOpener.run(NamedPipeHelper.java:277)
at java.security.AccessController.doPrivileged(Native Method)
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:130)
at org.elasticsearch.xpack.ml.utils.NamedPipeHelper.openNamedPipeInputStream(NamedPipeHelper.java:97)
at org.elasticsearch.xpack.ml.process.ProcessPipes.connectStreams(ProcessPipes.java:141)
at org.elasticsearch.xpack.ml.dataframe.process.NativeMemoryUsageEstimationProcessFactory.createNativeProcess(NativeMemoryUsageEstimationProcessFactory.java:100)
at org.elasticsearch.xpack.ml.dataframe.process.NativeMemoryUsageEstimationProcessFactory.createAnalyticsProcess(NativeMemoryUsageEstimationProcessFactory.java:67)
at org.elasticsearch.xpack.ml.dataframe.process.NativeMemoryUsageEstimationProcessFactory.createAnalyticsProcess(NativeMemoryUsageEstimationProcessFactory.java:34)
at org.elasticsearch.xpack.ml.dataframe.process.MemoryUsageEstimationProcessManager.runJob(MemoryUsageEstimationProcessManager.java:77)
at org.elasticsearch.xpack.ml.dataframe.process.MemoryUsageEstimationProcessManager.lambda$runJobAsync$0(MemoryUsageEstimationProcessManager.java:47)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:834)
@przemekwitek Could you take a look at this?
Sure
Another instance https://gradle-enterprise.elastic.co/s/dakhdcri2celg/tests/zcquf3hc3eoda-q3ovxh6ju37ic
This failure is likely unrelated as it affects outlier detection analysis, not regression analysis.
A number of fixes were applied and the last CI failure of this test was a week ago. https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-30d,mode:quick,to:now))&_a=(columns:!(_source),filters:!(('$state':(store:appState),meta:(alias:!n,disabled:!f,index:e58bf320-7efd-11e8-bf69-63c8ef516157,key:branch,negate:!t,params:(query:move-jobs,type:phrase),type:phrase,value:move-jobs),query:(match:(branch:(query:move-jobs,type:phrase))))),index:e58bf320-7efd-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:'%22RegressionIT%20testStopAndRestart%22'),sort:!(time,desc))
Closing this issue for now.
Something has triggered this test to fail 14 times in CI, starting on October 2.
reproduction step:
Failed with stacktrace: