elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
69.91k stars 24.73k forks source link

[CI] Failure in ml.integration.RegressionIT.testStopAndRestart assertion #49095

Closed przemekwitek closed 4 years ago

przemekwitek commented 4 years ago

After I unmuted the test, there were 2 CI failures: https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=debian-8&&immutable/389/console https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=amazon/389/console

With small changes in the code I was able to reproduce the issue locally today. Here is the test log:

Suite: Test class org.elasticsearch.xpack.ml.integration.RegressionIT
  2> REPRODUCE WITH: ./gradlew ':x-pack:plugin:ml:qa:native-multi-node-tests:integTestRunner' --tests "org.elasticsearch.xpack.ml.integration.RegressionIT.testStopAndRestart" -Dtests.seed=2454927CB657C116 -Dtests.security.manager=true -Dtests.locale=th -Dtests.timezone=America/Indiana/Vevay -Dcompiler.java=12 -Druntime.java=12
  2> java.lang.AssertionError: Hits were: {"hits":{"total":{"value":2,"relation":"eq"},"max_score":0.0,"hits":[{"_index":".ml-inference-000001","_id":"regression_stop_and_restart-1573742320992","_score":0.0,"_source":{"model_id":"regression_stop_and_restart-1573742320992","created_by":"data-frame-analytics","version":"8.0.0","create_time":1573742320992,"tags":["regression_stop_and_restart"],"metadata":{"analytics_config":{"id":"regression_stop_and_restart","source":{"index":["regression_stop_and_restart_source_index"],"query":{"match_all":{}}},"dest":{"index":"regression_stop_and_restart_source_index_results","results_field":"ml"},"analysis":{"regression":{"dependent_variable":"variable","prediction_field_name":"variable_prediction","training_percent":100.0}},"model_memory_limit":"1gb","create_time":1573742309034,"version":"8.0.0","allow_lazy_start":false}},"doc_type":"trained_model_config","input":{"field_names":["feature","variable"]}}},{"_index":".ml-inference-000001","_id":"regression_stop_and_restart-1573742322238","_score":0.0,"_source":{"model_id":"regression_stop_and_restart-1573742322238","created_by":"data-frame-analytics","version":"8.0.0","create_time":1573742322238,"tags":["regression_stop_and_restart"],"metadata":{"analytics_config":{"id":"regression_stop_and_restart","source":{"index":["regression_stop_and_restart_source_index"],"query":{"match_all":{}}},"dest":{"index":"regression_stop_and_restart_source_index_results","results_field":"ml"},"analysis":{"regression":{"dependent_variable":"variable","prediction_field_name":"variable_prediction","training_percent":100.0}},"model_memory_limit":"1gb","create_time":1573742309034,"version":"8.0.0","allow_lazy_start":false}},"doc_type":"trained_model_config","input":{"field_names":["feature","variable"]}}}]}}
    Expected: an array with size <1>
         but: array size was <2>
        at __randomizedtesting.SeedInfo.seed([2454927CB657C116:531C1082A2BDC7B6]:0)
        at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
        at org.junit.Assert.assertThat(Assert.java:956)
        at org.elasticsearch.xpack.ml.integration.MlNativeDataFrameAnalyticsIntegTestCase.assertInferenceModelPersisted(MlNativeDataFrameAnalyticsIntegTestCase.java:200)
        at org.elasticsearch.xpack.ml.integration.RegressionIT.testStopAndRestart(RegressionIT.java:309)
  2> NOTE: leaving temporary files on disk at: /Users/witek/github/elastic/elasticsearch/x-pack/plugin/ml/qa/native-multi-node-tests/build/testrun/integTestRunner/temp/org.elasticsearch.xpack.ml.integration.RegressionIT_2454927CB657C116-001
  2> Nov 14, 2019 3:38:51 PM com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
  2> WARNING: Will linger awaiting termination of 1 leaked thread(s).
  2> NOTE: test params are: codec=Lucene80, sim=Asserting(org.apache.lucene.search.similarities.AssertingSimilarity@67df4402), locale=th, timezone=America/Indiana/Vevay
  2> NOTE: Mac OS X 10.14.5 x86_64/Oracle Corporation 12 (64-bit)/cpus=8,threads=1,free=226957824,total=536870912
  2> NOTE: All tests run in this JVM: [RegressionIT]
elasticmachine commented 4 years ago

Pinging @elastic/ml-core (:ml)

przemekwitek commented 4 years ago

There is interesting stuff happening in the logs (see below). It looks like despite the job being stopped, the result processor kept working. After a start call, another processor was created and both stored inference model. I'm going to investigate this more thoroughly now.

» [2019-11-14T15:38:32,616][INFO ][o.e.x.m.a.TransportStartDataFrameAnalyticsAction] [integTest-0] [regression_stop_and_restart] Starting data frame analytics
» [2019-11-14T15:38:34,650][INFO ][o.e.x.m.d.DataFrameAnalyticsManager] [integTest-0] [regression_stop_and_restart] Creating destination index [regression_stop_and_restart_source_index_results]
» [2019-11-14T15:38:37,445][INFO ][o.e.x.m.a.TransportStopDataFrameAnalyticsAction] [integTest-0] [regression_stop_and_restart] Stopping task with force [false]
» [2019-11-14T15:38:38,100][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Waiting for result processor to complete
» [2019-11-14T15:38:39,027][INFO ][o.e.x.m.a.TransportStartDataFrameAnalyticsAction] [integTest-0] [regression_stop_and_restart] Starting data frame analytics
» [2019-11-14T15:38:40,498][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Waiting for result processor to complete
» [2019-11-14T15:38:42,482][INFO ][o.e.x.m.d.p.AnalyticsResultProcessor] [integTest-0] [regression_stop_and_restart] Stored trained model with id [regression_stop_and_restart-1573742320992]
» [2019-11-14T15:38:42,671][INFO ][o.e.x.m.d.p.AnalyticsResultProcessor] [integTest-0] [regression_stop_and_restart] Stored trained model with id [regression_stop_and_restart-1573742322238]
» [2019-11-14T15:38:43,767][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Result processor has completed
» [2019-11-14T15:38:43,767][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Result processor has completed
» [2019-11-14T15:38:43,769][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Closing process
» [2019-11-14T15:38:43,769][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Closing process
» [2019-11-14T15:38:43,770][INFO ][o.e.x.m.p.AbstractNativeProcess] [integTest-0] [regression_stop_and_restart] State output finished
» [2019-11-14T15:38:43,769][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [integTest-0] [regression_stop_and_restart] [data_frame_analyzer/97981] [Main.cc@226] [{"name":"E_DFTPMEstimatedPeakMemoryUsage","description":"The upfront estimate of the peak memory training the predictive model would use","value":3384615}
»  ,{"name":"E_DFTPMPeakMemoryUsage","description":"The peak memory training the predictive model used","value":7092}
»  ,{"name":"E_DFTPMTimeToTrain","description":"The time it took to train the predictive model","value":1685}
»  ,{"name":"E_DFTPMTrainedForestNumberTrees","description":"The total number of trees in the trained forest","value":2}
»  ]
» [2019-11-14T15:38:43,769][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [integTest-0] [regression_stop_and_restart] [data_frame_analyzer/97979] [Main.cc@226] [{"name":"E_DFTPMEstimatedPeakMemoryUsage","description":"The upfront estimate of the peak memory training the predictive model would use","value":3384615}
»  ,{"name":"E_DFTPMPeakMemoryUsage","description":"The peak memory training the predictive model used","value":21881}
»  ,{"name":"E_DFTPMTimeToTrain","description":"The time it took to train the predictive model","value":2770}
»  ,{"name":"E_DFTPMTrainedForestNumberTrees","description":"The total number of trees in the trained forest","value":12}
»  ]
» [2019-11-14T15:38:43,769][INFO ][o.e.x.m.p.AbstractNativeProcess] [integTest-0] [regression_stop_and_restart] State output finished
» [2019-11-14T15:38:43,771][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Closed process
» [2019-11-14T15:38:43,771][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Closed process
» [2019-11-14T15:38:43,771][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Marking task completed
» [2019-11-14T15:38:43,771][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Marking task completed
przemekwitek commented 4 years ago

I've just raised a PR (https://github.com/elastic/elasticsearch/pull/49167) to mute the problematic test as I've found it reproducible locally so no need to pollute CI with failures.

przemekwitek commented 4 years ago

I've found out that there can be a situation in which AnalyticsProcessManager.processContextByAllocation map contains two process context for the same job (both with 2 distinct keys - allocation ids). In this situation two result processors can run at the same time which causes them both to persist results.

I'll try to figure out how to fix this. Probably _stop should be fixed to cancel result processor or at least wait for it. Possibly, we should also throw/warn if we find out on _start that the process context for the job already exists (but then how do we know if it is the same job or the new job with the same name?). Anyway, this issue needs more thorough investigation.

dimitris-athanasiou commented 4 years ago

That's interesting. I'll have a look as well and we can discuss together Monday.

przemekwitek commented 4 years ago

Looking at CI build stats, the assertion that was failing before, did not fail on master and 7.x branches after #49282 was merged in. The only failed assertions are in move-jobs branch.

https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-30d,mode:quick,to:now))&_a=(columns:!(_source),filters:!(),index:e58bf320-7efd-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:'%22RegressionIT%20testStopAndRestart%22'),sort:!(time,desc))