przemekwitek closed this issue 4 years ago
Pinging @elastic/ml-core (:ml)
There is interesting stuff happening in the logs (see below). It looks like, despite the job being stopped, the result processor kept working. After a subsequent start call, another result processor was created and both stored an inference model. I'm going to investigate this more thoroughly now.
» [2019-11-14T15:38:32,616][INFO ][o.e.x.m.a.TransportStartDataFrameAnalyticsAction] [integTest-0] [regression_stop_and_restart] Starting data frame analytics
» [2019-11-14T15:38:34,650][INFO ][o.e.x.m.d.DataFrameAnalyticsManager] [integTest-0] [regression_stop_and_restart] Creating destination index [regression_stop_and_restart_source_index_results]
» [2019-11-14T15:38:37,445][INFO ][o.e.x.m.a.TransportStopDataFrameAnalyticsAction] [integTest-0] [regression_stop_and_restart] Stopping task with force [false]
» [2019-11-14T15:38:38,100][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Waiting for result processor to complete
» [2019-11-14T15:38:39,027][INFO ][o.e.x.m.a.TransportStartDataFrameAnalyticsAction] [integTest-0] [regression_stop_and_restart] Starting data frame analytics
» [2019-11-14T15:38:40,498][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Waiting for result processor to complete
» [2019-11-14T15:38:42,482][INFO ][o.e.x.m.d.p.AnalyticsResultProcessor] [integTest-0] [regression_stop_and_restart] Stored trained model with id [regression_stop_and_restart-1573742320992]
» [2019-11-14T15:38:42,671][INFO ][o.e.x.m.d.p.AnalyticsResultProcessor] [integTest-0] [regression_stop_and_restart] Stored trained model with id [regression_stop_and_restart-1573742322238]
» [2019-11-14T15:38:43,767][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Result processor has completed
» [2019-11-14T15:38:43,767][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Result processor has completed
» [2019-11-14T15:38:43,769][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Closing process
» [2019-11-14T15:38:43,769][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Closing process
» [2019-11-14T15:38:43,770][INFO ][o.e.x.m.p.AbstractNativeProcess] [integTest-0] [regression_stop_and_restart] State output finished
» [2019-11-14T15:38:43,769][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [integTest-0] [regression_stop_and_restart] [data_frame_analyzer/97981] [Main.cc@226] [{"name":"E_DFTPMEstimatedPeakMemoryUsage","description":"The upfront estimate of the peak memory training the predictive model would use","value":3384615}
» ,{"name":"E_DFTPMPeakMemoryUsage","description":"The peak memory training the predictive model used","value":7092}
» ,{"name":"E_DFTPMTimeToTrain","description":"The time it took to train the predictive model","value":1685}
» ,{"name":"E_DFTPMTrainedForestNumberTrees","description":"The total number of trees in the trained forest","value":2}
» ]
» [2019-11-14T15:38:43,769][INFO ][o.e.x.m.p.l.CppLogMessageHandler] [integTest-0] [regression_stop_and_restart] [data_frame_analyzer/97979] [Main.cc@226] [{"name":"E_DFTPMEstimatedPeakMemoryUsage","description":"The upfront estimate of the peak memory training the predictive model would use","value":3384615}
» ,{"name":"E_DFTPMPeakMemoryUsage","description":"The peak memory training the predictive model used","value":21881}
» ,{"name":"E_DFTPMTimeToTrain","description":"The time it took to train the predictive model","value":2770}
» ,{"name":"E_DFTPMTrainedForestNumberTrees","description":"The total number of trees in the trained forest","value":12}
» ]
» [2019-11-14T15:38:43,769][INFO ][o.e.x.m.p.AbstractNativeProcess] [integTest-0] [regression_stop_and_restart] State output finished
» [2019-11-14T15:38:43,771][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Closed process
» [2019-11-14T15:38:43,771][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Closed process
» [2019-11-14T15:38:43,771][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Marking task completed
» [2019-11-14T15:38:43,771][INFO ][o.e.x.m.d.p.AnalyticsProcessManager] [integTest-0] [regression_stop_and_restart] Marking task completed
I've just raised a PR (https://github.com/elastic/elasticsearch/pull/49167) to mute the problematic test. Since I've found it reproducible locally, there is no need to pollute CI with failures.
I've found out that there can be a situation in which the AnalyticsProcessManager.processContextByAllocation
map contains two process contexts for the same job (under two distinct keys, i.e. allocation ids). In this situation two result processors can run at the same time, which causes them both to persist results.
I'll try to figure out how to fix this. Probably _stop
should be fixed to cancel the result processor, or at least to wait for it. Possibly, we should also throw/warn if we find out on _start
that a process context for the job already exists (but then how do we know whether it is the same job or a new job with the same name?).
Anyway, this issue needs more thorough investigation.
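To illustrate the suspected race, here is a minimal, self-contained sketch (all class and method names are hypothetical, not the actual Elasticsearch code). Because the context map is keyed by allocation id, a restart of the same job under a new allocation id can coexist with the old, not-yet-removed context, so two result processors run side by side; a guard that checks for an existing context by job id on start would reject the second one:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ProcessContextSketch {

    static class ProcessContext {
        final String jobId;
        ProcessContext(String jobId) { this.jobId = jobId; }
    }

    // Keyed by allocation id, mirroring processContextByAllocation in the issue.
    static final Map<Long, ProcessContext> processContextByAllocation = new ConcurrentHashMap<>();

    // Unsafe start: never checks whether a context for this job already exists,
    // so a restart under a new allocation id creates a second context.
    static void startUnsafe(long allocationId, String jobId) {
        processContextByAllocation.put(allocationId, new ProcessContext(jobId));
    }

    // Guarded start: refuses to create a second context for the same job id.
    static boolean startGuarded(long allocationId, String jobId) {
        boolean alreadyRunning = processContextByAllocation.values().stream()
            .anyMatch(ctx -> ctx.jobId.equals(jobId));
        if (alreadyRunning) {
            return false; // caller should fail/warn, or wait for the old context to be removed
        }
        processContextByAllocation.put(allocationId, new ProcessContext(jobId));
        return true;
    }

    public static void main(String[] args) {
        // Stop has not yet cleaned up allocation 1 when the job restarts as allocation 2.
        startUnsafe(1L, "regression_stop_and_restart");
        startUnsafe(2L, "regression_stop_and_restart");
        System.out.println("unsafe contexts for one job: " + processContextByAllocation.size()); // 2

        processContextByAllocation.clear();
        startGuarded(1L, "regression_stop_and_restart");
        boolean started = startGuarded(2L, "regression_stop_and_restart");
        System.out.println("guarded second start accepted: " + started); // false
    }
}
```

This sketch only shows the shape of the problem; it does not settle the open question above of distinguishing a restarted job from a new job that happens to reuse the same name.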
That's interesting. I'll have a look as well and we can discuss together Monday.
Looking at CI build stats, the assertion that was failing before did not fail on the master
and 7.x
branches after #49282 was merged in.
The only failed assertions are on the move-jobs
branch.
After I unmuted the test, there were 2 CI failures:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=debian-8&&immutable/389/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=amazon/389/console
With small changes in the code I was able to reproduce the issue locally today. Here is the test log: