Closed dimitris-athanasiou closed 4 years ago
Pinging @elastic/ml-core (:ml)
Same thing in https://gradle-enterprise.elastic.co/s/ouvoly2gv5m4a
org.elasticsearch.client.ResponseException: method [DELETE], host [http://127.0.0.1:54989], URI [/_ml/data_frame/analytics/bar], status line [HTTP/1.1 503 Service Unavailable]
{"error":{"root_cause":[],"type":"search_phase_execution_exception","reason":"all shards failed","phase":"query","grouped":true,"failed_shards":[]},"status":503}
This happens in the cleanup code after the test completes, in MlRestTestStateCleaner.deleteAllDataFrameAnalytics()
.
Another occurrence: https://gradle-enterprise.elastic.co/s/ouvoly2gv5m4a
The server side log for that last failure shows this:
[2020-07-31T14:32:42,487][INFO ][o.e.x.c.m.u.MlIndexAndAlias] [integTest-0] About to create first concrete index [.ml-stats-000001] with alias [.ml-stats-write]
[2020-07-31T14:32:42,493][INFO ][o.e.c.m.MetadataCreateIndexService] [integTest-0] [.ml-stats-000001] creating index, cause [api], templates [.ml-stats], shards [1]/[1]
[2020-07-31T14:32:42,493][INFO ][o.e.c.r.a.AllocationService] [integTest-0] updating number_of_replicas to [0] for indices [.ml-stats-000001]
[2020-07-31T14:32:42,494][DEBUG][o.e.c.c.PublicationTransportHandler] [integTest-0] received diff cluster state version [5683] with uuid [OHsvjifSS6my7tN1ogOq5g], diff size [1107]
[2020-07-31T14:32:42,539][DEBUG][o.e.c.c.C.CoordinatorPublication] [integTest-0] publication ended successfully: Publication{term=1, version=5683}
[2020-07-31T14:32:42,540][INFO ][o.e.x.i.IndexLifecycleTransition] [integTest-0] moving index [.ml-stats-000001] from [null] to [{"phase":"new","action":"complete","name":"complete"}] in policy [ml-size-based-ilm-policy]
[2020-07-31T14:32:42,540][DEBUG][o.e.c.c.PublicationTransportHandler] [integTest-0] received diff cluster state version [5684] with uuid [sO5GrF17TWuywmGeF21c5A], diff size [502]
[2020-07-31T14:32:42,545][WARN ][r.suppressed ] [integTest-0] path: /_ml/data_frame/analytics/bar, params: {id=bar}
org.elasticsearch.action.search.SearchPhaseExecutionException: all shards failed
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:554) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:309) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:585) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.onShardFailure(AbstractSearchAsyncAction.java:393) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.action.search.AbstractSearchAsyncAction.access$100(AbstractSearchAsyncAction.java:68) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.action.search.AbstractSearchAsyncAction$1.onFailure(AbstractSearchAsyncAction.java:245) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.action.search.SearchExecutionStatsCollector.onFailure(SearchExecutionStatsCollector.java:73) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:59) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.action.search.SearchTransportService$ConnectionCountingHandler.handleException(SearchTransportService.java:403) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.transport.TransportService$6.handleException(TransportService.java:638) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1172) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1281) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1255) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:61) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.transport.TransportChannel.sendErrorResponse(TransportChannel.java:56) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:51) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.search.SearchService.lambda$runAsync$0(SearchService.java:414) ~[elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:710) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.10.0-SNAPSHOT.jar:7.10.0-SNAPSHOT]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
Therefore it seems likely that the problem happens when deleting the job stats while the stats index isn't yet ready. This also ties up with "failed_shards":[]
in the exception message - all shards failed yet no failed shards implies no shards at all were queried.
I think the problem may be coming from the async nature of the TrainedModelStatsService
.
Suppose a test does something that triggers the TrainedModelStatsService
to start operating. That will then continue to operate in the background while other tests run.
The cleanup in between tests will delete all the indices. TrainedModelStatsService
will put the .ml-stats
index back as soon as it notices it's gone. This may then catch out other tests that didn't expect it to exist.
In the most recent failure mentioned in this issue the failed test is Test get stats given multiple analytics
. That doesn't actually start any jobs, so you would not expect the .ml-stats
index to exist at the end of it. Data frame analytics jobs create the stats index when they are started, and the start endpoint won't return until the index is in yellow state. So something else must have created the .ml-stats
index while the test was running and TrainedModelStatsService
is the prime suspect.
Looking at the run up to the failure, Test create and delete pipeline with inference processor
ran about 1 second before it. In TrainedModelStatsService
PERSISTENCE_INTERVAL = TimeValue.timeValueSeconds(1)
, so that would explain why the stats index got created during Test get stats given multiple analytics
.
1> [2020-07-31T16:32:41,562][INFO ][o.e.s.MlWithSecurityIT ] [test] [yaml=ml/inference_processor/Test create and delete pipeline with inference processor] before test |
1> [2020-07-31T16:32:42,009][INFO ][o.e.s.MlWithSecurityIT ] [test] [yaml=ml/inference_processor/Test create and delete pipeline with inference processor] after test |
1> [2020-07-31T16:32:42,019][INFO ][o.e.s.MlWithSecurityIT ] [test] [yaml=ml/jobs_crud/Test put job with inconsistent model snapshot settings] before test |
1> [2020-07-31T16:32:42,037][INFO ][o.e.s.MlWithSecurityIT ] [test] [yaml=ml/jobs_crud/Test put job with inconsistent model snapshot settings] after test |
1> [2020-07-31T16:32:42,047][INFO ][o.e.s.MlWithSecurityIT ] [test] [yaml=ml/data_frame_analytics_crud/Test get stats given multiple analytics] before test |
1> [2020-07-31T16:32:42,733][INFO ][o.e.s.MlWithSecurityIT ] [test] [yaml=ml/data_frame_analytics_crud/Test get stats given multiple analytics] after test |
2> REPRODUCE WITH: gradlew ':x-pack:plugin:ml:qa:ml-with-security:integTestRunner' --tests "org.elasticsearch.smoketest.MlWithSecurityIT.test {yaml=ml/data_frame_analytics_crud/Test get stats given multiple analytics}" -Dtests.seed=AFA936579D314669 -Dtests.security.manager=true -Dtests.locale=pt-BR -Dtests.timezone=Europe/Amsterdam -Druntime.java=8 -Dtests.rest.blacklist=ml/ml_classic_analyze/Test analyze API with an analyzer that does what we used to do in native code,ml/calendar_crud/Test get calendar given missing,ml/calendar_crud/Test cannot create calendar with name _all,ml/calendar_crud/Test PageParams with ID is invalid,ml/calendar_crud/Test post calendar events given empty events,ml/calendar_crud/Test put calendar given id contains invalid chars,ml/calendar_crud/Test delete event from non existing calendar,ml/calendar_crud/Test delete job from non existing calendar,ml/custom_all_field/Test querying custom all field,ml/datafeeds_crud/Test delete datafeed with missing id,ml/datafeeds_crud/Test put datafeed referring to missing job_id,ml/datafeeds_crud/Test put datafeed with invalid query,ml/datafeeds_crud/Test put datafeed with security headers in the body,ml/datafeeds_crud/Test update datafeed with missing id,ml/data_frame_analytics_crud/Test put config with security headers in the body,ml/data_frame_analytics_crud/Test put config with create_time in the body,ml/data_frame_analytics_crud/Test put config with version in the body,ml/data_frame_analytics_crud/Test put config with inconsistent body/param ids,ml/data_frame_analytics_crud/Test put config with invalid id,ml/data_frame_analytics_crud/Test put config with invalid dest index name,ml/data_frame_analytics_crud/Test put config with pattern dest index name,ml/data_frame_analytics_crud/Test put config with missing concrete source index,ml/data_frame_analytics_crud/Test put config with missing wildcard source index,ml/data_frame_analytics_crud/Test put config with dest index same as source index,ml/data_frame_analytics_crud/Test put config with dest index matching multiple indices,ml/data_frame_analytics_crud/Test put config with dest index included in source via alias,ml/data_frame_analytics_crud/Test put config with remote source index,ml/data_frame_analytics_crud/Test put config with unknown top level field,ml/data_frame_analytics_crud/Test put config with unknown field in outlier detection analysis,ml/data_frame_analytics_crud/Test put config given analyzed_fields include field excluded by source,ml/data_frame_analytics_crud/Test put config given missing source,ml/data_frame_analytics_crud/Test put config given source with empty index array,ml/data_frame_analytics_crud/Test put config given source with empty string in index array,ml/data_frame_analytics_crud/Test put config given source without index,ml/data_frame_analytics_crud/Test put config given missing dest,ml/data_frame_analytics_crud/Test put config given dest index contains uppercase chars,ml/data_frame_analytics_crud/Test put config given dest with empty index,ml/data_frame_analytics_crud/Test put config given dest without index,ml/data_frame_analytics_crud/Test put config given missing analysis,ml/data_frame_analytics_crud/Test put config given empty analysis,ml/data_frame_analytics_crud/Test max model memory limit,ml/data_frame_analytics_crud/Test put outlier_detection given n_neighbors is negative,ml/data_frame_analytics_crud/Test put outlier_detection given n_neighbors is zero,ml/data_frame_analytics_crud/Test put outlier_detection given feature_influence_threshold is negative,ml/data_frame_analytics_crud/Test put outlier_detection given feature_influence_threshold is greater than one,ml/data_frame_analytics_crud/Test put outlier_detection given outlier_fraction is negative,ml/data_frame_analytics_crud/Test put outlier_detection given outlier_fraction is greater than one,ml/data_frame_analytics_crud/Test put regression given dependent_variable is not defined,ml/data_frame_analytics_crud/Test put regression given negative lambda,ml/data_frame_analytics_crud/Test put regression given negative gamma,ml/data_frame_analytics_crud/Test put regression given eta less than 1e-3,ml/data_frame_analytics_crud/Test put regression given eta greater than one,ml/data_frame_analytics_crud/Test put regression given max_trees is zero,ml/data_frame_analytics_crud/Test put regression given max_trees is greater than 2k,ml/data_frame_analytics_crud/Test put regression given feature_bag_fraction is negative,ml/data_frame_analytics_crud/Test put regression given feature_bag_fraction is greater than one,ml/data_frame_analytics_crud/Test put regression given training_percent is less than one,ml/data_frame_analytics_crud/Test put regression given training_percent is greater than hundred,ml/data_frame_analytics_crud/Test put regression given loss_function_parameter is zero,ml/data_frame_analytics_crud/Test put regression given loss_function_parameter is negative,ml/data_frame_analytics_crud/Test put classification given dependent_variable is not defined,ml/data_frame_analytics_crud/Test put classification given negative lambda,ml/data_frame_analytics_crud/Test put classification given negative gamma,ml/data_frame_analytics_crud/Test put classification given eta less than 1e-3,ml/data_frame_analytics_crud/Test put classification given eta greater than one,ml/data_frame_analytics_crud/Test put classification given max_trees is zero,ml/data_frame_analytics_crud/Test put classification given max_trees is greater than 2k,ml/data_frame_analytics_crud/Test put classification given feature_bag_fraction is negative,ml/data_frame_analytics_crud/Test put classification given feature_bag_fraction is greater than one,ml/data_frame_analytics_crud/Test put classification given num_top_classes is less than zero,ml/data_frame_analytics_crud/Test put classification given num_top_classes is greater than 1k,ml/data_frame_analytics_crud/Test put classification given training_percent is less than one,ml/data_frame_analytics_crud/Test put classification given training_percent is greater than hundred,ml/estimate_model_memory/Test missing overall cardinality,ml/estimate_model_memory/Test missing max bucket cardinality,ml/evaluate_data_frame/Test given missing index,ml/evaluate_data_frame/Test given index does not exist,ml/evaluate_data_frame/Test given missing evaluation,ml/evaluate_data_frame/Test outlier_detection auc_roc given actual_field is always true,ml/evaluate_data_frame/Test outlier_detection auc_roc given actual_field is always false,ml/evaluate_data_frame/Test outlier_detection given evaluation with empty metrics,ml/evaluate_data_frame/Test outlier_detection given missing actual_field,ml/evaluate_data_frame/Test outlier_detection given missing predicted_probability_field,ml/evaluate_data_frame/Test outlier_detection given precision with threshold less than zero,ml/evaluate_data_frame/Test outlier_detection given recall with threshold less than zero,ml/evaluate_data_frame/Test outlier_detection given confusion_matrix with threshold less than zero,ml/evaluate_data_frame/Test outlier_detection given precision with empty thresholds,ml/evaluate_data_frame/Test outlier_detection given recall with empty thresholds,ml/evaluate_data_frame/Test outlier_detection given confusion_matrix with empty thresholds,ml/evaluate_data_frame/Test classification given evaluation with empty metrics,ml/evaluate_data_frame/Test classification given missing actual_field,ml/evaluate_data_frame/Test classification given missing predicted_field,ml/evaluate_data_frame/Test regression given evaluation with empty metrics,ml/evaluate_data_frame/Test regression given missing actual_field,ml/evaluate_data_frame/Test regression given missing predicted_field,ml/explain_data_frame_analytics/Test neither job id nor body,ml/explain_data_frame_analytics/Test both job id and body,ml/explain_data_frame_analytics/Test missing job,ml/explain_data_frame_analytics/Test empty data frame given body,ml/delete_job_force/Test cannot force delete a non-existent job,ml/delete_model_snapshot/Test delete snapshot missing snapshotId,ml/delete_model_snapshot/Test delete snapshot missing job_id,ml/delete_model_snap 2> shot/Test delete with in-use model,ml/filter_crud/Test create filter api with mismatching body ID,ml/filter_crud/Test create filter given invalid filter_id,ml/filter_crud/Test get filter API with bad ID,ml/filter_crud/Test non-existing filter,ml/filter_crud/Test update filter given remove item is not present,ml/filter_crud/Test get all filter given index exists but no mapping for filter_id,ml/get_datafeed_stats/Test get datafeed stats given missing datafeed_id,ml/get_datafeeds/Test get datafeed given missing datafeed_id,ml/inference_crud/Test delete given used trained model,ml/inference_crud/Test delete with missing model,ml/inference_crud/Test get given missing trained model,ml/inference_crud/Test get given expression without matches and allow_no_match is false,ml/inference_crud/Test put ensemble with empty models,ml/inference_crud/Test put ensemble with tree where tree has out of bounds feature_names index,ml/inference_crud/Test put model with empty input.field_names,ml/inference_crud/Test PUT model where target type and inference config mismatch,ml/inference_processor/Test create processor with missing mandatory fields,ml/inference_stats_crud/Test get stats given missing trained model,ml/inference_stats_crud/Test get stats given expression without matches and allow_no_match is false,ml/jobs_crud/Test cannot create job with existing categorizer state document,ml/jobs_crud/Test cannot create job with existing quantiles document,ml/jobs_crud/Test cannot create job with existing result document,ml/jobs_crud/Test cannot create job with model snapshot id set,ml/jobs_crud/Test cannot decrease model_memory_limit below current usage,ml/jobs_crud/Test get job API with non existing job id,ml/jobs_crud/Test put job after closing results index,ml/jobs_crud/Test put job after closing state index,ml/jobs_crud/Test put job with inconsistent body/param ids,ml/jobs_crud/Test put job with inconsistent model snapshot settings,ml/jobs_crud/Test put job with time field in analysis_config,ml/jobs_crud/Test put job with duplicate detector configurations,ml/jobs_crud/Test job with categorization_analyzer and categorization_filters,ml/jobs_get/Test get job given missing job_id,ml/jobs_get_result_buckets/Test mutually-exclusive params,ml/jobs_get_result_buckets/Test mutually-exclusive params via body,ml/jobs_get_result_categories/Test with invalid param combinations,ml/jobs_get_result_categories/Test with invalid param combinations via body,ml/jobs_get_result_overall_buckets/Test overall buckets given missing job,ml/jobs_get_result_overall_buckets/Test overall buckets given non-matching expression and not allow_no_jobs,ml/jobs_get_result_overall_buckets/Test overall buckets given top_n is 0,ml/jobs_get_result_overall_buckets/Test overall buckets given top_n is negative,ml/jobs_get_result_overall_buckets/Test overall buckets given invalid start param,ml/jobs_get_result_overall_buckets/Test overall buckets given invalid end param,ml/jobs_get_result_overall_buckets/Test overall buckets given bucket_span is smaller than max job bucket_span,ml/jobs_get_stats/Test get job stats given missing job,ml/jobs_get_stats/Test no exception on get job stats with missing index,ml/job_groups/Test put job with empty group,ml/job_groups/Test put job with group that matches an job id,ml/job_groups/Test put job with group that matches its id,ml/job_groups/Test put job with id that matches an existing group,ml/job_groups/Test put job with invalid group,ml/ml_info/Test ml info,ml/pipeline_inference/Test setting results field is invalid,ml/post_data/Test Flush data with invalid parameters,ml/post_data/Test flushing and posting a closed job,ml/post_data/Test open and close with non-existent job id,ml/post_data/Test POST data with invalid parameters,ml/preview_datafeed/Test preview missing datafeed,ml/revert_model_snapshot/Test revert model with invalid snapshotId,ml/start_data_frame_analytics/Test start given missing source index,ml/start_data_frame_analytics/Test start outlier_detection given source index has no fields,ml/start_data_frame_analytics/Test start regression given source index only has dependent variable,ml/start_data_frame_analytics/Test start with inconsistent body/param ids,ml/start_data_frame_analytics/Test start given dest index is not empty,ml/start_data_frame_analytics/Test start with compatible fields but no data,ml/start_stop_datafeed/Test start datafeed job, but not open,ml/start_stop_datafeed/Test start non existing datafeed,ml/start_stop_datafeed/Test stop non existing datafeed,ml/update_model_snapshot/Test without description,ml/validate/Test invalid job config,ml/validate/Test job config is invalid because model snapshot id set,ml/validate/Test job config that is invalid only because of the job ID,ml/validate/Test job config with duplicate detector configurations,ml/validate_detector/Test invalid detector,ml/delete_forecast/Test delete on _all forecasts not allow no forecasts,ml/delete_forecast/Test delete forecast on missing forecast,ml/set_upgrade_mode/Attempt to open job when upgrade_mode is enabled,ml/set_upgrade_mode/Setting upgrade_mode to enabled,ml/set_upgrade_mode/Setting upgrade mode to disabled from enabled,ml/set_upgrade_mode/Test setting upgrade_mode to false when it is already false
We have had situations before where the test harness and the server code fight each other to delete and recreate indices and/or index templates. I am not sure what the solution is in this case. @benwtrent is there a way we can cancel the TrainedModelStatsService
in the cleanup that runs in between YAML tests so that it's only running in tests that actually test inference? Another alternative might be to tell the cleanup in between tests not to delete the .ml-stats*
indices, any change any tests that want to assert on the contents of it to explicitly delete all documents from it as their first step. Maybe there is a solution that is better still.
@droberts195 would better integration with node lifecycle fix it? Or are these nodes the same between tests?
Or are these nodes the same between tests?
It's a single node for the whole test suite.
If you download https://storage.cloud.google.com/elasticsearch-ci-artifacts/jobs/elastic+elasticsearch+7.x+multijob-windows-compatibility/os=windows-2012-r2/build/132.tar.bz2, expand it and then look in x-pack/plugin/ml/qa/ml-with-security/build/testclusters/integTest-0/logs/integTest.log
you can see everything that node did during the test suite. You can see it ran for 11 minutes.
It would be too slow if a separate JVM had to be spun up for every YAML test.
@droberts195 what if delete analytics job didn't throw when there was a problem deleting from the stats index?
I suppose at that point there is the worry of leftover stats documents :/ But we would have this issue in production anyways if a user ran into this.
is there a way we can cancel the TrainedModelStatsService
Yaml tests require all interactions are done via a REST API. So, if we added something like this, it would have to be an API call.
There is also no way to wait for the stats service to become idle (outside of a separate API call).
Ingest pipeline deletion is is a cluster state removal action. Then other things are triggered off of that, so there is no way to put a wait
there.
Another failure https://gradle-enterprise.elastic.co/s/2p4o7efd3lmrk
what if delete analytics job didn't throw when there was a problem deleting from the stats index?
We discussed this in the team meeting and agreed to do this.
Plus we'll also clean up orphaned stats documents in the nightly maintenance like we clean up orphaned state documents.
Two tests failed with the same reason during deletion of a data frame analytics job:
p0=ml/start_data_frame_analytics/Test start with compatible fields but no data
Build scan: https://gradle-enterprise.elastic.co/s/3vyzs4yrmork2
Repro line:
Failure:
yaml=ml/data_frame_analytics_crud/Test update given all updatable settings
Build scan: https://gradle-enterprise.elastic.co/s/zaqfxhphcg72g
Repro line:
./gradlew ':x-pack:plugin:ml:qa:ml-with-security:integTestRunner' --tests "org.elasticsearch.smoketest.MlWithSecurityIT.test {yaml=ml/data_frame_analytics_crud/Test update given all updatable settings}" -Dtests.seed=C15EB5D34B41D230
Failure: