elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
1.08k stars 24.84k forks source link

MlWithSecurityIT data frame analytics tests failures #45787

Closed davidkyle closed 5 years ago

davidkyle commented 5 years ago

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob+fast+part2/866/console https://gradle-enterprise.elastic.co/s/2w3kg7bljpnlc

java.lang.AssertionError: Failure at [ml/data_frame_analytics_memory_usage_estimation:33]: expected [2xx] status code but api [ml.estimate_memory_usage] returned [500 Internal Server Error] [---
error:
  root_cause:
  - type: "exception"
    reason: "[memory_usage_estimation_20487] Error while starting process"
    stack_trace: "ElasticsearchException[[memory_usage_estimation_20487] Error while\
      \ starting process]\n\tat org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:47)\n\
      \tat org.elasticsearch.xpack.ml.dataframe.process.MemoryUsageEstimationProcessManager.runJob(MemoryUsageEstimationProcessManager.java:87)\n\
      \tat org.elasticsearch.xpack.ml.dataframe.process.MemoryUsageEstimationProcessManager.lambda$runJobAsync$0(MemoryUsageEstimationProcessManager.java:49)\n\
      \tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)\n\
      \tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\
      \tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\
      \tat java.base/java.lang.Thread.run(Thread.java:834)\n"
  type: "exception"
  reason: "[memory_usage_estimation_20487] Error while starting process"
  stack_trace: "ElasticsearchException[[memory_usage_estimation_20487] Error while\
    \ starting process]\n\tat org.elasticsearch.xpack.core.ml.utils.ExceptionsHelper.serverError(ExceptionsHelper.java:47)\n\
    \tat org.elasticsearch.xpack.ml.dataframe.process.MemoryUsageEstimationProcessManager.runJob(MemoryUsageEstimationProcessManager.java:87)\n\
    \tat org.elasticsearch.xpack.ml.dataframe.process.MemoryUsageEstimationProcessManager.lambda$runJobAsync$0(MemoryUsageEstimationProcessManager.java:49)\n\
    \tat org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:699)\n\
    \tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\
    \tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\
    \tat java.base/java.lang.Thread.run(Thread.java:834)\n"
status: 500

Does not reproduce:

/gradlew :x-pack:plugin:ml:qa:ml-with-security:integTestRunner --tests "org.elasticsearch.smoketest.MlWithSecurityIT.test {yaml=ml/data_frame_analytics_memory_usage_estimation/Test memory usage estimation for non-empty data frame}" \
  -Dtests.seed=6EC2DEEC7AAC9827 \
  -Dtests.security.manager=true \
  -Dtests.locale=se-FI \
  -Dtests.timezone=Asia/Kuala_Lumpur \
  -Dcompiler.java=12 \
  -Druntime.java=11 \
  -Dtests.rest.blacklist=ml/ml_classic_analyze/Test analyze API with an analyzer that does what we used to do in native code,ml/calendar_crud/Test get calendar given missing,ml/calendar_crud/Test cannot create calendar with name _all,ml/calendar_crud/Test PageParams with ID is invalid,ml/calendar_crud/Test post calendar events given empty events,ml/calendar_crud/Test put calendar given id contains invalid chars,ml/calendar_crud/Test delete event from non existing calendar,ml/calendar_crud/Test delete job from non existing calendar,ml/custom_all_field/Test querying custom all field,ml/datafeeds_crud/Test delete datafeed with missing id,ml/datafeeds_crud/Test put datafeed referring to missing job_id,ml/datafeeds_crud/Test put datafeed with invalid query,ml/datafeeds_crud/Test put datafeed with security headers in the body,ml/datafeeds_crud/Test update datafeed with missing id,ml/data_frame_analytics_crud/Test put config with security headers in the body,ml/data_frame_analytics_crud/Test put config with create_time in the body,ml/data_frame_analytics_crud/Test put config with version in the body,ml/data_frame_analytics_crud/Test put config with inconsistent body/param ids,ml/data_frame_analytics_crud/Test put config with invalid id,ml/data_frame_analytics_crud/Test put config with invalid dest index name,ml/data_frame_analytics_crud/Test put config with pattern dest index name,ml/data_frame_analytics_crud/Test put config with missing concrete source index,ml/data_frame_analytics_crud/Test put config with missing wildcard source index,ml/data_frame_analytics_crud/Test put config with dest index same as source index,ml/data_frame_analytics_crud/Test put config with dest index matching multiple indices,ml/data_frame_analytics_crud/Test put config with dest index included in source via alias,ml/data_frame_analytics_crud/Test put config with unknown top level field,ml/data_frame_analytics_crud/Test put config with unknown field in outlier detection analysis,ml/data_frame_analytics_crud/Test put config given missing source,ml/data_frame_analytics_crud/Test put config given source with empty index array,ml/data_frame_analytics_crud/Test put config given source with empty string in index array,ml/data_frame_analytics_crud/Test put config given source without index,ml/data_frame_analytics_crud/Test put config given missing dest,ml/data_frame_analytics_crud/Test put config given dest with empty index,ml/data_frame_analytics_crud/Test put config given dest without index,ml/data_frame_analytics_crud/Test put config given missing analysis,ml/data_frame_analytics_crud/Test put config given empty analysis,ml/data_frame_analytics_crud/Test max model memory limit,ml/data_frame_analytics_crud/Test put regression given dependent_variable is not defined,ml/data_frame_analytics_crud/Test put regression given negative lambda,ml/data_frame_analytics_crud/Test put regression given negative gamma,ml/data_frame_analytics_crud/Test put regression given eta less than 1e-3,ml/data_frame_analytics_crud/Test put regression given eta greater than one,ml/data_frame_analytics_crud/Test put regression given maximum_number_trees is zero,ml/data_frame_analytics_crud/Test put regression given maximum_number_trees is greater than 2k,ml/data_frame_analytics_crud/Test put regression given feature_bag_fraction is negative,ml/data_frame_analytics_crud/Test put regression given feature_bag_fraction is greater than one,ml/evaluate_data_frame/Test given missing index,ml/evaluate_data_frame/Test given index does not exist,ml/evaluate_data_frame/Test given missing evaluation,ml/evaluate_data_frame/Test binary_soft_classification auc_roc given actual_field is always true,ml/evaluate_data_frame/Test binary_soft_classification auc_roc given actual_field is always false,ml/evaluate_data_frame/Test binary_soft_classification given evaluation with empty metrics,ml/evaluate_data_frame/Test binary_soft_classification given missing actual_field,ml/evaluate_data_frame/Test binary_soft_classification given missing predicted_probability_field,ml/evaluate_data_frame/Test binary_soft_classification given precision with threshold less than zero,ml/evaluate_data_frame/Test binary_soft_classification given recall with threshold less than zero,ml/evaluate_data_frame/Test binary_soft_classification given confusion_matrix with threshold less than zero,ml/evaluate_data_frame/Test binary_soft_classification given precision with empty thresholds,ml/evaluate_data_frame/Test binary_soft_classification given recall with empty thresholds,ml/evaluate_data_frame/Test binary_soft_classification given confusion_matrix with empty thresholds,ml/evaluate_data_frame/Test regression given evaluation with empty metrics,ml/evaluate_data_frame/Test regression given missing actual_field,ml/evaluate_data_frame/Test regression given missing predicted_field,ml/delete_job_force/Test cannot force delete a non-existent job,ml/delete_model_snapshot/Test delete snapshot missing snapshotId,ml/delete_model_snapshot/Test delete snapshot missing job_id,ml/delete_model_snapshot/Test delete with in-use model,ml/filter_crud/Test create filter api with mismatching body ID,ml/filter_crud/Test create filter given invalid filter_id,ml/filter_crud/Test get filter API with bad ID,ml/filter_crud/Test invalid param combinations,ml/filter_crud/Test non-existing filter,ml/filter_crud/Test update filter given remove item is not present,ml/filter_crud/Test get all filter given index exists but no mapping for filter_id,ml/get_datafeed_stats/Test get datafeed stats given missing datafeed_id,ml/get_datafeeds/Test get datafeed given missing datafeed_id,ml/jobs_crud/Test cannot create job with existing categorizer state document,ml/jobs_crud/Test cannot create job with existing quantiles document,ml/jobs_crud/Test cannot create job with existing result document,ml/jobs_crud/Test cannot create job with model snapshot id set,ml/jobs_crud/Test cannot decrease model_memory_limit below current usage,ml/jobs_crud/Test get job API with non existing job id,ml/jobs_crud/Test put job after closing results index,ml/jobs_crud/Test put job after closing state index,ml/jobs_crud/Test put job with inconsistent body/param ids,ml/jobs_crud/Test put job with time field in analysis_config,ml/jobs_crud/Test put job with duplicate detector configurations,ml/jobs_crud/Test job with categorization_analyzer and categorization_filters,ml/jobs_get/Test get job given missing job_id,ml/jobs_get_result_buckets/Test mutually-exclusive params,ml/jobs_get_result_buckets/Test mutually-exclusive params via body,ml/jobs_get_result_categories/Test with invalid param combinations,ml/jobs_get_result_categories/Test with invalid param combinations via body,ml/jobs_get_result_overall_buckets/Test overall buckets given missing job,ml/jobs_get_result_overall_buckets/Test overall buckets given non-matching expression and not allow_no_jobs,ml/jobs_get_result_overall_buckets/Test overall buckets given top_n is 0,ml/jobs_get_result_overall_buckets/Test overall buckets given top_n is negative,ml/jobs_get_result_overall_buckets/Test overall buckets given invalid start param,ml/jobs_get_result_overall_buckets/Test overall buckets given invalid end param,ml/jobs_get_result_overall_buckets/Test overall buckets given bucket_span is smaller than max job bucket_span,ml/jobs_get_stats/Test get job stats given missing job,ml/jobs_get_stats/Test no exception on get job stats with missing index,ml/job_groups/Test put job with empty group,ml/job_groups/Test put job with group that matches an job id,ml/job_groups/Test put job with group that matches its id,ml/job_groups/Test put job with id that matches an existing group,ml/job_groups/Test put job with invalid group,ml/ml_info/Test ml info,ml/post_data/Test Flush data with 
elasticmachine commented 5 years ago

Pinging @elastic/ml-core

przemekwitek commented 5 years ago

I tried to reproduce this failure locally, with no success.

FTR, these are the steps I've taken:

  1. I've run this test multiple times: a) using -Dtests.iter:

    ./gradlew :x-pack:plugin:integtest -Dtests.class=org.elasticsearch.xpack.test.rest.XPackRestIT -Dtests.method="test {p0=ml/data_frame_analytics_memory_usage_estimation/*}" -Dtests.iters=10000

    b) using bash for-loop (as I was unsure whether -Dtests.iters really does what I thought it should do):

    for i in `seq 1 100`; do ./gradlew :x-pack:plugin:integtest -Dtests.class=org.elasticsearch.xpack.test.rest.XPackRestIT -Dtests.method="test {p0=ml/data_frame_analytics_memory_usage_estimation/*}" 1>/tmp/test_${i}.out 2>&1; done
  2. I've started the server and flooded it with requests: a) start server locally:

    ./gradlew run -Dtests.es.xpack.license.self_generated.type=trial -Dtests.es.xpack.security.enabled=false -Dtests.heap.size=4g | tee /tmp/es_logs_1

b) populate index:

for i in `seq 1 1000`; do curl -X PUT -H Content-Type:application/json localhost:9200/my-index/_doc/${i} -d '{"a": 10, "b": 20 }'| json_pp; done

c) issue requests:

for i in `seq 1 10000`; do curl -X POST -H Content-Type:application/json localhost:9200/_ml/data_frame/analytics/_estimate_memory_usage -d '{ "source": { "index": ["my-index"] }, "analysis": { "outlier_detection" :{} } }' | json_pp --json_opt=canonical,pretty; done

100% of test runs and requests were successful, no errors in the logs.

przemekwitek commented 5 years ago

Update: I was able to reproduce the issue locally after adding Thread.sleep(10000); right before process.isProcessAlive() check. This basically means that the process closes as soon as it produces output and it does not wait until its output is read by Java code. The reason why memory estimation is the only C++ process that behaves like this is because we do not provide an input pipe to it. One possible solution would be to provide input pipe just like every other C++ process gets. I'd like to explore other options though as memory estimation process does not need to receive any input from Java pipe so adding an input pipe only to control when the process stops feels overly complex.

przemekwitek commented 5 years ago

I'll monitor build-stats console to verify that the failures did not reoccur: https://build-stats.elastic.co/app/kibana#/discover?_g=(refreshInterval:(pause:!t,value:0),time:(from:now-6M,mode:quick,to:now))&_a=(columns:!(test.failed-testcases),index:b646ed00-7efc-11e8-bf69-63c8ef516157,interval:auto,query:(language:lucene,query:'memory_usage_estimation_*'),sort:!(process.time-start,desc))

przemekwitek commented 5 years ago

The failures did not reoccur since 30th of August which suggests the fix worked properly.