elastic / kibana


[ML] Cloned multi metric jobs have different swimlane colours #18129

Open elasticmachine opened 6 years ago

elasticmachine commented 6 years ago

Original comment by @pheyos:

Versions:

Browser: Firefox 58.0

Steps to reproduce:

Additional information:

{
  "job_id": "cw_multi_1",
  "job_type": "anomaly_detector",
  "job_version": "6.1.3",
  "groups": [
    "manual_ui_tests"
  ],
  "description": "cw multi 1",
  "create_time": 1516803906376,
  "finished_time": 1516805201848,
  "established_model_memory": 3307064,
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      {
        "detector_description": "mean(CPUUtilization)",
        "function": "mean",
        "field_name": "CPUUtilization",
        "partition_field_name": "instance",
        "detector_rules": [

        ],
        "detector_index": 0
      }
    ],
    "influencers": [
      "instance"
    ]
  },
  "analysis_limits": {
    "model_memory_limit": "14mb"
  },
  "data_description": {
    "time_field": EMAIL REDACTED
    "time_format": "epoch_ms"
  },
  "model_snapshot_retention_days": 1,
  "model_snapshot_id": "1516803949",
  "results_index_name": "shared",
  "data_counts": {
    "job_id": "cw_multi_1",
    "processed_record_count": 1793481,
    "processed_field_count": 2050056,
    "input_bytes": 100928963,
    "input_field_count": 2050056,
    "invalid_date_count": 0,
    "missing_field_count": 1536906,
    "out_of_order_timestamp_count": 0,
    "empty_bucket_count": 0,
    "sparse_bucket_count": 0,
    "bucket_count": 1398,
    "earliest_record_timestamp": 1477612800000,
    "latest_record_timestamp": 1478871060000,
    "last_data_time": 1516803948950,
    "input_record_count": 1793481
  },
  "model_size_stats": {
    "job_id": "cw_multi_1",
    "result_type": "model_size_stats",
    "model_bytes": 3307064,
    "total_by_field_count": 79,
    "total_over_field_count": 0,
    "total_partition_field_count": 78,
    "bucket_allocation_failures_count": 0,
    "memory_status": "ok",
    "log_time": 1516805201000,
    "timestamp": 1478870100000
  },
  "datafeed_config": {
    "datafeed_id": "datafeed-cw_multi_1",
    "job_id": "cw_multi_1",
    "query_delay": "65630ms",
    "indices": [
      "cloudwatch*"
    ],
    "types": [

    ],
    "query": {
      "match_all": {
        "boost": 1
      }
    },
    "scroll_size": 1000,
    "chunking_config": {
      "mode": "auto"
    },
    "state": "stopped"
  },
  "state": "closed"
}
elasticmachine commented 6 years ago

Original comment by @peteharverson:

I can reproduce this on a 7.0.0 snapshot with clones of cloudwatch jobs using the same job configuration as above. The influencer result docs (result_type: influencer) look identical between the two cloned jobs (same timestamps and influencer_score values), and yet running the aggregation used by the 'view by' swimlane in the Kibana console returns different results for some instances between the two jobs.

The aggregation run by the 'view by' swimlane is of the form:

   "aggs":{
      "influencerFieldValues":{
         "terms":{
            "field":"influencer_field_value",
            "size":10,
            "order":{
               "maxAnomalyScore":"desc"
            }
         },
         "aggs":{
            "maxAnomalyScore":{
               "max":{
                  "field":"influencer_score"
               }
            },
            "byTime":{
               "date_histogram":{
                  "field":"timestamp",
                  "interval":"28800s",
                  "min_doc_count":1
               },
               "aggs":{
                  "maxAnomalyScore":{
                     "max":{
                        "field":"influencer_score"
                     }
                  }
               }
            }
         }
      }
   }
elasticmachine commented 6 years ago

Original comment by @dimitris-athanasiou:

This is a consequence of how sorted terms aggregations work. They are approximate, because each shard only returns its own top-x terms. Any nested aggs then operate only on the subset of docs returned by the terms/order agg. This is by design on the Elasticsearch side.
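
As a side note (not raised in the thread), the terms aggregation also exposes a shard_size parameter, which raises the number of candidate terms each shard returns, trading memory and latency for accuracy. A minimal sketch using the same fields as the swimlane aggregation above (the value 100 is an illustrative choice):

"aggs": {
  "influencerFieldValues": {
    "terms": {
      "field": "influencer_field_value",
      "size": 10,
      "shard_size": 100,
      "order": { "maxAnomalyScore": "desc" }
    },
    "aggs": {
      "maxAnomalyScore": { "max": { "field": "influencer_score" } }
    }
  }
}

This reduces the instability but does not remove it, since ordering terms by a sub-aggregation's max is still approximate across shards.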

A way to improve the stability of the results is to split this into two separate requests. The first request simply finds the top-10 terms over all time. The second request filters on those top-10 terms and then finds the max score per time bucket. This way, the second query correctly operates on all docs for the top-10 terms. However, comparisons between different jobs might still vary, as the first request could return different top-10 terms for each job. A sketch of the two requests follows.
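
A rough sketch of the two-step approach as Kibana console requests (the .ml-anomalies-shared index and cw_multi_1 job_id come from the config above; the instance IDs in the second request are placeholders for whatever the first response returns). First, find the top-10 influencer values over all time:

POST .ml-anomalies-shared/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "result_type": "influencer" } },
        { "term": { "job_id": "cw_multi_1" } }
      ]
    }
  },
  "aggs": {
    "topInfluencerValues": {
      "terms": {
        "field": "influencer_field_value",
        "size": 10,
        "order": { "maxAnomalyScore": "desc" }
      },
      "aggs": {
        "maxAnomalyScore": { "max": { "field": "influencer_score" } }
      }
    }
  }
}

Then filter on those values and bucket by time. Because at most 10 distinct values pass the filter, every shard returns all of them, so the nested max is computed over all matching docs:

POST .ml-anomalies-shared/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "result_type": "influencer" } },
        { "term": { "job_id": "cw_multi_1" } },
        { "terms": { "influencer_field_value": [ "i-0001", "i-0002" ] } }
      ]
    }
  },
  "aggs": {
    "influencerFieldValues": {
      "terms": { "field": "influencer_field_value", "size": 10 },
      "aggs": {
        "byTime": {
          "date_histogram": {
            "field": "timestamp",
            "interval": "28800s",
            "min_doc_count": 1
          },
          "aggs": {
            "maxAnomalyScore": { "max": { "field": "influencer_score" } }
          }
        }
      }
    }
  }
}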

pheyos commented 6 years ago

This happens not only when cloning a job, but also when changing the limit in the Anomaly Explorer for a single job (screenshots taken on 6.4.0-BC4, with .ml-anomalies-shared having 5 primary / 0 replica shards):

[screenshot: anomaly_explorer_different_colors_based_on_limit]