elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[ML] DFA job gets stuck when no field except the dependent variable is included in the analysis #55593

Closed blookot closed 4 years ago

blookot commented 4 years ago

Elasticsearch version (bin/elasticsearch --version): 7.6.2

JVM version (java -version): running on ESS

Description of the problem including expected versus actual behavior:

I'm running a regression data frame analytics job and it stops at 50% (loading data is 100%, analyzing is 0%). Can't understand why...

Steps to reproduce:

  1. load the csv file attached (rename it with csv extension)
  2. create the ml regression job on it (data analytics with this index source, all the rest default)
  3. start the job

Here is an example of ml job:

{
  "id": "test8",
  "description": "",
  "source": {
    "index": [
      "disk_usage"
    ],
    "query": {
      "match_all": {}
    },
    "_source": {
      "includes": [],
      "excludes": []
    }
  },
  "dest": {
    "index": "test8",
    "results_field": "ml"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "disk_percent",
      "prediction_field_name": "disk_percent_prediction",
      "training_percent": 80,
      "randomize_seed": -2904501521181443000
    }
  },
  "analyzed_fields": {
    "includes": [],
    "excludes": []
  },
  "model_memory_limit": "100mb",
  "create_time": 1587562995031,
  "version": "7.6.2",
  "allow_lazy_start": false
}

The logs don't show anything unusual:

2020-04-22 15:43:15 | instance-0000000000 | Created analytics with analysis type [regression]
2020-04-22 15:43:17 | instance-0000000000 | Estimated memory usage for this analytics to be [18.2mb]
2020-04-22 15:43:17 | instance-0000000000 | Starting analytics on node [{instance-0000000002}{3pArSzZmQpiVzw8sQqmcQA}{FUHdB0WDRU-gNIz1SpthHQ}{10.43.1.93}{10.43.1.93:19669}{l}{logical_availability_zone=zone-0, server_name=instance-0000000002.4e4d9d9dbfd3428da12363c78f9aa352, availability_zone=europe-west1-b, ml.machine_memory=1073741824, xpack.installed=true, instance_configuration=gcp.ml.1, ml.max_open_jobs=20, region=unknown-region}]
2020-04-22 15:43:17 | instance-0000000000 | Started analytics
2020-04-22 15:43:17 | instance-0000000002 | Creating destination index [test8]
2020-04-22 15:43:18 | instance-0000000002 | Finished reindexing to destination index [test8]
2020-04-22 15:59:06 | instance-0000000002 | Finished analysis
2020-04-22 15:59:06 | instance-0000000000 | Stopped analytics

disk_usage.txt

elasticmachine commented 4 years ago

Pinging @elastic/ml-core (:ml)

dimitris-athanasiou commented 4 years ago

@blookot Could you please explain how you're indexing the data?

blookot commented 4 years ago

I'm loading the csv file using data visualizer @dimitris-athanasiou

dimitris-athanasiou commented 4 years ago

Thank you @blookot. I have reproduced the issue. You have uncovered a bug caused by the dataset having no features: there is only the dependent_variable.

I think there are two issues to fix here:

  1. The _start API should fail to run when this is the case
  2. The C++ process shouldn't get stuck even if this is the case

We'll proceed to fix them both.
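Until those fixes land, a client-side pre-flight check can catch this configuration before the _start API is ever called. A minimal sketch (the helper name and the example field lists are hypothetical illustrations, not part of any Elasticsearch client):

```python
def validate_dfa_features(source_fields, dependent_variable, excludes=()):
    """Return the set of usable feature fields for a regression DFA job.

    Raises ValueError when nothing but the dependent variable would be
    analyzed -- the situation that left the analytics process stuck here.
    """
    features = set(source_fields) - {dependent_variable} - set(excludes)
    if not features:
        raise ValueError(
            "no feature fields left: the analysis would contain only "
            f"the dependent variable {dependent_variable!r}"
        )
    return features


# Hypothetical reconstruction of the dataset in this issue: a timestamp
# (not usable as a feature at the time) plus the dependent variable.
try:
    validate_dfa_features(["timestamp", "disk_percent"], "disk_percent",
                          excludes=["timestamp"])
except ValueError as err:
    print("would refuse to start:", err)
```

Running the check before job creation turns a silent hang into an immediate, explainable error.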

Once again, thank you for reporting this. It helps us make the feature better!

blookot commented 4 years ago

Hi @dimitris-athanasiou, why can't we use timestamp as a feature? In my case it's a disk slowly filling, and I'd like to use regression and inference to predict when my disk is going to be full. I can plot timestamp on x and disk usage on y and get a nice dot chart... I guess this falls into the single metric ML (temporal) case with forecast...

blookot commented 4 years ago

PS. CPU is running at 100% (on my ML node) until I stop the job!

dimitris-athanasiou commented 4 years ago

Indeed, your use case is a time series analysis. You can use an anomaly detection job to model the data and then use the forecast feature in order to predict when the disk will be full.

Having said that, we're planning to revisit date features for data frame analytics jobs. We have not addressed them yet as they require special handling that we decided to defer until later in the project. This is not a promise that we'll support them though.
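The "special handling" for date features typically amounts to deriving numeric inputs from the timestamp, since a tree-based model can split on numbers but not on raw dates. A hedged sketch of what that could look like client-side (the derived feature names are illustrative, not what Elasticsearch itself computes):

```python
import math
from datetime import datetime, timezone


def date_features(ts: datetime) -> dict:
    """Derive numeric features from a timestamp for a tree-based model.

    The epoch value preserves ordering/trend; the sin/cos pair encodes
    hour-of-day cyclically so 23:00 and 00:00 end up close together.
    """
    epoch = ts.replace(tzinfo=timezone.utc).timestamp()
    hour_angle = 2 * math.pi * ts.hour / 24
    return {
        "epoch_seconds": epoch,
        "hour_sin": math.sin(hour_angle),
        "hour_cos": math.cos(hour_angle),
        "day_of_week": ts.weekday(),
    }


print(date_features(datetime(2020, 4, 22, 15, 43)))
```

With derived columns like these indexed alongside disk_percent, the dataset would have real features and the regression job would have something to fit.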

> PS. CPU is running at 100% (on my ML node) until I stop the job!

Thanks for the note! I noticed that too. We'll make sure to fix this issue.

blookot commented 4 years ago

Yes, I've been playing (successfully) with single metric & forecast. I thought dates were stored as longs (like unix epoch), so I imagined a 2D dot plot with the regression based on my timestamp and disk usage... But I'll wait for it :-) Thanks again @dimitris-athanasiou

[screenshot]

tveasey commented 4 years ago

> We have not addressed them yet as they require special handling that we decided to defer until later in the project.

Just to add to this, the regression model we use isn't immediately well suited to extrapolation, as needed for forecasting. To get it to work in this fashion needs some explicit handling in inference and also judicious feature creation. As @dimitris-athanasiou says, using this functionality to enhance our forecasting capabilities (particularly to include additional explanatory variables) is definitely something on the roadmap.