elastic / ml-cpp

Machine learning C++ code

[ML] Memory estimates way too high for very simple analyses #1106

Closed · droberts195 closed this issue 4 years ago

droberts195 commented 4 years ago

Steps to reproduce:

  1. Import the Iowa liquor sales dataset using the file data visualizer (ping me if you need the file).
  2. Create a new regression data frame analytics job to analyze it, setting the training percent to 5, the dependent variable to Sale (Dollars), and excluding every variable except Bottle Volume (ml) and Store Number from the analysis. (So effectively we're predicting one number from two others on 5% of the 380000 rows, i.e. 19000 rows.) A roughly equivalent API configuration is sketched after these steps.

[Screenshot: job configuration, 2020-03-31 13:05]

  3. Start the analysis and wait for it to finish.
  4. Look at the job details. The memory limit recommended by the UI was around 1.2GB. The actual memory required was less than 12MB.

[Screenshot: job details, 2020-03-31 13:09]
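
For reference, here is roughly what the equivalent job configuration looks like through the API rather than the UI. This is only a sketch: the job ID, the destination index name, and the 1200mb memory limit (taken from the UI recommendation above) are placeholders, not values from the actual run.

PUT _ml/data_frame/analytics/iowa-sales-regression
{
  "source": {
    "index": [
      "iowa"
    ],
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": "iowa-sales-regression-results"
  },
  "analysis": {
    "regression": {
      "dependent_variable": "Sale (Dollars)",
      "training_percent": 5
    }
  },
  "analyzed_fields": {
    "includes": [
      "Bottle Volume (ml)",
      "Store Number",
      "Sale (Dollars)"
    ],
    "excludes": []
  },
  "model_memory_limit": "1200mb"
}

Here I've listed the dependent variable alongside the two feature fields in includes instead of excluding everything else, which should amount to the same analysis. Starting it with POST _ml/data_frame/analytics/iowa-sales-regression/_start is what step 3 does through the UI.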

Part of the problem here is https://github.com/elastic/kibana/issues/60496, because the memory estimate didn't get updated when I added the exclude fields. However, a considerable part of the problem is in the C++ estimation code. If I run the estimate in the dev console using the final config, it's still about 25 times bigger than it needs to be:

POST _ml/data_frame/analytics/_explain
{
    "source": {
      "index": [
        "iowa"
      ],
      "query": {
        "match_all": {}
      }
    },
    "analysis": {
      "regression": {
        "dependent_variable": "Sale (Dollars)",
        "prediction_field_name": "Sale (Dollars)_prediction",
        "training_percent": 5
      }
    },
    "analyzed_fields": {
      "includes": [],
      "excludes": [
        "Address",
        "Bottles Sold",
        "Category",
        "Category Name",
        "City",
        "County",
        "County Boundaries of Iowa",
        "Iowa ZIP Code Tabulation Areas",
        "Item Description",
        "Item Number",
        "Pack",
        "State Bottle Cost",
        "State Bottle Retail",
        "Store Location",
        "Store Name",
        "Iowa Watersheds (HUC 10)",
        "Iowa Watershed Sub-Basins (HUC 08)",
        "County Number",
        "Invoice/Item Number",
        "US Counties",
        "Zip Code",
        "Vendor Name",
        "Vendor Number",
        "Volume Sold (Gallons)",
        "Volume Sold (Liters)"
      ]
    }  
}

returns:

{
  "field_selection" : [
    ... blah ...
  ],
  "memory_estimation" : {
    "expected_memory_without_disk" : "306147kb",
    "expected_memory_with_disk" : "306147kb"
  }
}

And from the second screenshot you can see the actual usage was 12322863 bytes ≈ 12034kb, so the 306147kb estimate is roughly 25 times too high.

This is a big problem for Cloud trials, where users don't have much memory to play with and we refuse to run an analysis if its memory estimate won't fit on the available machine.

tveasey commented 4 years ago

This is partly a known issue: we need to communicate the training percentage to the memory estimation process, since this very significantly affects the actual memory usage.
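
Purely as an illustration of the point: the training percentage is already part of the analysis section that _explain receives, so once the estimator makes use of it, running the same request with training_percent 100 instead of 5 should produce a noticeably larger estimate (the includes list below is just the compact form of the exclude list in the original request):

POST _ml/data_frame/analytics/_explain
{
  "source": {
    "index": [
      "iowa"
    ],
    "query": {
      "match_all": {}
    }
  },
  "analysis": {
    "regression": {
      "dependent_variable": "Sale (Dollars)",
      "training_percent": 100
    }
  },
  "analyzed_fields": {
    "includes": [
      "Bottle Volume (ml)",
      "Store Number",
      "Sale (Dollars)"
    ],
    "excludes": []
  }
}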

droberts195 commented 4 years ago

After the fix in #1111 the estimate for a training percent of 5 on the Iowa liquor sales data dropped from 306147kb to 74672kb, a great improvement, although that is still roughly 6 times the ~12034kb actually used.

droberts195 commented 4 years ago

With a training percent of 80, the estimate is currently 273319kb and the actual usage is 13109339 bytes (~12802kb), so the estimate is still roughly 21 times too high.

tveasey commented 4 years ago

We've discussed this and we're going to work on calibrating the current worst-case memory estimates based on a variety of different classification and regression runs.

tveasey commented 4 years ago

This was fixed in #1298.