elastic / ml-cpp

Machine learning C++ code

[ML] Incorrect peak memory estimation for regression #1487

Closed · valeriy42 closed this 3 years ago

valeriy42 commented 4 years ago

When running a regression on the flight sample data ~~with feature importance calculation activated~~, the job fails in the feature selection phase (before the coarse parameter search) due to exceeding the memory limit. ~~This error occurs neither for regression without feature importance nor for classification with feature importance calculated.~~

~~This is a bug, since feature importance should not affect memory requirements in the feature selection phase (or any phase, actually).~~

EDIT: the failure also occurs with feature importance deactivated. Therefore, I removed the mention of feature importance from the description.

valeriy42 commented 3 years ago

@tveasey I traced the problem back to PR #1156. My best guess is that CBoostedTreeLeafNodeStatistics::estimateMemoryUsage should be alignment-aware, similar to what is done in CDataFrame::estimateMemoryUsage. WDYT?
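Something along these lines, as a rough sketch (the helper name and per-column cost are illustrative, not the actual ml-cpp code; E_Aligned16 suggests a 16-byte boundary):

```cpp
#include <cstddef>

// Illustrative only: count each row's storage rounded up to the next
// 16-byte boundary, so that alignment padding is included in the
// estimate, similar in spirit to CDataFrame::estimateMemoryUsage with
// core::CAlignment::E_Aligned16.
std::size_t alignedRowBytes(std::size_t numberColumns, std::size_t bytesPerColumn) {
    constexpr std::size_t alignment{16};
    std::size_t rowBytes{numberColumns * bytesPerColumn};
    return (rowBytes + alignment - 1) / alignment * alignment;
}
```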

valeriy42 commented 3 years ago

Hi @tveasey. I looked into the memory underestimation issue and I couldn't figure it out. The change you suggested offline in CBoostedTreeImpl.h didn't make a difference. I dug a little deeper, and the whole estimated memory for the DataFrame on the kibana_sample_data_flights dataset is about 1 MB, with or without alignment, while the memory requirement is estimated at about 36 MB. For the job below:

```json
{
  "description": "",
  "source": {
    "index": "kibana_sample_data_flights",
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": ""
  },
  "analyzed_fields": {
    "includes": [
      "AvgTicketPrice",
      "Cancelled",
      "Carrier",
      "Dest",
      "DestAirportID",
      "DestCityName",
      "DestCountry",
      "DestRegion",
      "DestWeather",
      "DistanceKilometers",
      "DistanceMiles",
      "FlightDelay",
      "FlightDelayMin",
      "FlightDelayType",
      "FlightNum",
      "FlightTimeHour",
      "FlightTimeMin",
      "Origin",
      "OriginAirportID",
      "OriginCityName",
      "OriginCountry",
      "OriginRegion",
      "OriginWeather",
      "dayOfWeek"
    ]
  },
  "analysis": {
    "regression": {
      "dependent_variable": "AvgTicketPrice",
      "num_top_feature_importance_values": 0,
      "training_percent": 90
    }
  },
  "model_memory_limit": "36mb",
  "max_num_threads": 1
}
```

The final stats are:

```json
{
    "name": "E_DFTPMEstimatedPeakMemoryUsage",
    "description": "The upfront estimate of the peak memory training the predictive model would use",
    "value": 40197269
},
{
    "name": "E_DFTPMPeakMemoryUsage",
    "description": "The peak memory training the predictive model used",
    "value": 92063248
},
{
    "name": "E_DFTPMTimeToTrain",
    "description": "The time it took to train the predictive model",
    "value": 227777
},
{
    "name": "E_DFTPMTrainedForestNumberTrees",
    "description": "The total number of trees in the trained forest",
    "value": 52
}
```

tveasey commented 3 years ago

> The memory requirement is estimated at about 36 MB

This is in the regime where we correct to account for pessimism. Referring to CBoostedTreeImpl::correctedMemoryUsage, I get that we should have estimated that the job could use up to 20 + (1024 - 20) * (36 - 20) / (179 - 20) = 121 MB. So don't we simply have a case where our correction is too large? It may be worth tweaking the message on a failure to report the worst-case memory estimate, with the caveat that this is likely to be an overestimate.
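To spell the arithmetic out (a rough sketch; the knot values 20, 179, and 1024 MB are just the ones in the formula above, not necessarily the exact constants in the code):

```cpp
// If corrected estimates in [20 MB, 179 MB] correspond linearly to raw
// estimates in [20 MB, 1024 MB], inverting the correction recovers the
// worst case a corrected estimate could stand for.
double worstCaseMemoryUsageMb(double correctedMb) {
    return 20.0 + (1024.0 - 20.0) * (correctedMb - 20.0) / (179.0 - 20.0);
}
// worstCaseMemoryUsageMb(36.0) ≈ 121 MB, the figure quoted above.
```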

(It is probably worth keeping the correction for alignment on the data frame memory usage we discussed offline as well.)

valeriy42 commented 3 years ago

We have the function

```cpp
std::size_t CDataFrameAnalysisRunner::estimateMemoryUsage(std::size_t totalNumberRows,
                                                          std::size_t partitionNumberRows,
                                                          std::size_t numberColumns) const {
    // Alignment-aware data frame estimate: this is the part the alignment
    // change touches, and it accounts for only ~1 MB of the 36 MB here.
    return core::CDataFrame::estimateMemoryUsage(
               this->storeDataFrameInMainMemory(), totalNumberRows,
               numberColumns + this->numberExtraColumns(), core::CAlignment::E_Aligned16) +
           // Bookkeeping estimate: this is where the pessimism correction
           // brings 121 MB down to ~35 MB.
           this->estimateBookkeepingMemoryUsage(m_NumberPartitions, totalNumberRows,
                                                partitionNumberRows, numberColumns);
}
```

The change affects the part with this->numberExtraColumns(), which is responsible for only about 1 MB out of the 36. The correction is done within this->estimateBookkeepingMemoryUsage(), where we correct down from 121 MB to 35 MB.
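For reference, running the interpolation from the previous comment in the forward direction with the same knot values gives 20 + (179 - 20) * (121 - 20) / (1024 - 20) ≈ 36 MB, consistent with the figure above.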

valeriy42 commented 3 years ago

Since #1156 was fixing a pre-existing memory underestimation, it didn't introduce any new bugs. I verified that with the newly suggested memory limit of 178 MB the job finishes without any problems. Hence, I am closing this issue.