Closed: valeriy42 closed this issue 3 years ago
@tveasey I traced the problem back to PR #1156. My best guess is that `CBoostedTreeLeafNodeStatistics::estimateMemoryUsage` should be alignment-aware, similar to how alignment is handled in `CDataFrame::estimateMemoryUsage`. WDYT?
Hi @tveasey. I looked into the memory underestimation issue and I couldn't figure it out. The change you suggested offline in `CBoostedTreeImpl.h` didn't make a difference. I dug a little deeper: the total estimated memory for the data frame on the kibana_sample_data_flights dataset is about 1 MB, with or without alignment, while the overall memory requirement is estimated at about 36 MB. For the job below:
```json
{
  "description": "",
  "source": {
    "index": "kibana_sample_data_flights",
    "query": {
      "match_all": {}
    }
  },
  "dest": {
    "index": ""
  },
  "analyzed_fields": {
    "includes": [
      "AvgTicketPrice",
      "Cancelled",
      "Carrier",
      "Dest",
      "DestAirportID",
      "DestCityName",
      "DestCountry",
      "DestRegion",
      "DestWeather",
      "DistanceKilometers",
      "DistanceMiles",
      "FlightDelay",
      "FlightDelayMin",
      "FlightDelayType",
      "FlightNum",
      "FlightTimeHour",
      "FlightTimeMin",
      "Origin",
      "OriginAirportID",
      "OriginCityName",
      "OriginCountry",
      "OriginRegion",
      "OriginWeather",
      "dayOfWeek"
    ]
  },
  "analysis": {
    "regression": {
      "dependent_variable": "AvgTicketPrice",
      "num_top_feature_importance_values": 0,
      "training_percent": 90
    }
  },
  "model_memory_limit": "36mb",
  "max_num_threads": 1
}
```
The final stats are:
```json
{
  "name": "E_DFTPMEstimatedPeakMemoryUsage",
  "description": "The upfront estimate of the peak memory training the predictive model would use",
  "value": 40197269
},
{
  "name": "E_DFTPMPeakMemoryUsage",
  "description": "The peak memory training the predictive model used",
  "value": 92063248
},
{
  "name": "E_DFTPMTimeToTrain",
  "description": "The time it took to train the predictive model",
  "value": 227777
},
{
  "name": "E_DFTPMTrainedForestNumberTrees",
  "description": "The total number of trees in the trained forest",
  "value": 52
}
```
> The memory requirement is estimated at about 36 MB
This is in the regime where we correct to account for pessimism. Referring to `CBoostedTreeImpl::correctedMemoryUsage`, I get that we should have estimated that the job could use up to 20 + (1024 - 20) * (36 - 20) / (179 - 20) ≈ 121 MB. So don't we simply have a case where our correction is too large? It may be worth tweaking the failure message to report the worst-case memory estimate, with the caveat that it is likely an overestimate.
(It probably is worth keeping the correction for alignment on the data frame memory usage we discussed offline as well.)
We have the function:

```cpp
std::size_t CDataFrameAnalysisRunner::estimateMemoryUsage(std::size_t totalNumberRows,
                                                          std::size_t partitionNumberRows,
                                                          std::size_t numberColumns) const {
    return core::CDataFrame::estimateMemoryUsage(
               this->storeDataFrameInMainMemory(), totalNumberRows,
               numberColumns + this->numberExtraColumns(), core::CAlignment::E_Aligned16) +
           this->estimateBookkeepingMemoryUsage(m_NumberPartitions, totalNumberRows,
                                                partitionNumberRows, numberColumns);
}
```
The suggested change affects the part with `this->numberExtraColumns()`, which is responsible for only 1 MB out of the 36. The correction happens within `this->estimateBookkeepingMemoryUsage()`, where the estimate is corrected down from 121 MB to 35 MB.
Since #1156 fixed an earlier memory underestimation, it didn't introduce any new bugs. I verified that with the newly suggested memory limit of 178 MB the job finishes without any problems. Hence, I am closing this issue.
When running a regression on the flight sample data ~~with feature importance calculation activated~~, the job fails in the feature selection phase (before the coarse parameter search) due to exceeding the memory limit. ~~This error occurs neither for regression without feature importance, nor for classification with feature importance calculated.~~ ~~This is a bug, since feature importance should not affect memory requirements in the feature selection phase (or any phase, actually).~~

EDIT: the failure also occurs with feature importance deactivated. Therefore, I removed the mention of feature importance from the description.