Open RostislavStoyanov opened 4 weeks ago
Thank you for raising the issue.
I just did a simple test. I think different models and imprecise measurement caused the change in observed output. I generated a sample model using the latest XGBoost and ran the SHAP prediction using both the latest and the 1.7 branches. The results from the two runs are consistent, with peak memory around 4.8-4.9 GB.
Following is the Nsight Systems screenshot:
And with the 1.7 branch:
Peak memory usage might not be captured by running nvidia-smi periodically. As shown in the screenshot, the memory usage comes back down after a spike; one needs to capture that spike to measure the actual peak correctly.
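For reference, a crude way to approximate this from the command line is to poll nvidia-smi in a tight loop and keep the maximum reading; even then, short-lived allocation spikes can fall between samples, which is why a tracing tool such as Nsight Systems is more reliable. A minimal polling sketch (a hypothetical helper, not part of XGBoost or this thread) could look like:

# Hypothetical helper: sample nvidia-smi repeatedly and keep the peak reading.
# Each nvidia-smi call itself takes tens of milliseconds, so the effective
# sampling rate is far below 10 kHz, and a brief spike can still land between
# two samples.
import subprocess
import time


def poll_peak_gpu_memory(duration_s: float = 30.0) -> int:
    peak_mib = 0
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout
        used = [int(line) for line in out.splitlines() if line.strip()]
        peak_mib = max(peak_mib, max(used))
    return peak_mib


if __name__ == "__main__":
    print(f"Peak GPU memory observed: {poll_peak_gpu_memory()} MiB")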
Thank you for the answer. I will keep this in mind for any future cases. I will be closing this issue. Once again, sorry for wasting your time and thank you.
Hi again @trivialfis,
I've rerun the tests based on your feedback. I agree that there is no degradation between versions 1.7.6 and 2.1.2 of the library; however, I still see such a degradation between 1.4.2 and later versions.
What I have done is to use one script that trains and saves a model, and a second, profiled script that simply loads the saved model and calculates SHAP values. Here are the results, with both scripts provided below:
- 1.4.2 peak: (screenshot)
- 1.4.2 sustained: (screenshot)
- 1.7.6 peak: (screenshot)
- 1.7.6 sustained: (screenshot)
- 2.1.2 peak: (screenshot)
- 2.1.2 sustained: (screenshot)
And here are the scripts.
Training:
from typing import Tuple

import pandas as pd
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, __version__ as xgb_version


def download_data() -> Tuple[pd.DataFrame, pd.DataFrame]:
    # fetch dataset
    diabetes_binary = fetch_ucirepo(id=891)
    # data (as pandas dataframes)
    X = diabetes_binary.data.features
    y = diabetes_binary.data.targets
    return X, y


def prep_dataset(X: pd.DataFrame, y: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    # split dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    return X_train, X_test, y_train, y_test


def train_save_model(X_train: pd.DataFrame, y_train: pd.DataFrame) -> XGBClassifier:
    # train a model
    xgb_params = {
        "objective": "binary:logistic",
        "n_estimators": 2000,
        "max_depth": 13,
        "learning_rate": 0.1,
        "tree_method": "gpu_hist",
    }
    model = XGBClassifier(**xgb_params)
    model.fit(X_train, y_train["Diabetes_binary"])
    model.save_model("xgb_model.json")
    return model


if __name__ == '__main__':
    if xgb_version != '1.4.2':
        print("Training only on 1.4.2.")
        exit(1)

    X, y = download_data()
    X_train, X_test, y_train, y_test = prep_dataset(X, y)
    model = train_save_model(X_train, y_train)
SHAP calc:
from typing import Tuple

import pandas as pd
from ucimlrepo import fetch_ucirepo
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier, DMatrix, __version__ as xgb_version


def download_data() -> Tuple[pd.DataFrame, pd.DataFrame]:
    # fetch dataset
    diabetes_binary = fetch_ucirepo(id=891)
    # data (as pandas dataframes)
    X = diabetes_binary.data.features
    y = diabetes_binary.data.targets
    return X, y


def prep_dataset(X: pd.DataFrame, y: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    # split dataset into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    return X_train, X_test, y_train, y_test


def load_model(file_path: str) -> XGBClassifier:
    model = XGBClassifier()
    model.load_model(file_path)
    return model


def call_shap_values(model: XGBClassifier, test_data: pd.DataFrame) -> pd.DataFrame:
    booster = model.get_booster()
    booster.set_param({"predictor": "gpu_predictor"})
    dmatrix = DMatrix(test_data)
    shap_values = booster.predict(dmatrix, pred_contribs=True)
    shap_values_df = pd.DataFrame(shap_values[:, :-1], columns=test_data.columns)
    shap_values_df["base_value"] = shap_values[:, -1]
    shap_values_df.to_csv(f"shap_values_{xgb_version}.csv", index=False)
    return shap_values_df


if __name__ == '__main__':
    X, y = download_data()
    X_train, X_test, y_train, y_test = prep_dataset(X, y)
    model = load_model("./xgb_model.json")
    calc_save_shap_vals = call_shap_values(model, X_test)
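As a quick sanity check on the pred_contribs output (not part of the original scripts), the per-row contributions plus the base value should sum to the raw, untransformed margin prediction. A small helper, assuming the same model and test data as above, might be:

# Optional consistency check: SHAP contributions + base value should equal the
# raw margin prediction for each row. Loose tolerances account for float32
# accumulation over many trees.
import numpy as np
import pandas as pd
from xgboost import DMatrix, XGBClassifier


def check_contribs_sum_to_margin(model: XGBClassifier, test_data: pd.DataFrame) -> None:
    booster = model.get_booster()
    dmatrix = DMatrix(test_data)
    contribs = booster.predict(dmatrix, pred_contribs=True)  # shape: (n_rows, n_features + 1)
    margin = booster.predict(dmatrix, output_margin=True)    # raw scores before the sigmoid
    np.testing.assert_allclose(contribs.sum(axis=1), margin, rtol=1e-3, atol=1e-3)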
As you can probably see from the screenshots, these tests were run on a Windows 10 machine, as I had some trouble running Nsight Systems on the remote instance I previously used.
I've found that this problem appears as early as version 1.5.0. Looking at the release notes, there is the following sentence: "Most of the other features, including prediction, SHAP value computation, feature importance, and model plotting were revised to natively handle categorical splits." Could this be the origin of the issue?
In any case, please let me know if there is an issue with the testing methodology; I think it is more precise this time.
As a side note, I have another question -- isn't using Nsight Systems equivalent (for the purposes of memory usage measurement) to calling nvidia-smi with a high enough frequency (like 10 kHz) and logging the results?
Thank you for sharing the info and reminding me of the categorical feature support. Yes, I can confirm the memory usage increase, and it is indeed caused by categorical support. Specifically, this member variable: https://github.com/dmlc/xgboost/blob/197c0ae7ef0cb045107f1c9f70eeaf6c060b9dca/src/predictor/gpu_predictor.cu#L427 It is used in the SHAP trace path; when the tree is deep, this causes a non-trivial amount of memory usage. We might want to make this optional.
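One possible stop-gap on the caller's side, not something proposed in this thread, is to compute pred_contribs over row chunks so that any row-dependent working memory is bounded by the chunk size; whether this also reduces the tree-structure-dependent buffer linked above is an assumption that would need to be verified with a profiler. A sketch:

# Hypothetical workaround sketch: run pred_contribs over row chunks instead of
# the whole test set at once. This only bounds allocations that scale with the
# number of rows per call; it may not affect the buffer discussed above.
import numpy as np
import pandas as pd
from xgboost import Booster, DMatrix


def shap_in_chunks(booster: Booster, data: pd.DataFrame, rows_per_chunk: int = 10_000) -> np.ndarray:
    parts = []
    for start in range(0, len(data), rows_per_chunk):
        chunk = DMatrix(data.iloc[start:start + rows_per_chunk])
        parts.append(booster.predict(chunk, pred_contribs=True))
    return np.vstack(parts)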
I have another question -- isn't using Nsight Systems equivalent (for the purposes of memory usage measurement) to calling nvidia-smi with a high enough frequency (like 10 kHz) and logging the results?
Probably; the underlying mechanism of the event sampling is beyond my knowledge.
I've noticed that using pred_contribs to generate SHAP values takes significantly more GPU memory in XGBoost 2.1.1 than in 1.4.2. This can lead to issues with generating SHAP values where no issue was previously present.
GPU memory comparison:
- 1.4.2: 3090
- 1.7.6: 4214
- 2.1.1: 5366
Short example used to demonstrate:
with the following bash script used for generating memory usage:
All tests were run on Ubuntu 20.04.6 LTS. Requirements, with only the xgboost version (and the device/tree method parameters) being changed between tests: