IBM / differential-privacy-library

Diffprivlib: The IBM Differential Privacy Library
https://diffprivlib.readthedocs.io
MIT License

DP Random Forest Performance Issues #64

Closed jan0b closed 2 years ago

jan0b commented 2 years ago

Describe the bug

I encountered two issues while using the diffprivlib random forest classifier:

1) I want to compare the influence of epsilon on the performance of a DP random forest classifier against a non-DP random forest classifier. For the non-DP random forest I use scikit-learn's RandomForestClassifier. The issue is that even for extremely high epsilons (100 to 10,000) the diffprivlib random forest does not come close to the scikit-learn random forest's performance.

2) The diffprivlib random forest takes very long to train, even when setting 'n_jobs=-1'. The scikit-learn random forest trains in about 13-15 s, whereas the diffprivlib random forest takes about 25-30 min. Shouldn't both trainings take about the same time?

To Reproduce

Please find my code below to reproduce the described behaviour. I use the Kaggle dataset from here: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset

# Imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import diffprivlib as dpl

from sklearn import ensemble
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing

# Data handling
dataset = pd.read_csv('../cardio_train.csv', sep=';')
dataset.head()

X_raw = dataset.drop(['id', 'cardio'], axis=1)
X_raw_cat = X_raw.drop(['age', 'height', 'weight', 'ap_hi', 'ap_lo'], axis=1)
X_raw_cont = X_raw[['age', 'height', 'weight', 'ap_hi', 'ap_lo']]
y_raw = dataset['cardio']

normalize = preprocessing.StandardScaler()
ordinal = preprocessing.OrdinalEncoder()

X_cont = normalize.fit_transform(X_raw_cont)
X_cat = ordinal.fit_transform(X_raw_cat)

X = np.concatenate([X_cont, X_cat], axis=1)
y = y_raw

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Non-DP RF
model = ensemble.RandomForestClassifier(n_estimators=1000, oob_score=True, n_jobs=-1, random_state=1302)
model.fit(X_train, y_train)

y_train_pred = model.predict(X_train)
y_train_pred_oob = np.argmax(model.oob_decision_function_, axis=1)
y_test_pred = model.predict(X_test)

# DP RF
epsilon = 10000.0
model_dp = dpl.models.RandomForestClassifier(n_estimators=1000, epsilon=epsilon, n_jobs=-1).fit(X_train, y_train)

y_dp_train_pred = model_dp.predict(X_train)
y_dp_test_pred = model_dp.predict(X_test)

# Comparison
print(
metrics.accuracy_score(y_train, y_train_pred),
model.oob_score_,
metrics.accuracy_score(y_test, y_test_pred),
metrics.accuracy_score(y_train, y_dp_train_pred),
metrics.accuracy_score(y_test, y_dp_test_pred))

Expected behavior

The expected behaviour would be for the DP model to approximate the non-DP model's performance at high epsilons.

Screenshots

The models' training times:

[Screenshot 2022-05-05 16:25:33]

The models' performances:

[Screenshot 2022-05-05 16:26:28]

System information:

Zeyu-Shen commented 2 years ago

I also encountered this problem. Four months ago, the DP random forest seemed to work correctly.

roijalbaker commented 2 years ago

I also encounter these issues. I have also noticed that memory usage explodes compared to scikit-learn forests; the increase is far larger than the one you see going from scikit-learn to the diffprivlib LogisticRegression, for example.

naoise-h commented 2 years ago

We are in the process of re-engineering the Random Forest classifier in diffprivlib, and have looked at your issue again.

The DP implementation of random forest that we use in diffprivlib achieves DP by first building each tree at random, using random splits on random features at each step, without looking at the data. Each tree is then fit to the data, and the classification at each leaf is calculated using DP. As a result, comparing it with the vanilla RandomForestClassifier in sklearn is not a like-for-like comparison.
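To illustrate the second step, here is a minimal sketch, in the spirit of the exponential mechanism (not diffprivlib's actual code), of how a leaf's class label can be selected under DP from the per-class counts of the training samples that reach it:

import numpy as np

def dp_leaf_label(class_counts, epsilon, rng=None):
    # Exponential mechanism: the utility of a class is its count at
    # the leaf. Adding or removing one record changes a count by at
    # most 1, so the sensitivity is 1 and the scores use epsilon / 2.
    rng = np.random.default_rng() if rng is None else rng
    counts = np.asarray(class_counts, dtype=float)
    # Subtract the max before exponentiating for numerical stability
    scores = np.exp(epsilon / 2 * (counts - counts.max()))
    return rng.choice(len(counts), p=scores / scores.sum())

# Large epsilon: almost always returns the majority class (index 0);
# small epsilon: the label becomes close to uniformly random.
print(dp_leaf_label([40, 3], epsilon=10.0))
print(dp_leaf_label([40, 3], epsilon=0.01))

This is why accuracy degrades as epsilon shrinks: the noise affects which label each leaf reports, while the tree structure itself is data-independent.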

It more closely resembles sklearn's ExtraTreesClassifier, when parametrised appropriately. For example, if we have a DP random forest parametrised as follows in diffprivlib:

dpl.models.RandomForestClassifier(n_estimators=1000, max_depth=5, epsilon=float("inf"), n_jobs=-1)

we can compare it to the sklearn equivalent as follows:

ExtraTreesClassifier(n_estimators=1000, n_jobs=-1, max_features=1, max_depth=5)

However, this is still not an exact like-for-like comparison, as ExtraTreesClassifier is still trained while looking at the data (for example, it still won't perform a split that will result in an empty node).
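For concreteness, a side-by-side run along these lines (a sketch reusing X_train, y_train, X_test, y_test from the reproduction code above) would look like:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import accuracy_score
import diffprivlib as dpl

# Closest sklearn analogue: random splits on a single random feature
model_et = ExtraTreesClassifier(n_estimators=1000, n_jobs=-1,
                                max_features=1, max_depth=5)
model_et.fit(X_train, y_train)

# diffprivlib counterpart with the privacy noise effectively disabled;
# diffprivlib may warn that data bounds are being inferred from the
# data, which in a real deployment would itself leak privacy
model_dp = dpl.models.RandomForestClassifier(
    n_estimators=1000, max_depth=5, epsilon=float("inf"), n_jobs=-1)
model_dp.fit(X_train, y_train)

print("ExtraTreesClassifier:", accuracy_score(y_test, model_et.predict(X_test)))
print("dpl.RandomForestClassifier:", accuracy_score(y_test, model_dp.predict(X_test)))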

Nevertheless, I ran some experiments on your data using n_estimators=100 and max_depth=6 and got the following results (mean accuracy over 10 runs):

  1. ExtraTreesClassifier: 63.0%
  2. dpl.RandomForestClassifier(epsilon=float("inf")): 60.4%

From a performance perspective, there is also a slight penalty because scikit-learn uses Cython to build its trees, which is faster than pure Python. However, the gap is no longer as extreme as you reported. These are the training times using the same parameters as above:

  1. ExtraTreesClassifier: 180ms
  2. dpl.RandomForestClassifier: 881ms
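For anyone who wants to reproduce this kind of timing comparison, a minimal sketch (assuming model_et and model_dp from the comparison sketch above):

import time

for name, clf in [("ExtraTreesClassifier", model_et),
                  ("dpl.RandomForestClassifier", model_dp)]:
    start = time.perf_counter()
    clf.fit(X_train, y_train)  # refit so only training time is measured
    print(f"{name}: {time.perf_counter() - start:.3f}s")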