IBM / differential-privacy-library

Diffprivlib: The IBM Differential Privacy Library
https://diffprivlib.readthedocs.io
MIT License

DP-RF classification at epsilon=np.inf is dramatically worse than non-DP version (sklearn) #93

Closed kayakalison closed 1 month ago

kayakalison commented 3 months ago

Describe the bug
The DP-RF classifier performs dramatically worse than the non-DP version (sklearn), even with epsilon=np.inf. Apologies if I'm misunderstanding how this should work or if I'm missing something in my code, but I have recreated the issue with a standard dataset to explore it further. Is this a bug, or is my code at fault somehow? Thanks so much for your time!

To Reproduce
The following is my Python code:

# Load the required packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from sklearn.metrics import matthews_corrcoef
from sklearn.ensemble import RandomForestClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split
import diffprivlib as dp

# Load breast cancer dataset into a dataframe
dataset = datasets.load_breast_cancer()
data = pd.DataFrame(data=dataset.data, columns=dataset.feature_names)
data['target'] = dataset.target

# Calculate the bounds to prevent a privacy warning with DP-RF
min_values = data.min().tolist()[:-1]
max_values = data.max().tolist()[:-1]
bounds = (min_values, max_values)

# Define the classes to prevent a privacy warning with DP-RF
classes = np.array([0, 1])

# Split into test and training sets
X = data.drop(['target'], axis=1)
y = data['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Calculate the MCC for the non-DP version of RF
rf = RandomForestClassifier(random_state=42)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
rf_MCC = matthews_corrcoef(y_test, y_pred_rf)
print(confusion_matrix(y_test, y_pred_rf))
print("MCC: ", rf_MCC)

# Build a Differentially Private Random Forest Classifier
epsilon = np.inf  # Infinite epsilon, i.e. no limit on the privacy budget

DPrf = dp.models.RandomForestClassifier(epsilon=epsilon, random_state=42, bounds=bounds, classes=classes)
DPrf.fit(X_train, y_train)
y_pred_DPrf = DPrf.predict(X_test)
y_pred_DPrf = y_pred_DPrf.astype(int)
DPinf_MCC = matthews_corrcoef(y_test, y_pred_DPrf)

# Output metrics
print(confusion_matrix(y_test, y_pred_DPrf))
print("MCC: ", DPinf_MCC)

# Repeat for a range of epsilon values and plot
# Define the range of finite epsilon values to sweep (0.01 to 100, log-spaced)
epsilons = np.logspace(-2, 2, 50)
num_runs = 30

# Initialize a DataFrame to store MCC values for each epsilon and each run
column_headers = [f'run {i+1}' for i in range(num_runs)]
DPrf_mcc = pd.DataFrame(index=epsilons, columns=column_headers, dtype=float)  # float dtype so .mean() works reliably
DPrf_mcc.index.name = 'epsilon'

for run in range(num_runs):
    print(run) # for tracking
    for epsilon in epsilons:
        DPrf = dp.models.RandomForestClassifier(epsilon=epsilon, random_state=(42 + run), classes=classes, bounds=bounds)
        DPrf.fit(X_train, y_train)
        y_pred_DPrf = DPrf.predict(X_test)
        DPrf_mcc.at[epsilon, column_headers[run]] = matthews_corrcoef(y_test, y_pred_DPrf)

# Compute the average MCC across all runs
avg_DPrf_mcc = DPrf_mcc.mean(axis=1)

# Plot the results
plt.semilogx(epsilons, avg_DPrf_mcc, color='#001E82', label="Differentially Private")
plt.axhline(y=rf_MCC, color='#00C2DE', linestyle='--', label="Non-Private")
plt.title("Random Forest")
plt.xlabel("Privacy Risk (ε)")
plt.ylabel("MCC*")
plt.legend(loc="lower right")
plt.grid(which='major', color='#DDDDDD', linewidth=0.8)
plt.figtext(0.1, -0.05, '* The average MCC over 30 simulations for each value of ε.', horizontalalignment='left', wrap=True)

# Save and show the plot
#plt.savefig('Figures/RF-graph-0.png', dpi=400, bbox_inches='tight')
plt.show()

Expected behavior
I expect the DP-RF classifier's MCC to converge to the non-DP version's MCC as epsilon approaches infinity. This works properly for the Gaussian NB classifier, but something seems to be off for RF. Instead of being in the 0.8-0.9 range, the MCC seems to level off around 0.66-0.67. Or have I done something wrong?

Screenshots
[Image: RF-graph-0]


naoise-h commented 3 months ago

Hi there,

The algorithm used for the DP random forest is sufficiently different from the non-private random forest that its accuracy will not converge to the non-private accuracy as epsilon approaches infinity. The closest you can get to the DP algorithm in sklearn is ExtraTreesClassifier (from sklearn.ensemble) with bootstrap=True. This is a more randomised version of Random Forest (the splitting is more randomised), but it is still more data-dependent than our DP implementation, so its performance will be better than that of our algorithm with epsilon=infinity. Because of the small number of examples in the dataset (569) and the comparatively large number of features (30), the performance of the DP model will be variable.
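
As a rough illustration of the sklearn comparison described above, the sketch below fits RandomForestClassifier and ExtraTreesClassifier (with bootstrap=True) on the same split of the breast cancer dataset and prints the MCC of each. The fixed random_state=42 and the 80/20 split are assumptions made for reproducibility; they are not taken from the original report.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

# Same dataset as in the report; the seed is an assumption for reproducibility
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Two non-private baselines: standard RF and the more randomised extra-trees
baselines = {
    "RandomForestClassifier": RandomForestClassifier(random_state=42),
    "ExtraTreesClassifier (bootstrap=True)": ExtraTreesClassifier(
        bootstrap=True, random_state=42
    ),
}

baseline_mcc = {}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    baseline_mcc[name] = matthews_corrcoef(y_test, model.predict(X_test))
    print(f"{name}: MCC = {baseline_mcc[name]:.3f}")
```

The extra-trees score is probably the more meaningful non-private reference line for the DP forest, although per the explanation above it should still be expected to sit above the DP result.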

You can improve the performance of the DP algorithm by varying the max_depth parameter. For this dataset, it seems max_depth=3 gives the best performance.

kayakalison commented 3 months ago

Is it also true that the diffprivlib decision tree (DT) should not be used independently of the diffprivlib RF? I found a note to that effect in the code, so I assume so, but I would like to double-check because its performance also seems off.

Thanks so much for the feedback, and also for the really cool toolset! I'm writing my masters thesis on the impact of imbalanced data on DP using your library and I'm really liking it. :-)

naoise-h commented 2 months ago

Although you can use diffprivlib's DTs on their own, you will likely find their accuracy to be poor. The main strength of this type of DT comes when many of them are ensembled together, as in a random forest.

Thanks for using diffprivlib and for the feedback! I'm glad it's of use for your thesis :)