I also encountered this problem. Four months ago, the DP random forest seemed to work correctly.
I'm also encountering these issues. I also noticed that memory usage explodes compared to scikit-learn forests, far beyond the increase in memory use seen when going from scikit-learn's LogisticRegression to diffprivlib's LogisticRegression, for example.
We are in the process of re-engineering the Random Forest classifier in diffprivlib, and have looked at your issue again.
The DP implementation of random forest in diffprivlib achieves DP by first building each tree at random, using random splits on random features at each step, without looking at the data. Each tree is then fit to the data, and the classification at each leaf is calculated using DP. As a result, comparing it with the vanilla RandomForestClassifier in sklearn is not a like-for-like comparison.
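To make that description concrete, here is a toy, single-tree sketch of the idea: the tree structure and split thresholds are drawn at random from fixed feature bounds, and only the leaf label counts touch the data, released with Laplace noise. This is an illustration only, not diffprivlib's actual code; the Laplace mechanism, the `bounds` argument and the helper names are assumptions made for the example.

```python
import numpy as np

def build_random_tree(bounds, depth, rng):
    """Build the tree structure at random, without looking at the data.
    `bounds` is a list of (low, high) ranges, one per feature, assumed known a priori."""
    if depth == 0:
        return {"leaf": True, "counts": None}
    feature = int(rng.integers(len(bounds)))
    low, high = bounds[feature]
    return {
        "leaf": False,
        "feature": feature,
        "threshold": rng.uniform(low, high),  # data-independent split point
        "left": build_random_tree(bounds, depth - 1, rng),
        "right": build_random_tree(bounds, depth - 1, rng),
    }

def find_leaf(node, x):
    """Route a single sample to its leaf."""
    while not node["leaf"]:
        node = node["left"] if x[node["feature"]] <= node["threshold"] else node["right"]
    return node

def fit_leaves(tree, X, y, n_classes, epsilon, rng):
    """Fit the pre-built tree: count labels per leaf, then add Laplace noise
    to the counts so the majority class at each leaf is released under DP."""
    for xi, yi in zip(X, y):
        leaf = find_leaf(tree, xi)
        if leaf["counts"] is None:
            leaf["counts"] = np.zeros(n_classes)
        leaf["counts"][int(yi)] += 1

    def add_noise(node):
        if node["leaf"]:
            counts = node["counts"] if node["counts"] is not None else np.zeros(n_classes)
            node["counts"] = counts + rng.laplace(scale=1.0 / epsilon, size=n_classes)
        else:
            add_noise(node["left"])
            add_noise(node["right"])

    add_noise(tree)

def predict(tree, X):
    """Predict by taking the argmax of the noisy counts at each sample's leaf."""
    return np.array([int(np.argmax(find_leaf(tree, xi)["counts"])) for xi in X])

# Example usage (hypothetical data with features scaled to [0, 1]):
# rng = np.random.default_rng(0)
# tree = build_random_tree(bounds=[(0, 1)] * X_train.shape[1], depth=5, rng=rng)
# fit_leaves(tree, X_train, y_train, n_classes=2, epsilon=1.0, rng=rng)
# y_pred = predict(tree, X_test)
```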
It more closely resembles sklearn's ExtraTreesClassifier, when parametrised appropriately. For example, if we have a DP random forest parametrised as follows in diffprivlib:
dpl.models.RandomForestClassifier(n_estimators=1000, max_depth=5, epsilon=float("inf"), n_jobs=-1)
we can compare it to the sklearn equivalent as follows:
ExtraTreesClassifier(n_estimators=1000, n_jobs=-1, max_features=1, max_depth=5)
However, this is still not an exact like-for-like comparison, as ExtraTreesClassifier is still trained while looking at the data (for example, it still won't perform a split that would result in an empty node).
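To make the comparison above easy to reproduce, here is a minimal sketch running the two parametrisations side by side; the synthetic dataset and the train/test split are stand-ins, not the data from the issue:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import train_test_split
from diffprivlib import models as dp_models

# Synthetic stand-in data; any classification dataset would do.
X, y = make_classification(n_samples=10_000, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# diffprivlib DP random forest with privacy effectively switched off.
dp_rf = dp_models.RandomForestClassifier(
    n_estimators=1000, max_depth=5, epsilon=float("inf"), n_jobs=-1
)
dp_rf.fit(X_train, y_train)

# Closest sklearn analogue: one random feature and a random split at every node.
et = ExtraTreesClassifier(n_estimators=1000, n_jobs=-1, max_features=1, max_depth=5)
et.fit(X_train, y_train)

print("diffprivlib (eps=inf):", dp_rf.score(X_test, y_test))
print("ExtraTreesClassifier :", et.score(X_test, y_test))
```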
Nevertheless, I ran some experiments on your data using n_estimators=100 and max_depth=6, and got the following results (mean accuracy over 10 runs):
From a performance perspective, there is also a slight penalty because sklearn uses Cython to build its trees, which is faster than pure Python. However, the difference is no longer as extreme as you reported. These are the training times using the same parameters as above:
Describe the bug
I encountered two issues while using the diffprivlib random forest classifier:
1) I want to compare the influence of epsilon on the performance of a DP random forest classifier against a non-DP random forest classifier. For the non-DP random forest I use scikit-learn's RandomForestClassifier. The issue is that even for extremely high epsilons (100 to 10,000), the diffprivlib random forest does not approach the scikit-learn random forest's performance.
2) The diffprivlib random forest takes very long to train, even when setting n_jobs=-1. The scikit-learn random forest trains in about 13-15 s, whereas the diffprivlib random forest trains for about 25-30 min. Shouldn't both trainings take about the same time?
To Reproduce
Please find my code below to reproduce the described behaviour. I use the Kaggle dataset from here: https://www.kaggle.com/datasets/sulianova/cardiovascular-disease-dataset
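For reference, a minimal sketch of the kind of comparison described (not the original script; the CSV file name, preprocessing and hyperparameters here are assumptions) might look like this:

```python
import time

import pandas as pd
from diffprivlib import models as dp_models
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# File name and separator assumed from the Kaggle cardiovascular-disease dataset.
df = pd.read_csv("cardio_train.csv", sep=";")
X = df.drop(columns=["id", "cardio"]).values
y = df["cardio"].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Non-DP baseline.
start = time.time()
rf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
print(f"sklearn: acc={rf.score(X_test, y_test):.3f}, time={time.time() - start:.1f}s")

# DP forests at very high epsilons, which were expected to approach the baseline.
for epsilon in (100, 1_000, 10_000):
    start = time.time()
    dp_rf = dp_models.RandomForestClassifier(n_estimators=100, epsilon=epsilon, n_jobs=-1)
    dp_rf.fit(X_train, y_train)
    print(f"diffprivlib (eps={epsilon}): acc={dp_rf.score(X_test, y_test):.3f}, "
          f"time={time.time() - start:.1f}s")
```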
Expected behavior
The expected behaviour would be for the DP model to approximate the non-DP model's performance at high epsilons.
Screenshots
The models' training times:
The models' performances:
System information: