Closed kayakalison closed 1 month ago
Hi there,
The algorithm used for DP random forest is sufficiently different from non-private random forest that the accuracy will not converge to the non-private accuracy as epsilon approaches infinity. The closest you can get to the DP algorithm in sklearn is `ExtraTreesClassifier` (from `sklearn.ensemble`) with `bootstrap=True`. This is a more randomised version of random forest (the splitting is more randomised), but it is still more data-dependent than our DP implementation, so its performance will be better than our algorithm even with `epsilon=infinity`. Because of the small number of examples in the dataset (569) and the comparatively large number of features (30), the performance of the DP model will be variable.

You can improve the performance of the DP algorithm by varying the `max_depth` parameter. For this dataset, `max_depth=3` seems to give the best performance.
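As a rough sketch of that comparison (using sklearn's breast cancer dataset, which matches the 569-example, 30-feature shape mentioned above; the exact split and seed here are arbitrary choices for illustration):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Breast cancer dataset: 569 examples, 30 features
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Standard random forest: fully data-dependent split selection
rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Extra-Trees with bootstrap=True: more randomised splits, the closest
# sklearn analogue to the DP algorithm (though still more data-dependent
# than the DP implementation)
et = ExtraTreesClassifier(n_estimators=100, bootstrap=True,
                          max_depth=3, random_state=0)
et.fit(X_train, y_train)

print("RF accuracy:        ", rf.score(X_test, y_test))
print("ExtraTrees accuracy:", et.score(X_test, y_test))
```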
Is it also true that the diffprivlib DT should not be used independently of the diffprivlib RF? I found a note to that effect in the code, so I assume so, but I'd like to double-check, as that result also seems off.
Thanks so much for the feedback, and also for the really cool toolset! I'm writing my master's thesis on the impact of imbalanced data on DP using your library, and I'm really liking it. :-)
Although you are able to use diffprivlib's DTs on their own, you will likely find their accuracy to be poor. The main strength of this type of DT comes when they are ensembled together, as in a random forest.
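To illustrate the single-tree-vs-ensemble effect with sklearn's extremely randomised trees (a rough analogue only, not diffprivlib's DP trees):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

# One extremely randomised tree vs. an ensemble of 100 of them
single = ExtraTreesClassifier(n_estimators=1, max_depth=3, random_state=0)
forest = ExtraTreesClassifier(n_estimators=100, max_depth=3, random_state=0)

# A single highly randomised tree is typically much weaker than the ensemble
print("single tree CV accuracy:", cross_val_score(single, X, y, cv=5).mean())
print("forest CV accuracy:     ", cross_val_score(forest, X, y, cv=5).mean())
```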
Thanks for using diffprivlib and for the feedback! I'm glad it's of use for your thesis :)
Describe the bug
The DP-RF classification is dramatically worse than the non-DP version (sklearn), even with epsilon=np.inf. Apologies if I'm misunderstanding how this should work or missing something in my code, but I have recreated the issue with a standard dataset to explore it more deeply. Is this a bug, or is my code at fault somehow? Thanks so much for your time!
To Reproduce
The following is my python code:
Expected behavior
I expect the DP-RF classifier's MCC to converge on the non-DP version's MCC as epsilon nears infinity. This works properly for the Gaussian NB classifier, but something seems to be off for RF: instead of being in the 0.8-0.9 range, it seems to level off around 0.66-0.67. Or have I done something wrong?